Workload Sharding
In v0.8.0, we added a few features to GlareDB to support what I've been thinking of as "workload sharding." This grew out of some work we were doing internally with GlareDB, and since we found the pattern useful, we thought you might as well.
Here's the situation: you're running multiple copies of a stateless application that needs to periodically process some data from GlareDB, and you want to ensure that each application server processes an equal share of the data and that every piece of data is processed exactly once.
There are lots of ways, in theory, to solve this problem: a shared queue, a distributed lock, or a communication protocol between the applications. But all of these are complex and hard to get right, or require additional infrastructure to support the application tier. We could also imagine a range-based partitioning solution, where each application server gets its own range of values in the result set. This would need no extra infrastructure, but such schemes are difficult too: it's hard to choose ranges that evenly distribute the data, and challenging to manage the assignment of ranges to application servers.
Consider the following:
- compute a hash for a field in the dataset with high cardinality: a user id or other internal identifier, a timestamp, or similar. Render the hash as an unsigned 64-bit integer.
- modulo the hash value by the number (`n`) of divisions or workers you want to have. Every record or result will have one of the `n` values in `[0-n)`.
- every worker should have an ID in the set `[0-n)`, and you can select results based on the modulo of the hash.
In another form, the operation looks something like:
hash(value) % num_workers == worker_id
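To make the arithmetic concrete, here's a minimal sketch with made-up numbers: suppose a record's hash is 17 and there are 3 workers. Each comparison below stands in for a different worker evaluating the predicate with its own ID, and exactly one of them comes back true.

```sql
-- hash(value) = 17, num_workers = 3: only worker 2 claims this record.
SELECT 17 % 3 = 0 AS worker_0_takes_it,  -- false
       17 % 3 = 1 AS worker_1_takes_it,  -- false
       17 % 3 = 2 AS worker_2_takes_it;  -- true
```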
This has a few good properties:
- even distribution of work among the workers with fairly effective shuffling of the work,
- no coordination between workers is required; aside from knowing their own ID, individual workers can be entirely oblivious to the sharding, and
- if the value you hash is stable (like an account id), then the assignment of work to workers doesn't change over time.
There are also a couple of caveats:
- changing the size of the worker pool gets a bit harder than it would be otherwise, since adding or removing a worker impacts the configuration of all workers.
- "pinning" work to specific workers means that problems with specific workers will always affect the same portion of the workload. If one worker crashes or has a hardware bug, then the work won't be redistributed.
On balance, these trade-offs are acceptable. It's not the only way to partition a workload, but it's a pretty good one.
You can implement this in your application, but if you do, every application instance runs the same query, gets the full result set, and then ignores the majority of the results. While it's hard to avoid doing the work for the query on the database end, having the database do the sharding calculation meaningfully reduces the application's work.
In support of this pattern, we added `fnv()` (64-bit, 1a) and `siphash()` (2-4) functions to GlareDB, both of which return `uint64` values.
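On their own, these two functions are enough to express the scheme by hand. As a rough sketch (the `events` table and `user_id` column here are hypothetical), a worker with ID 1 in a pool of 3 could filter its share directly in the `WHERE` clause:

```sql
-- Worker 1 of 3: hash the high-cardinality column inside the database
-- and keep only the rows assigned to this worker.
SELECT *
FROM events
WHERE fnv(user_id) % 3 = 1;
```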
While these two functions are enough to implement this sharding scheme, we also added a `partition_results` function to GlareDB to make things even easier. The function returns a boolean, so you can use it in a `WHERE` clause, and the signature looks like:
partition_results(<value>, <num_partitions>, <partition_id>)
You can pass any scalar as the value, or a column holding a scalar value, and it does the right thing, including validating that the number of partitions is greater than the partition ID. You can see it at work in a query, as in:
SELECT u.account_id, u.access_token, messages.*
FROM users AS u
JOIN messages ON (messages.user_id = u.account_id)
WHERE (u.active = true
AND u.deleted = false
AND messages.sent = false
AND messages.destination IS NOT NULL
AND partition_results(u.account_id, 3, 1));
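If you want to sanity-check how evenly the work splits, you can group on the hash directly. This sketch reuses the schema above and counts how many active users land in each of three `fnv()`-based buckets (it uses `fnv()` explicitly; whether `partition_results` hashes the same way internally isn't something we rely on here):

```sql
-- Count active, non-deleted users per fnv()-based partition.
SELECT fnv(u.account_id) % 3 AS partition_id,
       count(*) AS user_count
FROM users AS u
WHERE u.active = true
  AND u.deleted = false
GROUP BY fnv(u.account_id) % 3
ORDER BY partition_id;
```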
We hope you find it as useful as we do!
Get started now
Ready to get the most out of your data? Get started with GlareDB locally, or spin up a deployment on GlareDB Cloud!