Towards Auto-sharding in Your Node.js App

23 Aug

Introduction

I’ve made a node.js 3rd party module to help you auto-shard your data to any number or type of data stores using an algorithm known as consistent hashing. If you understand that sentence, you can skip this article and check out the node-hash-ring module on github. If you want to read more, I’ve divided this post into the following sections:

  1. First, I’ll explain sharding and why you (eventually) need it.
  2. Then, I’ll describe consistent hashing as one approach to sharding.
  3. Finally, I’ll introduce my new node module (node-hash-ring).

Sharding

As a hot startup, you discover that your database is starting to slow down and/or that your memcache instance is no longer large enough to store what you ideally want in memory. You can solve this by adding a 2nd server and then letting half your data live on server 1 and the other half live on server 2. This is sharding.

In order to shard your data across 2 or more servers, you need to choose how to pick the right server for a particular key. For example, if you’re updating a user with an id of 101, you need to choose an algorithm that will determine on which server user 101 is stored or held in memory. There are several strategies to do this [1]. For example, you could keep a lookup table of all user ids and their corresponding servers. Another popular approach uses a concept known as consistent hashing.

Consistent Hashing

A naive way to shard data is to mod the incoming id with the number of servers. For instance, with 6 servers, a user with id 9 is sharded to server 3 (9 % 6 = 3), while a user with id 12 is sharded to server 0 (12 % 6 = 0). This will work…that is, until you need to add another server. Then, a lot of your mapping is invalidated. Now, with 7 instead of 6 servers:

  • User 6 now shards to server 6 instead of server 0.
  • User 7 now shards to server 0 instead of server 1.
  • User 8 now shards to server 1 instead of server 2.
  • User 9 now shards to server 2 instead of server 3.
  • etc. (you get the point)

As you can see, you end up having to move a lot of data around so that when you look up user 9, it's where you expect it to be (on server 2). The short sketch below illustrates the problem.
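A minimal JavaScript sketch (the server names are made up for illustration):

// Naive mod-based sharding: pick a server by taking the id modulo the
// number of servers. It works fine until the server count changes.
var servers = ["server0", "server1", "server2", "server3", "server4", "server5"];

function serverForId(id) {
  return servers[id % servers.length];
}

console.log(serverForId(9)); // "server3" with 6 servers

// Add a 7th server and most ids now map somewhere else:
servers.push("server6");
console.log(serverForId(9)); // "server2" -- user 9's data has to move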

Consistent hashing to the rescue. It enables you to grow or shrink your server cluster with a LOT LESS re-mapping of data. Here's how it works (a toy code sketch follows the list):

  • Setup
    1. Imagine a clock face (aka a “continuum”, aka a “hash ring”) whose edge contains points between 0 and (e.g.) 32767.
    2. For each server in your list, hash EACH server to several “virtual points” on that ring. The result is a circle of numbered virtual points, each linking back to the server it was hashed from.
  • Mapping a key (e.g., user id 9) to a server
    1. Hash your key onto the same ring using the same hashing function that mapped the servers onto the ring. This produces a number that you can pinpoint on the ring.
    2. Moving clockwise, find the first virtual point whose number is at or above your key's number (wrapping around to the smallest virtual point if there is none).
    3. The server linked to from that virtual point is the server to which the key maps.
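To make those steps concrete, here's a toy hash ring in JavaScript. This is not node-hash-ring's implementation; the hash function, the 0 to 32767 range, and the virtual-point naming are arbitrary choices for illustration only.

// Toy string hash onto the ring's 0..32767 range (illustrative only).
function toyHash(str) {
  var h = 0;
  for (var i = 0; i < str.length; i++) {
    h = (h * 31 + str.charCodeAt(i)) % 65521;
  }
  // Final scramble so similar strings (e.g. "users:101" and "users:102")
  // land far apart on the ring instead of right next to each other.
  return ((h * 40503) % 65521) % 32768;
}

// Toy ring: hash each server to several virtual points, keep them sorted,
// and map a key to the first virtual point at or above the key's hash.
function ToyRing(servers, virtualPointsPerServer) {
  this.points = []; // e.g. { point: 12345, server: "127.0.0.1:8080" }
  for (var i = 0; i < servers.length; i++) {
    for (var v = 0; v < virtualPointsPerServer; v++) {
      this.points.push({ point: toyHash(servers[i] + "#" + v), server: servers[i] });
    }
  }
  this.points.sort(function (a, b) { return a.point - b.point; });
}

ToyRing.prototype.getNode = function (key) {
  var h = toyHash(key);
  for (var i = 0; i < this.points.length; i++) {
    if (this.points[i].point >= h) return this.points[i].server;
  }
  return this.points[0].server; // wrapped past the top of the ring
};

var toyRing = new ToyRing(["127.0.0.1:8080", "127.0.0.2:8080", "127.0.0.3:8080"], 128);
console.log(toyRing.getNode("users:101")); // one of the three servers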

How does this reduce the amount of data that you have to remap whenever you expand or contract your cloud? I’ve borrowed the following images from Tom White’s consistent hashing post in order to illustrate exactly how. For simplicity, we’ll assume each server is hashed to only one virtual point.

Suppose you have 3 servers (A, B, C) and 4 keys of interest (1, 2, 3, 4). If they hash according to the image below, then 2 will belong to B; 3 will belong to C; 4 and 1 will belong to A.

Now suppose you remove server C and add a new server D, as shown in the second image. The only data that has to be re-mapped is key 3, which now belongs to server D.

Simple enough, right? Except that in a real implementation of consistent hashing, we hash each server to more than one virtual point. Why do we do that? Because it distributes the incoming keys more evenly across all servers. In the second image, you can see geometrically that the arc B->D is longer than the arc D->A, and the arc D->A is longer than the arc A->B. Since a key maps to the first virtual point clockwise from where it hashes, a longer arc means a larger share of keys. So if keys hash uniformly over the ring, more of them will end up on D than on A, and more on A than on B. To avoid this unequal distribution, we hash each server to over 100 virtual points; in practice, this achieves an even distribution of keys among the servers.
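Using the toy ring sketched earlier, you can see the effect of virtual points by counting where a batch of keys lands, once with 1 virtual point per server and once with 128 (exact numbers will vary with the toy hash, but the trend should hold):

// Count how many of 10,000 keys each server receives for a given number
// of virtual points per server, using the ToyRing sketch from above.
function keyDistribution(virtualPointsPerServer) {
  var r = new ToyRing(["127.0.0.1:8080", "127.0.0.2:8080", "127.0.0.3:8080"],
                      virtualPointsPerServer);
  var counts = {};
  for (var i = 0; i < 10000; i++) {
    var server = r.getNode("users:" + i);
    counts[server] = (counts[server] || 0) + 1;
  }
  return counts;
}

console.log(keyDistribution(1));   // typically quite lopsided
console.log(keyDistribution(128)); // typically much closer to a three-way split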

node-hash-ring

node-hash-ring brings consistent hashing to node.js in an easy-to-use module. I’ve implemented it as a C++ add-on — in other words, it’s a node.js wrapper around C++ code. Here’s an example:

var HashRing = require("./hash_ring");

// Map each server to its weight.
var ring = new HashRing({
  "127.0.0.1:8080": 1,
  "127.0.0.2:8080": 2,
  "127.0.0.3:8080": 1
});

var key = "users:101";
var serverForKey = ring.getNode(key); // the server that "users:101" shards to

You instantiate a HashRing instance with a hash that lists the servers and their weights. Weights work in the following way: server 127.0.0.2:8080 will store approximately twice as much data as either of the other two servers listed. Weights are useful when the servers in your cluster don't all have the same capacity (e.g., 127.0.0.2 might have twice the memory of either 127.0.0.1 or 127.0.0.3).
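You can sanity-check the weighting by sharding a batch of made-up keys and counting where they land:

// Shard 10,000 keys with the ring defined above and tally the results.
// With the 1/2/1 weights, 127.0.0.2:8080 should receive roughly half the
// keys and the other two servers roughly a quarter each.
var counts = {};
for (var i = 0; i < 10000; i++) {
  var server = ring.getNode("users:" + i);
  counts[server] = (counts[server] || 0) + 1;
}
console.log(counts);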

My implementation is based on:

I'll soon be releasing a library that depends on this: my upcoming redis client for node.js, which will support transactions, pipelining, and a distributed API. Stay tuned to see how node-hash-ring is used with a database. If you end up using this in your node project, let me know in the comments below, and I can link to your implementation from this post. Happy coding!

[1] Max Indelicatos wrote an excellent guide covering sharding strategies on his blog.

A Must Have Redis Patch

15 Jul

(or “how to get the whole hash with SORT…BY…GET”)

Suppose you’ve built Reddit on top of Redis. Using Redis’ built-in hash data structure, you might store a news item under the key “items:78”, where 78 is the unique id of that news item:

"items:78": { id: 78, title: "My Blog", url: "http://ngchi.wordpress.com" }

You would then store your home page as a Redis list data structure of these unique ids under the key “home.page”:

"home.page": [125, 132, 143, 113, 78]   // References the top 5 links

From your application code, you’ll want an array of the home page’s top 5 articles in the following easy-consumption format:

[ { id: 125, title: "Redis is an awesome database", url: "http://code.google.com/p/redis" }, ..., { id: 78, title: "My Blog", url: "http://ngchi.wordpress.com" } ]

With the current Redis release, you have to invoke 3 different commands to retrieve the id, title, and url fields respectively, and then arrange the results into the desired array at the application level:

SORT home.page BY nosort GET items:*->id

SORT home.page BY nosort GET items:*->title

SORT home.page BY nosort GET items:*->url

For hashes that each have N different key/value pairs, this means we need to invoke N commands to grab N parallel lists of values! That's unnecessary work. All we really want is 1 command that grabs 1 list of items, each with its N key/value pairs. We can do better.
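For reference, the application-level stitching looks something like this (a JavaScript sketch; ids, titles, and urls are hypothetical names for the three reply lists returned by the SORT commands above):

// Zip the three parallel reply lists back into the array of hashes we
// actually wanted in the first place.
function zipItems(ids, titles, urls) {
  var items = [];
  for (var i = 0; i < ids.length; i++) {
    items.push({ id: ids[i], title: titles[i], url: urls[i] });
  }
  return items;
}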

Hence, I patched Redis so that it does in 1 command what it took 3 commands above to do. The command we want is now simply:

SORT home.page BY nosort GET items:*

This was a single patch to the Redis server code, but the client code also needs to be patched to parse the reply that SORT now returns. With the right client implementation, we can retrieve the desired array of hashes directly in our application code. I therefore also made a corresponding patch to Ezra's redis-rb Ruby Redis client.

You can give the new capability a try by downloading my patched redis branch and my patched redis-rb (Redis Ruby client) branch.

See this gist for more details.

I’ll be making pull requests to have these patches integrated into the main branch of Redis and then redis-rb. As a disclaimer, this is my first time hacking a large C project in a while, so use at your own risk, and let me know if you uncover any bugs/memory leaks that my patch causes.
