MongoDB – Sharding

Sharding is the process of storing data records across multiple nodes in order to cope with data growth. MongoDB solves this problem with horizontal scaling, through its sharding mechanism.

Compared with a plain replica set there are some changes from the client perspective. The client no longer talks to the mongod instances directly. Instead, a new component, mongos, is added. Mongos knows on which shard (data partition) the data is stored and routes the queries to the appropriate mongod processes. Usually mongos runs alongside the application client or on a very light environment, as its only job is to route queries.
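
Just to illustrate the idea (the hostnames below are placeholders, not part of the actual setup), the client side talks only to mongos, exactly as it would to a standalone mongod:

    # connect the mongo shell to the mongos router instead of an individual shard member
    mongo --host mongos.example.local --port 27017

    # drivers use a connection string that lists one or more mongos routers
    # mongodb://mongos1.example.local:27017,mongos2.example.local:27017/mydb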

Mongos uses the config servers (at least 3 servers should be deployed for reliability) to retrieve metadata about the sharded data. Each config server stores the same configuration.

MongoDB 3.2 deprecates the use of three mirrored mongod instances for config servers.

Starting in MongoDB 3.2, config servers for sharded clusters can be deployed as a replica set. The replica set config servers must run the WiredTiger storage engine.

A complete replica set is required for each shard to provide reliability and fail-over at the data level.

The balancer will ensure an even distribution of data (chunks) across the shards, based on the metadata kept on the config servers.

Shard key – a field (or a compound of fields) in the documents that will be used to partition the data.

Shard key selection:

  • the shard key should be common in the queries against the collection
  • good “cardinality” / granularity
  • consider compound shard keys (see the sketch after this list)
  • is the key monotonically increasing? (e.g. _id, timestamp)
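
Just to make the last two points concrete (the database and collection names below are invented for illustration, and sharding is assumed to be already enabled for the database), both styles go through sh.shardCollection():

    // ranged sharding on a compound shard key
    sh.shardCollection("mydb.events", { customer_id: 1, created_at: 1 })

    // a hashed shard key spreads a monotonically increasing field (e.g. _id)
    // across the shards instead of always appending to the "last" chunk
    sh.shardCollection("mydb.logs", { _id: "hashed" })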

I’m going to use screen under Xubuntu.
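
The exact start-up commands aren’t preserved in this write-up, so here is only a sketch of how the three rs0 members could be launched in detached screen sessions; the ports (27011–27013) and data paths are placeholders, not the original values:

    # one detached screen session per rs0 member
    mkdir -p /data/rs0-a /data/rs0-b /data/rs0-c
    screen -dmS rs0-a mongod --replSet rs0 --port 27011 --dbpath /data/rs0-a
    screen -dmS rs0-b mongod --replSet rs0 --port 27012 --dbpath /data/rs0-b
    screen -dmS rs0-c mongod --replSet rs0 --port 27013 --dbpath /data/rs0-c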

The output should look like below:

Let’s connect to the node rs1 and configure replica set rs0:
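
The original commands aren’t shown here, so the following is only a sketch of the initiation, reusing the placeholder members from the previous step:

    // run from a mongo shell connected to one of the rs0 members
    rs.initiate({
      _id: "rs0",
      members: [
        { _id: 0, host: "localhost:27011" },
        { _id: 1, host: "localhost:27012" },
        { _id: 2, host: "localhost:27013" }
      ]
    })
    rs.status()   // verify that a primary has been elected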

The output for rs.status() should look like below:

We’re going to import data into rs0, on the primary node:
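
The actual data set isn’t listed in this post, so purely as an example (the database, collection and file names are made up) a JSON file can be loaded into the primary with mongoimport:

    # load sample JSON documents into mydb.weblogs on the rs0 primary
    mongoimport --host localhost --port 27011 --db mydb --collection weblogs --file weblogs.json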

Let’s create the second replica set rs1 by following the same steps:

The output so far:

Now let’s configure replica set rs1 by connecting to its primary node:
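
Again only a sketch, mirroring the rs0 layout with assumed ports 27021–27023:

    // from a mongo shell connected to the member that should become the rs1 primary
    rs.initiate({
      _id: "rs1",
      members: [
        { _id: 0, host: "localhost:27021" },
        { _id: 1, host: "localhost:27022" },
        { _id: 2, host: "localhost:27023" }
      ]
    })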

The output is similar to replica set rs0:

Now it’s time to create the config servers:
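
A sketch of this step, deploying the config servers as a replica set as described earlier; the replica set name (cfg), ports and paths are chosen only for illustration:

    # three config server members started as the "cfg" replica set
    mkdir -p /data/cfg-a /data/cfg-b /data/cfg-c
    screen -dmS cfg-a mongod --configsvr --replSet cfg --port 27031 --dbpath /data/cfg-a
    screen -dmS cfg-b mongod --configsvr --replSet cfg --port 27032 --dbpath /data/cfg-b
    screen -dmS cfg-c mongod --configsvr --replSet cfg --port 27033 --dbpath /data/cfg-c

    # then, from a mongo shell connected to one of them:
    #   rs.initiate({ _id: "cfg", configsvr: true, members: [ /* the three members */ ] })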

The output should be similar to below:

Now it’s time to start the mongos server. This will be used by all clients to query our MongoDB sharded platform.
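
A sketch of that, reusing the placeholder hosts and ports from the previous sketches:

    # start mongos and point it at the config server replica set
    screen -dmS mongos mongos --configdb cfg/localhost:27031,localhost:27032,localhost:27033 --port 27017

    # register both shards from a mongo shell connected to mongos
    mongo --port 27017 --eval 'sh.addShard("rs0/localhost:27011"); sh.addShard("rs1/localhost:27021")'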

Before sharding the collection we create an index on the shard key:
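
For example, assuming the imported collection is mydb.weblogs (the real names aren’t shown above), the index would be:

    // compound index matching the intended shard key
    use mydb
    db.weblogs.createIndex({ request_ip: 1, _id: 1 })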

We’re ready to shard our collection by request_ip and _id.
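
A sketch of the sharding commands, run against mongos and using the same assumed mydb.weblogs names:

    // sharding is enabled per database first, then per collection
    sh.enableSharding("mydb")
    sh.shardCollection("mydb.weblogs", { request_ip: 1, _id: 1 })

    // sh.status() lists the shards and the current chunk distribution
    sh.status()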

The initial distribution of data:

At this stage rs0 is holding all of our data. Try to generate some additional data to see how the cluster balances it across the shards.
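
One way to do that is a throwaway insert loop (purely illustrative) followed by a look at the distribution:

    // insert synthetic documents through mongos
    for (var i = 0; i < 100000; i++) {
      db.weblogs.insert({ request_ip: "10.0." + (i % 256) + "." + (i % 250), ts: new Date() })
    }

    // per-shard document and chunk statistics
    db.weblogs.getShardDistribution()
    sh.status()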

Best practices on sharding:

  • only shard the big collections
  • pick the shard key carefully; it isn’t easily changeable (changing it requires creating a new collection and copying the data over)
  • consider pre-splitting chunks when doing bulk inserts (see the sketch after this list)
  • be aware of monotonically increasing shard keys (e.g. timestamp, _id)
  • adding shards is easy, but rebalancing data onto them isn’t instantaneous
  • always connect to a mongos instance; connect directly to a mongod only for DBA operations that have to target a specific server
  • keep non-mongos processes off port 27017 to avoid mistakes
  • use logical server names (e.g. CNAMEs) for the config servers instead of IP addresses or physical hostnames.
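
To illustrate the pre-split advice above (the chunk boundaries here are invented; real ones should follow the expected key distribution):

    // create chunk boundaries before a bulk load; the balancer (or a manual
    // sh.moveChunk) can then spread the empty chunks across the shards
    sh.splitAt("mydb.weblogs", { request_ip: "10.0.64.0",  _id: MinKey })
    sh.splitAt("mydb.weblogs", { request_ip: "10.0.128.0", _id: MinKey })
    sh.splitAt("mydb.weblogs", { request_ip: "10.0.192.0", _id: MinKey })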
