What is sharding?
Sharding is a crucial technique for database performance optimization, helping to improve the scalability of a distributed system. But what is sharding exactly, and how does sharding work? We'll go over all the details in this article.
What is sharding? How does sharding work?
Sharding is a database architecture pattern that splits a single database into smaller partitions known as "shards," each stored on a separate node. Each partition of the data is known as a "logical shard," and its storage within a node is known as a "physical shard."
Why should you use sharding?
Dividing a database into multiple shards helps with scalability and availability. In theory, sharding is repeatable: as data or traffic grows, you can keep adding shards to continue scaling the database horizontally.
Applications and websites that suddenly experience higher levels of traffic must adequately handle this increased demand - without breaking under the pressure. However, databases that reside on a single machine will eventually hit a physical limit for how many queries they can handle, or how much data they can store. Techniques such as sharding enable you to balance and scale the load across multiple machines.
Sharding also improves the availability of your databases and applications. If one machine goes down, only the shard on that particular machine will be inaccessible; the other shards on separate machines will continue operating as normal. What's more, you can back up each shard as a failsafe to further improve the database's availability.
Note that sharding does introduce certain difficulties and complexities. For example, database joins become more expensive when the rows involved live on different shards, because the query must fetch and combine data from multiple machines and pay the network latency between them.
What's the difference between sharding, partitioning, and replication?
If you've read the explanation above, you may wonder: "What's the difference between sharding and partitioning?" Sharding, partitioning, and replication are similar concepts, but with important differences between them. In fact, sharding may be considered a special class of partitioning.
Partitioning is defined as any division of a database into distinct parts, usually for reasons such as better performance and ease of management. Depending on the situation, you may use horizontal partitioning or vertical partitioning:
- Horizontal partitioning: Each partition uses the same database schema and has the same columns, but contains different rows.
- Vertical partitioning: Each partition is a proper subset of the original database schema - i.e. it contains all of the rows, but only a subset of the original columns.
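To make the two styles concrete, here is a minimal Java sketch over an in-memory list of rows. The `PartitionDemo` class, the `id`/`name`/`email` columns, and the modulo rule for assigning rows are all illustrative assumptions, not part of any particular database.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PartitionDemo {

    // Builds one row with three hypothetical columns.
    static Map<String, Object> row(int id, String name, String email) {
        Map<String, Object> r = new LinkedHashMap<>();
        r.put("id", id);
        r.put("name", name);
        r.put("email", email);
        return r;
    }

    // A tiny sample table used by main() below.
    static List<Map<String, Object>> sampleRows() {
        List<Map<String, Object>> rows = new ArrayList<>();
        rows.add(row(1, "Ada", "ada@example.com"));
        rows.add(row(2, "Bob", "bob@example.com"));
        rows.add(row(3, "Cy", "cy@example.com"));
        rows.add(row(4, "Di", "di@example.com"));
        return rows;
    }

    // Horizontal: same columns everywhere, rows split by id modulo the partition count.
    static List<List<Map<String, Object>>> horizontal(List<Map<String, Object>> rows, int parts) {
        List<List<Map<String, Object>>> out = new ArrayList<>();
        for (int i = 0; i < parts; i++) {
            out.add(new ArrayList<>());
        }
        for (Map<String, Object> r : rows) {
            int id = (Integer) r.get("id");
            out.get(Math.floorMod(id, parts)).add(r);
        }
        return out;
    }

    // Vertical: every row is kept, but only the requested columns.
    static List<Map<String, Object>> vertical(List<Map<String, Object>> rows, String... columns) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> r : rows) {
            Map<String, Object> slice = new LinkedHashMap<>();
            for (String c : columns) {
                slice.put(c, r.get(c));
            }
            out.add(slice);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(horizontal(sampleRows(), 2));
        System.out.println(vertical(sampleRows(), "id", "email"));
    }
}
```

In practice a vertical partition usually keeps the key column (here `id`) so that the slices can be joined back together.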
Sharding is usually a case of horizontal partitioning. There are multiple possible sharding schemes to determine how to partition the data in a database:
- Range-based sharding: The database is sharded based on ranges of a certain value, such as name or ID number. For example, a database of university students may be sharded based on the first letter of their last name, with A-H on one shard, I-P on another, and Q-Z on a third.
- Hash-based sharding: This technique is commonly used for key-value databases. The key of each database entry is passed through a hash function, and the result (typically taken modulo the number of shards) determines which shard the entry will be assigned to.
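Both schemes boil down to a function from a key to a shard number. The sketch below shows one possible version of each. The `ShardRouter` class, the three letter ranges, and the choice of CRC32 as the hash function are assumptions made for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

final class ShardRouter {

    // Range-based: a hypothetical split of last names into A-H, I-P, and Q-Z.
    static int rangeShard(String lastName) {
        if (lastName == null || lastName.isEmpty()) {
            throw new IllegalArgumentException("last name must not be empty");
        }
        char first = Character.toUpperCase(lastName.charAt(0));
        if (first < 'A' || first > 'Z') {
            throw new IllegalArgumentException("unsupported first letter: " + first);
        }
        if (first <= 'H') {
            return 0;
        }
        if (first <= 'P') {
            return 1;
        }
        return 2;
    }

    // Hash-based: hash the key, then take the result modulo the shard count.
    static int hashShard(String key, int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        // getValue() is an unsigned 32-bit value held in a long, so the result is never negative.
        return (int) (crc.getValue() % shardCount);
    }

    public static void main(String[] args) {
        System.out.println(rangeShard("Nguyen"));     // prints 1 (N falls in I-P)
        System.out.println(hashShard("user:42", 4));  // some shard in 0..3
    }
}
```

Range-based routing keeps neighboring keys together, which suits range queries but can produce uneven shards if the values cluster. Hash-based routing tends to spread keys more evenly, at the cost of scattering adjacent keys. Note that a plain modulo scheme like this one remaps most keys when the shard count changes.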
"Replication", meanwhile, is simply a term for copying or backing up the information in a database to another location. Sharding and replication are separate, but complementary, strategies for improving database availability. For example, each shard can also be replicated to a backup database in the event the primary shard goes down.
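As a rough sketch of how the two strategies combine, the class below models a single shard with a primary copy and one replica. The `ReplicatedShard` name, the in-memory maps, and the synchronous copy on every write are simplifying assumptions. A real system would replicate over the network and promote the replica on failure.

```java
import java.util.HashMap;
import java.util.Map;

final class ReplicatedShard {
    private final Map<String, String> primary = new HashMap<>();
    private final Map<String, String> replica = new HashMap<>();
    private boolean primaryUp = true;

    // Writes go to the primary and are copied to the replica right away
    // (synchronous replication is assumed here for simplicity).
    void put(String key, String value) {
        if (!primaryUp) {
            throw new IllegalStateException("primary is down; this sketch does not promote the replica");
        }
        primary.put(key, value);
        replica.put(key, value);
    }

    // Reads prefer the primary and fall back to the replica if the primary is down.
    String get(String key) {
        return primaryUp ? primary.get(key) : replica.get(key);
    }

    // Simulates the primary node failing.
    void failPrimary() {
        primaryUp = false;
    }

    public static void main(String[] args) {
        ReplicatedShard shard = new ReplicatedShard();
        shard.put("user:1", "Ada");
        shard.failPrimary();
        System.out.println(shard.get("user:1")); // prints "Ada", served by the replica
    }
}
```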
Sharding in Redis
Redis is an open-source, in-memory data structure store that is frequently used to implement key-value databases and caches. Sharding is an essential technique for improving the scalability and availability of Redis deployments. Even though Redis is a non-relational database, sharding is still possible by distributing disjoint subsets of the database keys into separate shards.
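As one concrete illustration, Redis Cluster (the sharding mode built into open-source Redis) maps every key to one of 16384 hash slots by taking the CRC16 of the key modulo 16384, and each master node serves a subset of those slots. If a key contains a non-empty `{tag}`, only the tag is hashed, so related keys can be forced onto the same slot. The sketch below reimplements that calculation in plain Java. The `RedisSlot` class and its method names are hypothetical, and in a real deployment the client library (or the `CLUSTER KEYSLOT` command) does this for you.

```java
import java.nio.charset.StandardCharsets;

final class RedisSlot {
    static final int SLOT_COUNT = 16384;

    // Bitwise CRC16 (XMODEM variant: polynomial 0x1021, initial value 0).
    static int crc16(byte[] data) {
        int crc = 0;
        for (byte b : data) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                crc = ((crc & 0x8000) != 0) ? ((crc << 1) ^ 0x1021) : (crc << 1);
                crc &= 0xFFFF;
            }
        }
        return crc;
    }

    // Returns the part of the key that gets hashed: the text between the first '{'
    // and the next '}' if that text is non-empty, otherwise the whole key.
    static String hashTag(String key) {
        int open = key.indexOf('{');
        if (open >= 0) {
            int close = key.indexOf('}', open + 1);
            if (close > open + 1) {
                return key.substring(open + 1, close);
            }
        }
        return key;
    }

    // Hash slot for a key: CRC16 of the hashed part, modulo 16384.
    static int slot(String key) {
        return crc16(hashTag(key).getBytes(StandardCharsets.UTF_8)) % SLOT_COUNT;
    }

    public static void main(String[] args) {
        System.out.println(slot("foo")); // prints 12182
        // Keys sharing the tag {user1000} land in the same slot.
        System.out.println(slot("{user1000}.following") == slot("{user1000}.followers")); // prints true
    }
}
```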
Because Redis doesn't ship with a Java client, many Redis developers use a third-party Redis Java client such as Redisson. The good news is that the Redisson PRO edition supports sharding for Java objects in Redis.
By default, Redisson splits data across 231 partitions; the minimum allowed is 3 partitions. Redisson attempts to distribute these partitions evenly across all Redis cluster nodes. For example, if you have 231 partitions and 4 master nodes in a Redis cluster, each node will hold roughly 57 or 58 partitions (231 / 4 = 57.75).
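The arithmetic behind that example can be checked with a few lines of Java. This sketch assumes a simple round-robin assignment of partitions to nodes. That is only an illustration of an even split, not a description of Redisson's actual placement algorithm, and the `PartitionAssignment` class is hypothetical.

```java
import java.util.Arrays;

final class PartitionAssignment {

    // Counts how many partitions each node receives when partition p goes to node p % nodes.
    static int[] partitionsPerNode(int partitions, int nodes) {
        int[] counts = new int[nodes];
        for (int p = 0; p < partitions; p++) {
            counts[p % nodes]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // 231 = 4 * 57 + 3, so three nodes get one extra partition.
        System.out.println(Arrays.toString(partitionsPerNode(231, 4))); // prints [58, 58, 58, 57]
    }
}
```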