Sharding in YugabyteDB: Understanding Hash and Range Partitioning

This post compares YugabyteDB's two sharding methods, shows SQL examples of each, and explains when to use each one. It also covers the underlying hardware and DocDB considerations that come into play when sharding in YugabyteDB.

Sharding is a technique for horizontally partitioning data across multiple servers to improve performance and scalability. It enables more efficient use of resources and better performance when working with large amounts of data. YugabyteDB supports two types of sharding: hash-based and range-based. In this post, we will explore the differences between the two and give examples of when to use each one.

Hash-based Sharding

  • Hash-based sharding is the default sharding method in YugabyteDB
  • A hash function is applied to the partition key, and the result determines which tablet (and therefore which server) stores the row
  • Because the hash function spreads keys uniformly across all tablets, this strategy suits workloads with a high volume of reads and writes on individual keys
  • To put this in the context of CPUs, nodes, and hardware: consider a YugabyteDB cluster with 3 nodes of 4 cores each, for 12 cores total. With hash sharding, the data is spread evenly across all nodes and tablets, so the load is balanced across all 12 cores
  • This method is useful for use cases where the distribution of data is unknown or unpredictable
  • Example:
    • CREATE TABLE orders (order_id INT, customer_name VARCHAR(255), PRIMARY KEY (order_id HASH));
    • Since hash sharding is the default in YSQL, PRIMARY KEY (order_id) behaves the same way
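To make the "uniform distribution" claim concrete, here is a small conceptual sketch of hash sharding. It is not YugabyteDB's internal implementation: the MD5-based hash and the tablet count are stand-ins, though the 16-bit hash space (0x0000–0xFFFF) mirrors the one described in the YugabyteDB docs.

```python
import hashlib

NUM_TABLETS = 12          # e.g. one tablet per core on a 3-node x 4-core cluster
HASH_SPACE = 0x10000      # keys are hashed into a 16-bit space (0x0000-0xFFFF)

def tablet_for(key):
    """Hash the partition key into the 16-bit space, then map that
    value onto the slice of the space each tablet owns."""
    h = int.from_bytes(hashlib.md5(str(key).encode()).digest()[:2], "big")
    return h * NUM_TABLETS // HASH_SPACE

# Sequential order IDs scatter across tablets instead of piling onto one:
counts = [0] * NUM_TABLETS
for order_id in range(10_000):
    counts[tablet_for(order_id)] += 1
print(counts)  # roughly 10_000 / 12 per tablet
```

Even though the order IDs are perfectly sequential, each tablet ends up with roughly the same number of rows, which is why hash sharding balances write load across all cores in the cluster.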

Range-based Sharding

  • Range-based sharding allows explicit control over the split points for data partitioning
  • The user specifies the split points, and the table's key space is divided into a set of contiguous ranges based on the partition key, with each range assigned to a specific tablet
  • Because rows with nearby key values live in the same tablet, this strategy is efficient for workloads with many range scans, at the cost of a potentially non-uniform data distribution
  • To put this in the context of CPUs, nodes, and hardware: on the same 3-node, 12-core cluster, range sharding assigns each range to a specific tablet, so a range scan only touches the tablets that hold the relevant keys instead of fanning out to the whole cluster
  • This method is useful for use cases where the distribution of data is known and predictable
  • Example:
    • CREATE TABLE sales (sale_id INT, date DATE, PRIMARY KEY (date ASC, sale_id)) SPLIT AT VALUES (('2022-01-01'), ('2022-02-01'));
    • The ASC on the leading key column selects range sharding, and the two split points produce three tablets
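The split-point logic can be sketched in a few lines. This is a conceptual model, not YugabyteDB's implementation; the split dates match the CREATE TABLE example above, which yields three tablets covering (-inf, 2022-01-01), [2022-01-01, 2022-02-01), and [2022-02-01, +inf).

```python
import bisect
from datetime import date

# Split points from the CREATE TABLE example above
SPLIT_POINTS = [date(2022, 1, 1), date(2022, 2, 1)]

def tablet_for(sale_date):
    """A row goes to the tablet whose contiguous range contains its key."""
    return bisect.bisect_right(SPLIT_POINTS, sale_date)

def tablets_for_scan(start, end):
    """A range scan only touches the tablets whose ranges overlap [start, end]."""
    return list(range(tablet_for(start), tablet_for(end) + 1))

print(tablet_for(date(2021, 12, 25)))                          # 0
print(tablet_for(date(2022, 1, 15)))                           # 1
print(tablets_for_scan(date(2022, 1, 10), date(2022, 1, 20)))  # [1]
```

Note how a scan over a narrow date window resolves to a single tablet; this is the property that makes range sharding efficient for range queries.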

When to use Hash-based Sharding

  • Use hash-based sharding when the distribution of data is unknown or unpredictable
  • Use hash-based sharding when dealing with use cases that involve a high number of writes
  • Use hash-based sharding when dealing with use cases that have a high number of concurrent users

When to use Range-based Sharding

  • Use range-based sharding when the distribution of data is known and predictable
  • Use range-based sharding when dealing with use cases that involve a high number of range queries
  • Use range-based sharding when dealing with use cases that require efficient scans between point A and point B

At the DocDB layer, both strategies decide how a table's rows map to tablets, and each has trade-offs. Hash sharding is generally more efficient for point reads and writes and is more resilient to hotspots; range sharding is more efficient for range scans and gives more control over data partitioning.
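The hotspot trade-off can be seen by feeding the same burst of monotonically increasing keys (think timestamps or sequence numbers) through both schemes. This is a toy simulation with made-up tablet counts and split points, not YugabyteDB internals:

```python
import bisect
import hashlib

NUM_TABLETS = 4
RANGE_SPLITS = [2500, 5000, 7500]  # hypothetical split points for keys 0..9999

def hash_tablet(key):
    # stand-in hash: 16-bit value mapped onto NUM_TABLETS slices
    h = int.from_bytes(hashlib.md5(str(key).encode()).digest()[:2], "big")
    return h * NUM_TABLETS // 0x10000

def range_tablet(key):
    return bisect.bisect_right(RANGE_SPLITS, key)

# A burst of monotonically increasing keys:
burst = range(9_000, 10_000)
hash_hits = {hash_tablet(k) for k in burst}
range_hits = {range_tablet(k) for k in burst}
print(sorted(hash_hits))   # all 4 tablets share the write load
print(sorted(range_hits))  # [3] -- every write lands on the last tablet
```

Under range sharding, every insert in the burst hits the final tablet (a classic write hotspot), while hash sharding spreads the same burst across all tablets.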

Conclusion

Sharding is a powerful technique for improving the scalability and performance of your database. In YugabyteDB, you can choose between hash-based and range-based sharding depending on the specific needs of your use case. The choice between the two will depend on the distribution of your data, the type of queries you will be running, and the performance requirements of your application.

Reference:

https://docs.yugabyte.com/preview/architecture/docdb-sharding/sharding/
