Sharding in YugabyteDB: Understanding Hash and Range Partitioning

This post compares YugabyteDB's two sharding methods, shows SQL examples of each, and explains when to use each one. It also covers the underlying hardware and DocDB considerations that come into play when sharding in YugabyteDB.

Sharding is a technique for horizontally partitioning data across multiple servers to improve performance and scalability. It enables more efficient use of resources and better performance when working with large amounts of data. YugabyteDB supports two types of sharding: hash-based and range-based. In this post, we will explore the differences between the two and give examples of when to use each one.

Hash-based Sharding

  • Hash-based sharding is the default sharding method in YugabyteDB
  • A hash function is applied to the partition key, and the result determines which tablet (and therefore which server) stores the row
  • Because the hash function spreads keys uniformly across all tablets, this strategy suits workloads with a high volume of reads and writes on individual keys
  • To put this in the context of CPUs, nodes, and hardware: consider a YugabyteDB cluster with 3 nodes of 4 cores each, for 12 cores total. With hash sharding, the data is spread evenly across all nodes and tablets, so the load is balanced across all 12 cores
  • This method is useful for use cases where the distribution of data is unknown or unpredictable
  • Example:
    • CREATE TABLE orders (order_id INT, customer_name VARCHAR(255), PRIMARY KEY (order_id HASH));
    • Since hash sharding is the default in YSQL, PRIMARY KEY (order_id) behaves the same way
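To make the "uniform distribution" claim concrete, here is a small conceptual sketch of hash sharding. It is not YugabyteDB's internal implementation: the MD5-based hash and the tablet count are stand-ins, though the 16-bit hash space (0x0000–0xFFFF) mirrors the one described in the YugabyteDB docs.

```python
import hashlib

NUM_TABLETS = 12          # e.g. one tablet per core on a 3-node x 4-core cluster
HASH_SPACE = 0x10000      # keys are hashed into a 16-bit space (0x0000-0xFFFF)

def tablet_for(key):
    """Hash the partition key into the 16-bit space, then map that
    value onto the slice of the space each tablet owns."""
    h = int.from_bytes(hashlib.md5(str(key).encode()).digest()[:2], "big")
    return h * NUM_TABLETS // HASH_SPACE

# Sequential order IDs scatter across tablets instead of piling onto one:
counts = [0] * NUM_TABLETS
for order_id in range(10_000):
    counts[tablet_for(order_id)] += 1
print(counts)  # roughly 10_000 / 12 per tablet
```

Even though the order IDs are perfectly sequential, each tablet ends up with roughly the same number of rows, which is why hash sharding balances write load across all cores in the cluster.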

Range-based Sharding

  • Range-based sharding allows explicit control over the split points for data partitioning
  • The user specifies the split points, and the table's key space is divided into a set of contiguous ranges based on the partition key, with each range assigned to a specific tablet
  • Because rows with nearby key values live in the same tablet, this strategy is efficient for workloads with many range scans, at the cost of a potentially non-uniform data distribution
  • To put this in the context of CPUs, nodes, and hardware: on the same 3-node, 12-core cluster, range sharding assigns each range to a specific tablet, so a range scan only touches the tablets that hold the relevant keys instead of fanning out to the whole cluster
  • This method is useful for use cases where the distribution of data is known and predictable
  • Example:
    • CREATE TABLE sales (sale_id INT, date DATE, PRIMARY KEY (date ASC, sale_id)) SPLIT AT VALUES (('2022-01-01'), ('2022-02-01'));
    • The ASC on the leading key column selects range sharding, and the two split points produce three tablets
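The split-point logic can be sketched in a few lines. This is a conceptual model, not YugabyteDB's implementation; the split dates match the CREATE TABLE example above, which yields three tablets covering (-inf, 2022-01-01), [2022-01-01, 2022-02-01), and [2022-02-01, +inf).

```python
import bisect
from datetime import date

# Split points from the CREATE TABLE example above
SPLIT_POINTS = [date(2022, 1, 1), date(2022, 2, 1)]

def tablet_for(sale_date):
    """A row goes to the tablet whose contiguous range contains its key."""
    return bisect.bisect_right(SPLIT_POINTS, sale_date)

def tablets_for_scan(start, end):
    """A range scan only touches the tablets whose ranges overlap [start, end]."""
    return list(range(tablet_for(start), tablet_for(end) + 1))

print(tablet_for(date(2021, 12, 25)))                          # 0
print(tablet_for(date(2022, 1, 15)))                           # 1
print(tablets_for_scan(date(2022, 1, 10), date(2022, 1, 20)))  # [1]
```

Note how a scan over a narrow date window resolves to a single tablet; this is the property that makes range sharding efficient for range queries.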

When to use Hash-based Sharding

  • Use hash-based sharding when the distribution of data is unknown or unpredictable
  • Use hash-based sharding when dealing with use cases that involve a high number of writes
  • Use hash-based sharding when dealing with use cases that have a high number of concurrent users

When to use Range-based Sharding

  • Use range-based sharding when the distribution of data is known and predictable
  • Use range-based sharding when dealing with use cases that involve a high number of range queries
  • Use range-based sharding when dealing with use cases that require efficient scans between point A and point B

At the DocDB layer, both strategies decide how a table's rows map to tablets, and each has trade-offs. Hash sharding is generally more efficient for point reads and writes and is more resilient to hotspots; range sharding is more efficient for range scans and gives more control over data partitioning.
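The hotspot trade-off can be seen by feeding the same burst of monotonically increasing keys (think timestamps or sequence numbers) through both schemes. This is a toy simulation with made-up tablet counts and split points, not YugabyteDB internals:

```python
import bisect
import hashlib

NUM_TABLETS = 4
RANGE_SPLITS = [2500, 5000, 7500]  # hypothetical split points for keys 0..9999

def hash_tablet(key):
    # stand-in hash: 16-bit value mapped onto NUM_TABLETS slices
    h = int.from_bytes(hashlib.md5(str(key).encode()).digest()[:2], "big")
    return h * NUM_TABLETS // 0x10000

def range_tablet(key):
    return bisect.bisect_right(RANGE_SPLITS, key)

# A burst of monotonically increasing keys:
burst = range(9_000, 10_000)
hash_hits = {hash_tablet(k) for k in burst}
range_hits = {range_tablet(k) for k in burst}
print(sorted(hash_hits))   # all 4 tablets share the write load
print(sorted(range_hits))  # [3] -- every write lands on the last tablet
```

Under range sharding, every insert in the burst hits the final tablet (a classic write hotspot), while hash sharding spreads the same burst across all tablets.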

Conclusion

Sharding is a powerful technique for improving the scalability and performance of your database. In YugabyteDB, you can choose between hash-based and range-based sharding depending on the specific needs of your use case. The choice between the two will depend on the distribution of your data, the type of queries you will be running, and the performance requirements of your application.

Reference:

https://docs.yugabyte.com/preview/architecture/docdb-sharding/sharding/
