Introduction
PostgreSQL is a powerful open-source relational database management system that is widely used in various applications. However, when it comes to distributed SQL environments like Yugabyte, optimizing PostgreSQL performance can be a challenging task. Yugabyte is a distributed SQL database that provides the ability to scale and distribute data across multiple regions, making it a great choice for high-availability applications. In this blog post, we will discuss ten tips for optimizing PostgreSQL performance in Yugabyte distributed SQL environments, including best practices for query optimization, index creation, and more.
Define the primary key in the CREATE TABLE rather than ALTER TABLE
When creating a new table, define the primary key in the initial CREATE TABLE statement rather than adding it later with an ALTER TABLE statement. In YugabyteDB, the primary key determines how rows are stored and distributed, so adding it afterwards forces the whole table to be rewritten. Here’s an example of creating a table with a primary key:
CREATE TABLE employees (
id SERIAL PRIMARY KEY,
name TEXT,
age INTEGER
);
This statement creates a new table called employees with columns id, name, and age. The id column is defined as a SERIAL type, which automatically generates a unique value for each new row. The PRIMARY KEY constraint on the id column makes it the primary key for the table.
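For contrast, here is the anti-pattern: when the key is added after the fact, YugabyteDB has to rewrite the entire table, because the rows must be re-stored and redistributed according to the new key (a sketch, reusing the table from the example above):

-- Anti-pattern: the table is created without a key,
-- then rewritten in full when the primary key is added later.
CREATE TABLE employees (
id SERIAL,
name TEXT,
age INTEGER
);
ALTER TABLE employees ADD PRIMARY KEY (id);

On a large table this rewrite can take a long time, which is why defining the key up front is the safer habit.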
HASH sharding distributes data well when columns are used in point queries (not range queries)
Hash sharding is a type of data partitioning where rows are distributed across nodes based on a hash function applied to a specific column in the table. This is useful when the column is frequently used in point queries, which means queries that filter by a specific value rather than a range of values. Here’s an example of creating a hash-partitioned table:
CREATE TABLE sales (
id SERIAL,
product_id INTEGER,
sale_date DATE,
amount NUMERIC,
PRIMARY KEY (product_id HASH, id ASC)
);
This statement creates a new table called sales with columns id, product_id, sale_date, and amount. The product_id and id columns together form the primary key, and the HASH keyword on product_id tells YugabyteDB to distribute rows across tablets by hashing that column. (In YugabyteDB, the first primary-key column is hash-sharded by default, so the HASH keyword here simply makes the intent explicit.)
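A point query on the hashed column, such as the one below, can be routed to the single tablet that owns that hash value, while a range predicate like product_id BETWEEN 1 AND 100 would have to fan out to all tablets (an illustrative query against the sales table above):

-- Point query: the hash of product_id identifies one tablet
SELECT * FROM sales WHERE product_id = 123;

This is why hash sharding suits equality lookups, and range sharding suits range scans.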
EXPLAIN ANALYZE is your way to understand why a query is slow
EXPLAIN ANALYZE is a PostgreSQL command that provides detailed information about the query plan and execution statistics for a given query. This is useful for understanding why a query is slow and how it can be optimized. Here’s an example of using EXPLAIN ANALYZE:
EXPLAIN ANALYZE SELECT * FROM sales WHERE product_id = 123;
This command will execute the query and provide detailed information about the query plan and execution statistics, including the estimated and actual row counts, the time spent on each operation, and the query’s total execution time.
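In YugabyteDB, the DIST option adds distributed-storage statistics to this output, such as the number of read requests sent to the storage layer and their round-trip times, which often explains latency that the plain plan does not show (illustrative; the exact counters vary by version):

EXPLAIN (ANALYZE, DIST) SELECT * FROM sales WHERE product_id = 123;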
Watch for “Rows Removed by Filter” with a high number of rows
When analyzing the EXPLAIN ANALYZE output, it’s important to watch for the “Rows Removed by Filter” statistic, which indicates how many rows were removed from the result set by a filter operation. If this number is high, it may indicate that the query is inefficient or that an index is missing. Here’s an example of the EXPLAIN ANALYZE output with a high number of rows removed by filter:
QUERY PLAN
----------------------------------------------------------
Seq Scan on sales (cost=0.00..1234.56 rows=1000000 width=100) (actual time=0.020..185.000 rows=123 loops=1)
Filter: (product_id = 123)
Rows Removed by Filter: 999877
This output shows that the query scanned 1 million rows but only returned 123, and that 999,877 rows were removed by the filter operation.
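A common fix is to create an index on the filtered column, so the plan becomes an index scan that reads only the matching rows (a sketch; the index name is arbitrary):

CREATE INDEX idx_sales_by_product ON sales (product_id);
-- The same query should now show an Index Scan with no
-- "Rows Removed by Filter" line:
EXPLAIN ANALYZE SELECT * FROM sales WHERE product_id = 123;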
Always enable yb_enable_expression_pushdown
The yb_enable_expression_pushdown configuration parameter enables pushdown of filter expressions from the PostgreSQL layer down to YugabyteDB’s storage layer. This can significantly improve query performance by reducing the number of rows that must be transferred back up to the query layer just to be filtered out. To enable it for your session, run:
set yb_enable_expression_pushdown=on;
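You can confirm the setting and observe its effect on a plan; when pushdown is active, recent YugabyteDB versions report the condition as a remote/storage filter rather than a plain Filter applied in the PostgreSQL layer (illustrative session; the exact plan labels vary by version):

SHOW yb_enable_expression_pushdown;
EXPLAIN SELECT * FROM sales WHERE product_id = 123;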
Covering indexes are important in Distributed SQL (to avoid extra RPCs)
In a distributed SQL environment, where queries may need to access data from multiple nodes, covering indexes can be very useful. A covering index is an index that includes all the columns needed to satisfy a query, so that the query can be satisfied entirely from the index without having to read any data from the table. This can significantly reduce the amount of network traffic and disk I/O needed to execute a query, which is especially important in a distributed system.
Here’s an example of creating a covering index:
CREATE INDEX idx_sales_product_id ON sales (product_id) INCLUDE (sale_date, amount);
This statement creates a new index called idx_sales_product_id on the product_id column of the sales table, and includes the sale_date and amount columns in the index as well. This makes the index a covering index for queries that only need to access these three columns.
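A query that touches only the indexed and included columns can then be answered with an Index Only Scan, with no second round trip to the table (illustrative; run EXPLAIN to verify the plan on your data):

EXPLAIN SELECT sale_date, amount FROM sales WHERE product_id = 123;
-- Expected plan node: Index Only Scan using idx_sales_product_id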
Don’t SELECT * if you don’t need all the columns
When querying a table, it’s best to only select the columns that are needed for the query rather than using SELECT *
to select all columns. This can help to reduce the amount of data that needs to be transmitted over the network and processed by the query engine. Here’s an example of a more selective SELECT statement:
SELECT id, name FROM employees WHERE age > 30;
This statement selects only the id and name columns from the employees table, and only returns rows where the age column is greater than 30.
Use Partial Index if you don’t need the index for all the rows
If an index is only needed for a subset of the rows in a table, it’s more efficient to create a partial index rather than a full index. A partial index is an index that includes only the rows that meet a certain condition, which can reduce the size of the index and improve query performance. Here’s an example of creating a partial index:
CREATE INDEX idx_sales_recent ON sales (sale_date) WHERE sale_date > '2022-01-01';
This statement creates a new index called idx_sales_recent on the sale_date column of the sales table, but includes only the rows where sale_date is after January 1, 2022.
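The planner will consider the partial index only when the query’s predicate implies the index’s WHERE clause, for example (illustrative queries):

-- Can use idx_sales_recent: the predicate implies sale_date > '2022-01-01'
SELECT sale_date, amount FROM sales WHERE sale_date > '2022-06-01';
-- Cannot use it: rows before 2022 are not in the index
SELECT sale_date, amount FROM sales WHERE sale_date > '2021-06-01';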
Check the join order in the execution plan, provide the right indexes to filter early
When executing a query that involves multiple tables, it’s important to check the join order in the execution plan to ensure that the tables are joined in the most efficient order. Additionally, providing the right indexes can help to filter rows early in the query execution and reduce the amount of data that needs to be processed. Here’s an example of using an index to filter rows early in a query:
SELECT *
FROM sales
JOIN products ON sales.product_id = products.id
WHERE products.name = 'Widget';
In this example, the query joins the sales and products tables on the product_id and id columns, respectively. If the sales table has many rows, it can be more efficient to filter rows early by adding an index on the product_id column of the sales table.
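For the query above, an index on products (name) lets the planner find the matching product first, and an index on sales (product_id), such as the covering index from the earlier tip, lets it join into sales without a full scan (a sketch; the index name is arbitrary):

-- Filter products by name early, then nested-loop into sales
CREATE INDEX idx_products_name ON products (name);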
For complex queries, pg_hint_plan can be used to control the execution plan
For complex queries that involve many tables and conditions, it can be difficult to manually optimize the execution plan. In these cases, the pg_hint_plan extension for PostgreSQL can be used to provide hints to the query planner about how to optimize the query. Here’s an example of using pg_hint_plan:
-- Load the pg_hint_plan extension
CREATE EXTENSION IF NOT EXISTS pg_hint_plan;
-- Add a hint in a comment at the start of the query
/*+ NestLoop(sales products) */
SELECT *
FROM sales
JOIN products ON sales.product_id = products.id
WHERE products.name = 'Widget';
In this example, the pg_hint_plan extension is loaded, and a hint is added using the /*+ hint */ comment syntax; note that pg_hint_plan only reads hints from a comment placed at the start of the statement. The NestLoop(sales products) hint tells the query planner to use a nested loop join between the sales and products tables. Without the hint, the query planner may choose a different join method that is less efficient.
Summary
In summary, optimizing PostgreSQL performance in Yugabyte distributed SQL environments requires a deep understanding of the database and the underlying infrastructure. By following the tips outlined in this blog post, you can improve query performance, reduce latency, and increase scalability. These tips include creating the right indexes, avoiding unnecessary data retrieval, using partial indexes, and optimizing join order, among others. Additionally, by using tools like pg_hint_plan, you can optimize complex queries and improve the overall performance of PostgreSQL in Yugabyte. By implementing these best practices, you can ensure that your applications are running efficiently and effectively in a distributed SQL environment.