SQL Query Tuning and Performance
SQL queries can be fast and highly efficient, but they can also be slow and demand excessive CPU and memory. For many SQL programmers, occasional bouts with long-running queries and poor performance are simply par for the course. But by gaining a better understanding of how databases translate SQL queries into execution plans, you can take steps to avoid these issues. This article shows developers how to analyze query execution plans and use data modeling strategies to boost query performance. It describes how SQL queries are executed, highlights different types of indexes and how they factor into query tuning, and covers several methods for performing joins.
How is a Query Executed?
SQL uses execution plans
- SQL queries are declarative in nature
- Query processors translate these into procedural plans — a set of steps that scan, filter, and join data.
Efficient execution plans
- Scanning a table of a billion rows to find 10 rows with a specific column value is inefficient.
- Using an index to find exactly where those rows live is efficient.
In this article you will learn —
- Steps of a query plan
- Trade-offs in ways to implement query plans
- Techniques that help derive efficient query plans
Scanning Tables
- Scanning is a linear operation — moving from one row to the next and performing some operation like applying a filter to the row data.
- Fetch data block containing row
- Apply filter or condition
The time taken to complete a scan is proportional to the number of rows in the table
- Scanning small tables is efficient
- Scanning large tables repeatedly is inefficient
Indexing Reduces Full Table Scans
- Indexes are ordered subsets of data
- It is faster to search an index for an attribute value than to scan the whole table.
Types of Indexes
- B-tree, for equality and range queries.
- Hash indexes, for equality.
- Bitmap, efficient ways for performing set operations — like intersection and union.
- PostgreSQL provides specialized indexes, for geo-spatial or user-defined indexing strategies.
Execution plan for Table Joins
Joins can be procedurally executed in the following ways —
Nested Loop Join — Compare all rows in both tables to each other
- simple to implement
- can be expensive
Hash join — Calculate hash value of each key and join based on matching hash values.
Sort Merge Join — Sort both tables and then join rows while taking advantage of order.
- sort both tables
- compare rows as in a nested loop join, but stop scanning as soon as the sort order guarantees no later match is possible.
Partitioning Data
- Storing a table in multiple sub-tables, known as partitions
- Used to improve query, load and delete operations
- Usually used for large tables: when a subset of data is accessed or changed, only the partitions holding that data need to be touched.
Large Tables = Large Indexes
- With a hash index, rows can be accessed in constant time.
- With a B-tree index, the depth of the index grows at a logarithmic rate as the table grows.
Can we do better?
- Break the large tables into smaller partitions
- Access a particular partition and scan only the required partition.
- A partition key determines which partition holds a given row.
- The partition key can be based on time: partitioned by month, data generated in a month is stored in its corresponding partition.
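The monthly routing described above can be sketched in Python — a hypothetical router that derives a partition name from a timestamp partition key (the naming scheme is illustrative, not PostgreSQL syntax):

```python
from datetime import datetime

def partition_for(event_time: datetime) -> str:
    """Derive the monthly partition name from the partition key (a timestamp)."""
    return f"events_{event_time.year}_{event_time.month:02d}"

# Rows generated in the same month land in the same partition,
# so a query for one month only needs to touch one sub-table.
print(partition_for(datetime(2023, 7, 14, 9, 30)))  # events_2023_07
```

A query filtered on the partition key can then be answered by scanning only the partition the key maps to.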
Local and Global Indexes along with partitions
- Local indexes improve access time to a row within a particular partition
- Global indexes speed up determining which partition or set of partitions holds the data being queried or manipulated — helpful when you need to filter on something other than the partition key.
Optimizing Queries
- First step is to understand how a query is executed
- One way to find out is to use the EXPLAIN command: `EXPLAIN SELECT * FROM <table_name>;`
Query Plan — Seq Scan on <table_name> (cost=0.00..24.00 rows=1000 width=75)
- cost is the estimated computation required for the step — 0.00 is the startup cost before the first row is produced, and 24.00 is the total cost to complete the step.
- To measure the time actually taken to plan and execute the query, add the ANALYZE option: `EXPLAIN ANALYZE SELECT * FROM <table_name>;`
Query Plan — Seq Scan on <table_name> (cost=0.00..24.00 rows=1000 width=75)
Planning Time: 0.361 ms
Execution Time: 0.248 ms
Creating an Index can reduce the query execution time.
- Consider a staff table which contains a salary column.
`CREATE INDEX idx_staff_salary ON staff(salary);` — creates an index on the salary column.
- Let's execute a query with a WHERE clause and see whether it warrants use of the above index: `EXPLAIN SELECT * FROM staff WHERE salary > 75000;`
- The index is still not used! The plan shows a full table scan — why was no index used?
- WHERE salary > 75000 is not selective enough to warrant the index. So many rows had a salary above 75000 that the query planner determined scanning the entire table would be cheaper than looking rows up through the index.
- Let's try a more selective predicate:
`EXPLAIN ANALYZE SELECT * FROM staff WHERE salary > 150000;`
- Consider an example data model
Non-trivial purpose of indexes
- Help enforce constraints — with a unique constraint on a column, adding a new row only requires checking whether the value is already in the index. This is faster than scanning the table for duplicates.
- Indexes are ordered
- Typically smaller than tables, so they fit into memory better. That's great news for querying, because reading from an HDD or SSD is far more expensive than reading from memory.
Types of Indexes
Balanced-trees
- Most commonly used type of index
- Used when a column has a large number of possible values — high cardinality.
- B-trees rebalance as values are added, keeping the depth roughly equal on each side. This prevents a lopsided tree that is fast to search on one side and slow on the other.
- Tree data structure with a root node and child nodes
- The tree is balanced because the root node holds the index value that splits the range of values in the indexed column.
- For example, if an indexed column held values from 1–100, the root would be 50.
- Each node splits the values of the indexed column: values less than the node's value go to the left subtree, values greater go to the right.
- In the below example we need to find the node with a value 15
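A minimal sketch of such a lookup in Python — a plain binary search tree, ignoring B-tree node fan-out and rebalancing, with illustrative values chosen so the tree matches the 1–100 example above:

```python
def search(node, target):
    """Walk down the tree: go left for smaller values, right for larger."""
    while node is not None:
        if target == node["value"]:
            return node
        node = node["left"] if target < node["value"] else node["right"]
    return None

# Tiny tree over values 1-100 with 50 at the root.
leaf = lambda v: {"value": v, "left": None, "right": None}
root = {"value": 50,
        "left": {"value": 15, "left": leaf(7), "right": leaf(30)},
        "right": leaf(75)}
print(search(root, 15)["value"])  # found in two comparisons instead of scanning every value
```

Each comparison halves the remaining search space, which is where the logarithmic depth mentioned earlier comes from.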
B+Trees
- Consider a B+Tree, which represents indexes for the rows with IDs — 10,20,40,50,60,70,80,30,35
B+Trees are balanced m-way search trees with the following properties ( m represents the degree of the tree )
- Each internal node can have between ceil(m/2) and m children.
- The root node can have a minimum of 2 children
- The leaf nodes have pointers to their siblings
- The leaf nodes together contain all the keys; keys held in internal nodes reappear in the leaves
- Each level in a B+Tree above the leaves is a sparse index
- All leaf nodes form the dense index and are at the same level
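Because all keys sit in sorted, sibling-linked leaves, a range query can find the first qualifying leaf and then walk sideways. A sketch of that leaf-level walk in Python, using the IDs from the example above (the grouping of keys into leaves is illustrative):

```python
from bisect import bisect_left

# Leaf nodes as sorted key lists; list order stands in for the sibling pointers.
leaves = [[10, 20, 30], [35, 40, 50], [60, 70, 80]]

def range_scan(lo, hi):
    """Skip leaves entirely below lo, then walk keys until one exceeds hi."""
    out = []
    for leaf in leaves:                 # stands in for descending the sparse index
        if leaf[-1] < lo:
            continue
        for key in leaf[bisect_left(leaf, lo):]:
            if key > hi:
                return out
            out.append(key)
    return out

print(range_scan(25, 60))  # [30, 35, 40, 50, 60]
```

This is why B-tree-family indexes handle range predicates well, while hash indexes (below) cannot.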
Consider the below Query -
- Let’s create a B-Tree index on the email column
Bitmap Indexes — when columns have few distinct values ( low cardinality )
- Stores one series of bits (a bitmap) per distinct value, with one bit per row; the number of bitmaps equals the number of distinct values in the column.
- Boolean operations are performed quickly — ( is_union_member = ‘Yes’ ) AND ( pay_type = ‘Hourly’ )
- PostgreSQL builds bitmap indexes on the fly as needed.
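A sketch of why the boolean operations above are cheap, using Python integers as bit arrays — one bitmap per distinct value, one bit per row (column values are hypothetical):

```python
# Hypothetical rows: (is_union_member, pay_type)
rows = [("Yes", "Hourly"), ("No", "Hourly"), ("Yes", "Salaried"), ("Yes", "Hourly")]

def bitmap(values, match):
    """One bit per row: bit i is set when row i holds the matching value."""
    bits = 0
    for i, v in enumerate(values):
        if v == match:
            bits |= 1 << i
    return bits

union_member = bitmap([r[0] for r in rows], "Yes")     # 0b1101 (rows 0, 2, 3)
hourly = bitmap([r[1] for r in rows], "Hourly")        # 0b1011 (rows 0, 1, 3)
both = union_member & hourly   # one AND over whole machine words at a time
matching_rows = [i for i in range(len(rows)) if both >> i & 1]
print(matching_rows)  # [0, 3]
```

The AND combines many rows per CPU instruction, which is why bitmaps excel at intersections and unions over low-cardinality columns.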
- Consider the following query —
Hash indexes — used when we need to look up a value in key-value form
- Basis is a hash function — maps arbitrary length data to a fixed size string.
- Even slight changes in input produce a different hash value.
- Hash functions are not order-preserving
- Hash functions are used for equality comparisons, but not for ranges of values.
- Smaller than B-tree indexes but just as fast. The smaller size is an advantage, since the index is more likely to fit in memory
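A sketch of a hash index as a Python dict mapping a key to row locations — constant-time equality lookups, no range support (the email values and row IDs are hypothetical):

```python
# Hypothetical rows: (row_id, email)
rows = [(1, "ada@example.com"), (2, "alan@example.com"), (3, "grace@example.com")]

# Build: hash each key once and store the row location under it.
hash_index = {}
for row_id, email in rows:
    hash_index.setdefault(email, []).append(row_id)

print(hash_index["alan@example.com"])  # equality lookup in O(1): [2]
# A range predicate like email > 'b' cannot use this structure:
# hashing preserves no ordering, so every bucket would have to be checked.
```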
PostgreSQL specific Indexes
GiST
- Generalized search tree — a framework for implementing other types of indexes
SP-GiST — Space-partitioned GiST
- Supports partitioned search trees — used for non-balanced data structures such as quadtrees and radix trees
BRIN
- Divides data into ordered blocks
- Keeps min and max values
- Searches only blocks that may have a match
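A sketch of BRIN's pruning in Python: keep only a (min, max) summary per block of rows, and read a block only when the search value could fall inside its range (the block layout is illustrative):

```python
# Each block summarises a contiguous run of rows by its min and max value.
blocks = [
    {"min": 1,   "max": 100, "rows": [5, 40, 99]},
    {"min": 101, "max": 200, "rows": [150, 180]},
    {"min": 201, "max": 300, "rows": [250]},
]

def find(value):
    """Skip any block whose [min, max] range cannot contain the value."""
    hits, blocks_scanned = [], 0
    for b in blocks:
        if b["min"] <= value <= b["max"]:
            blocks_scanned += 1
            hits += [r for r in b["rows"] if r == value]
    return hits, blocks_scanned

print(find(180))  # ([180], 1) -- only one of three blocks was read
```

The summary is tiny compared to a B-tree, which is why BRIN suits very large tables whose values correlate with physical row order.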
Joins
Declarative Joins statements → Procedural plans
Nested Loop Joins
- Naive approach — Outer loop iterates over one table, the driver table
- Inner loop iterates over other table, the join table
- In general, this method has low performance, and it worsens when the tables do not fit in memory.
- In PostgreSQL, forcing a nested loop join involves the following settings:
`set enable_nestloop=true;`
`set enable_hashjoin=false;`
`set enable_mergejoin=false;`
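The naive algorithm above, sketched in Python with hypothetical staff and company_regions rows:

```python
staff = [("Alice", 1), ("Bob", 2), ("Cara", 1)]          # (name, region_id)
company_regions = [(1, "Northwest"), (2, "Southwest")]   # (region_id, region_name)

def nested_loop_join(driver, join_table):
    """Compare every driver row with every join-table row: O(n * m) comparisons."""
    for name, region_id in driver:           # outer loop: driver table
        for rid, region_name in join_table:  # inner loop: join table
            if region_id == rid:
                yield (name, region_name)

print(list(nested_loop_join(staff, company_regions)))
# [('Alice', 'Northwest'), ('Bob', 'Southwest'), ('Cara', 'Northwest')]
```

Every driver row triggers a full pass over the join table, which is exactly the n × m cost that makes this plan expensive on large inputs.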
Hash Joins
- Function for mapping arbitrary length data to a value that can act as an index into an array — f(x) = y
- Build a hash table — the smaller of the two tables involved in the join is used
Probe Hash Table
- Step through large table
- Compute the hash value of the primary or foreign key value
- Look up corresponding value in hash table
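The build and probe phases sketched in Python, reusing the hypothetical staff and company_regions tables from the nested loop example:

```python
staff = [("Alice", 1), ("Bob", 2), ("Cara", 1)]          # (name, region_id)
company_regions = [(1, "Northwest"), (2, "Southwest")]   # (region_id, region_name)

def hash_join(small, large):
    # Build phase: hash the smaller table once into an in-memory table.
    table = {rid: region_name for rid, region_name in small}
    # Probe phase: one hash lookup per row of the larger table.
    for name, region_id in large:
        if region_id in table:
            yield (name, table[region_id])

print(list(hash_join(company_regions, staff)))
# [('Alice', 'Northwest'), ('Bob', 'Southwest'), ('Cara', 'Northwest')]
```

Each table is read once, so the cost is roughly linear in the two table sizes instead of their product — provided the hash table fits in memory.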
Example — Let's run a query while forcing the use of a hash join
- Row 4 indicates that the hash table is built on the company_regions table
- Row 3 indicates a sequential scan on the staff table, computing the hash value that is compared in row 2.
- The total cost of this join is just under 24 computational units
Merge Joins — Sort Merge Joins
- Start by sorting the input tables.
- Takes advantage of the ordering to reduce the number of rows checked
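The sort-then-merge algorithm sketched in Python with the same hypothetical tables; for simplicity this version assumes the join keys on the right side are unique:

```python
staff = [("Alice", 1), ("Bob", 2), ("Cara", 1)]          # (name, region_id)
company_regions = [(2, "Southwest"), (1, "Northwest")]   # (region_id, region_name)

def merge_join(left, right):
    # Sort both inputs on the join key first.
    left = sorted(left, key=lambda r: r[1])
    right = sorted(right, key=lambda r: r[0])
    out, j = [], 0
    for name, key in left:
        while j < len(right) and right[j][0] < key:
            j += 1  # keys behind the cursor can never match again, so skip forever
        if j < len(right) and right[j][0] == key:
            out.append((name, right[j][1]))
    return out

print(merge_join(staff, company_regions))
# [('Alice', 'Northwest'), ('Cara', 'Northwest'), ('Bob', 'Southwest')]
```

After sorting, each input is traversed once: the cursors only move forward, because sort order guarantees no match exists behind them.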
Example — Let's consider an example for merge joins
We'll consider the staff table and the company_regions table and force the query planner to use a merge join by setting these directives:
`set enable_nestloop=false;`
`set enable_hashjoin=false;`
`set enable_mergejoin=true;`
- Rows 3–6 show the sorting of the staff table, while rows 7–8 show the sorting of the company_regions table. Rows 1 and 2 show the merge part of the join, using the condition s.region_id = cr.region_id to match rows
- The cost of this merge join ( 115 computational units ) is significantly higher than the hash join ( 24 computational units )
- The query planner chooses the optimal plan automatically
Partitioning
Horizontal Partitioning
- Split tables by rows into partitions
- Each partition is treated as a table
Benefits
- We can limit scans to a small number of partitions using statistics on the data being queried
- Indexes for each partition are smaller, which leads to more efficient query execution
- DML queries are more efficient and faster. Bulk data operations benefit because partitions are small and entire partitions can be dropped at once.
Vertical Partitioning
- Vertical partitioning separates the columns of a large table into multiple tables.
- During data modelling, implement multiple tables and assign different column groups to each.
- Keep frequently queried columns together. This also requires that the same primary key is used across partitions.
Benefits
- More rows fit in a single data block, because each row is narrower.
- Indexes can be built per partition
- Because columns are separated and logically grouped, less data is read for each query, which can reduce I/O
- Columnar data storage strategies provide similar benefits — refer to Aerospike's columnar data paradigms for more information.
Example for Horizontal Partitioning — Internet of Things use case ( Range Partitioning )
- Partitioning helps for time based reports
- Comparative queries, for example, same time last year
List Partitioning
Used when data logically groups into subgroups and queries are often made within one subgroup, but the data is not time-based enough to warrant range partitioning by time.
- Partition Key — Determines which partition is used for the data ( prod_category )
- Partition Bounds — The list of values belonging to a partition
- Constraints — Each partition can have its own indexes, constraints and defaults
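A sketch of list-partition routing in Python: each partition's bounds are an explicit list of prod_category values (partition names and categories are hypothetical):

```python
# Partition bounds: each partition owns an explicit list of partition-key values.
partition_bounds = {
    "products_electronics": ["Laptop", "Phone"],
    "products_grocery": ["Produce", "Dairy"],
}

def route(prod_category):
    """Return the partition whose bounds contain the partition key value."""
    for partition, bounds in partition_bounds.items():
        if prod_category in bounds:
            return partition
    raise ValueError(f"no partition accepts {prod_category!r}")

print(route("Dairy"))  # products_grocery
```

A query filtered on prod_category can then be answered from a single partition, and that partition's local indexes and constraints apply only to its own values.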