SQL Query Tuning and Performance
SQL queries can be fast and highly efficient, but they can also be slow and demand excessive CPU and memory. For many SQL programmers, occasional bouts with long-running queries and poor performance are simply par for the course. But by gaining a better understanding of how databases translate SQL queries into execution plans, you can take steps to avoid these issues. This article shows developers how to analyze query execution plans and use data modeling strategies to boost query performance. It describes how SQL queries are executed, highlights different types of indexes and how they factor into query tuning, and covers several methods for performing joins.
How is a Query Executed?
SQL uses execution plans
- SQL queries are declarative in nature
- Query processors translate these into procedural plans — a set of steps that scan, filter, and join data.
Efficient execution plans
- Scanning a table of a billion rows to find 10 rows with a specific column value is inefficient.
- Using an index to find exactly where those rows live is efficient.
In this article you will learn —
- Steps of a query plan
- Trade-offs in ways to implement query plans
- Techniques that help derive efficient query plans
Scanning Tables
- Scanning is a linear operation — moving from one row to the next and performing some operation like applying a filter to the row data.
- Fetch data block containing row
- Apply filter or condition
The time taken to complete a scan is proportional to the number of rows in the table
- Scanning small tables is efficient
- Scanning large tables repeatedly is inefficient
Indexing Reduces Full Table Scans
- Indexes are ordered subsets of data
- It is faster to search an index for an attribute value than to scan the whole table.
Types of Indexes
- B-tree, for equality and range queries.
- Hash indexes, for equality.
- Bitmap, efficient ways for performing set operations — like intersection and union.
- PostgreSQL provides specialized indexes, for geo-spatial or user-defined indexing strategies.
Execution plan for Table Joins
Joins can be procedurally executed in the following ways —
Nested Loop Join — Compare all rows in both tables to each other
- simple to implement
- can be expensive
Hash join — Calculate hash value of each key and join based on matching hash values.
Sort Merge Join — Sort both tables and then join rows while taking advantage of order.
- sort both tables
- compare rows as in a nested loop join, but stop scanning as soon as the sort order guarantees no later match is possible.
Partitioning Data
- Storing a table in multiple sub-tables, known as partitions
- Used to improve query, load and delete operations
- Usually used for large tables: when a subset of data is accessed or changed, only the partitions holding that data need to be touched.
Large Tables = Large Indexes
- With a hash index, rows can be accessed in constant time.
- With a B-tree index, the depth of the index grows at a logarithmic rate as the table grows.
Can we do better?
- Break the large tables into smaller partitions
- Access a particular partition and scan only the required partition.
- A partition key determines which partition holds a given row.
- The partition key can be based on time: partitioned by month, data generated in a month is stored in its corresponding partition.
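The monthly routing described above can be sketched in Python — a hypothetical router that derives a partition name from a timestamp partition key (the naming scheme is illustrative, not PostgreSQL syntax):

```python
from datetime import datetime

def partition_for(event_time: datetime) -> str:
    """Derive the monthly partition name from the partition key (a timestamp)."""
    return f"events_{event_time.year}_{event_time.month:02d}"

# Rows generated in the same month land in the same partition,
# so a query for one month only needs to touch one sub-table.
print(partition_for(datetime(2023, 7, 14, 9, 30)))  # events_2023_07
```

A query filtered on the partition key can then be answered by scanning only the partition the key maps to.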
Local and Global Indexes along with partitions
- Local indexes improve access time to a row within a particular partition
- Global indexes speed up determining which partition or set of partitions holds the data being queried or manipulated — helpful when you need to filter on something other than the partition key.
Optimizing Queries
- First step is to understand how a query is executed
- One way to find out is to use the EXPLAIN command: `EXPLAIN SELECT * FROM <table_name>;`
Query Plan — Seq Scan on <table_name> (cost=0.00..24.00 rows=1000 width=75)
- cost is the estimated computation required for the step — 0.00 is the startup cost before the first row is produced, and 24.00 is the total cost to complete the step.
- To measure the time actually taken to plan and execute the query, add the ANALYZE option: `EXPLAIN ANALYZE SELECT * FROM <table_name>;`
Query Plan — Seq Scan on <table_name> (cost=0.00..24.00 rows=1000 width=75)
Planning Time: 0.361 ms
Execution Time: 0.248 ms
Creating an Index can reduce the query execution time.
- Consider a staff table which contains a salary column.
`CREATE INDEX idx_staff_salary ON staff(salary);` — creates an index on the salary column.
- Let's execute a query with a WHERE clause and see whether it warrants use of the above index: `EXPLAIN SELECT * FROM staff WHERE salary > 75000;`
- The index is still not used! The plan shows a full table scan — why was no index used?
- WHERE salary > 75000 is not selective enough to warrant the index. So many rows had a salary above 75000 that the query planner determined scanning the entire table would be cheaper than looking rows up through the index.
- Let's try a more selective predicate:
`EXPLAIN ANALYZE SELECT * FROM staff WHERE salary > 150000;`
- Consider an example data model
Non-trivial purpose of indexes
- Help enforce constraints — with a unique constraint on a column, adding a new row only requires checking whether the value is already in the index. This is faster than scanning the table for duplicates.
- Indexes are ordered
- Typically smaller than tables, so they fit into memory better. That's great news for querying, because reading from an HDD or SSD is far more expensive than reading from memory.
Types of Indexes
Balanced-trees
- Most commonly used type of index
- Used when a column has a large number of possible values — high cardinality.
- B-trees rebalance as values are added, keeping the depth roughly equal on each side. This prevents a lopsided tree that is fast to search on one side and slow on the other.
- Tree data structure with a root node and child nodes
- The tree is balanced because the root node holds the index value that splits the range of values in the indexed column.
- For example, if an indexed column held values from 1–100, the root would be 50.
- Each node splits the values of the indexed column: values less than the node's value go to the left subtree, values greater go to the right.
- In the below example we need to find the node with a value 15
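A minimal sketch of such a lookup in Python — a plain binary search tree, ignoring B-tree node fan-out and rebalancing, with illustrative values chosen so the tree matches the 1–100 example above:

```python
def search(node, target):
    """Walk down the tree: go left for smaller values, right for larger."""
    while node is not None:
        if target == node["value"]:
            return node
        node = node["left"] if target < node["value"] else node["right"]
    return None

# Tiny tree over values 1-100 with 50 at the root.
leaf = lambda v: {"value": v, "left": None, "right": None}
root = {"value": 50,
        "left": {"value": 15, "left": leaf(7), "right": leaf(30)},
        "right": leaf(75)}
print(search(root, 15)["value"])  # found in two comparisons instead of scanning every value
```

Each comparison halves the remaining search space, which is where the logarithmic depth mentioned earlier comes from.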
B+Trees
- Consider a B+Tree, which represents indexes for the rows with IDs — 10,20,40,50,60,70,80,30,35
B+Trees are balanced m-way search trees with the following properties ( m represents the degree of the tree )
- Each internal node can have between ceil(m/2) and m children.
- The root node can have a minimum of 2 children
- The leaf nodes have pointers to their siblings
- The leaf nodes together contain all the keys; keys held in internal nodes reappear in the leaves
- Each level in a B+Tree above the leaves is a sparse index
- All leaf nodes form the dense index and are at the same level
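Because all keys sit in sorted, sibling-linked leaves, a range query can find the first qualifying leaf and then walk sideways. A sketch of that leaf-level walk in Python, using the IDs from the example above (the grouping of keys into leaves is illustrative):

```python
from bisect import bisect_left

# Leaf nodes as sorted key lists; list order stands in for the sibling pointers.
leaves = [[10, 20, 30], [35, 40, 50], [60, 70, 80]]

def range_scan(lo, hi):
    """Skip leaves entirely below lo, then walk keys until one exceeds hi."""
    out = []
    for leaf in leaves:                 # stands in for descending the sparse index
        if leaf[-1] < lo:
            continue
        for key in leaf[bisect_left(leaf, lo):]:
            if key > hi:
                return out
            out.append(key)
    return out

print(range_scan(25, 60))  # [30, 35, 40, 50, 60]
```

This is why B-tree-family indexes handle range predicates well, while hash indexes (below) cannot.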
Consider the below Query -
- Let’s create a B-Tree index on the email column
Bitmap Indexes — when columns have few distinct values ( low cardinality )
- Stores one series of bits (a bitmap) per distinct value, with one bit per row; the number of bitmaps equals the number of distinct values in the column.
- Boolean operations are performed quickly — ( is_union_member = ‘Yes’ ) AND ( pay_type = ‘Hourly’ )
- PostgreSQL builds bitmap indexes on the fly as needed.
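A sketch of why the boolean operations above are cheap, using Python integers as bit arrays — one bitmap per distinct value, one bit per row (column values are hypothetical):

```python
# Hypothetical rows: (is_union_member, pay_type)
rows = [("Yes", "Hourly"), ("No", "Hourly"), ("Yes", "Salaried"), ("Yes", "Hourly")]

def bitmap(values, match):
    """One bit per row: bit i is set when row i holds the matching value."""
    bits = 0
    for i, v in enumerate(values):
        if v == match:
            bits |= 1 << i
    return bits

union_member = bitmap([r[0] for r in rows], "Yes")     # 0b1101 (rows 0, 2, 3)
hourly = bitmap([r[1] for r in rows], "Hourly")        # 0b1011 (rows 0, 1, 3)
both = union_member & hourly   # one AND over whole machine words at a time
matching_rows = [i for i in range(len(rows)) if both >> i & 1]
print(matching_rows)  # [0, 3]
```

The AND combines many rows per CPU instruction, which is why bitmaps excel at intersections and unions over low-cardinality columns.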
- Consider the following query —
Hash indexes — used when we need to look up a value in key-value form
- Basis is a hash function — maps arbitrary length data to a fixed size string.
- Even slight changes in input produce a different hash value.
- Hash functions are not order-preserving
- Hash functions are used for equality comparisons, but not for ranges of values.
- Smaller than B-tree indexes but just as fast. The smaller size is an advantage, since the index is more likely to fit in memory
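A sketch of a hash index as a Python dict mapping a key to row locations — constant-time equality lookups, no range support (the email values and row IDs are hypothetical):

```python
# Hypothetical rows: (row_id, email)
rows = [(1, "ada@example.com"), (2, "alan@example.com"), (3, "grace@example.com")]

# Build: hash each key once and store the row location under it.
hash_index = {}
for row_id, email in rows:
    hash_index.setdefault(email, []).append(row_id)

print(hash_index["alan@example.com"])  # equality lookup in O(1): [2]
# A range predicate like email > 'b' cannot use this structure:
# hashing preserves no ordering, so every bucket would have to be checked.
```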
PostgreSQL specific Indexes
GiST
- Generalized search tree — a framework for implementing other types of indexes
SP-GiST — Space-partitioned GiST
- Supports partitioned search trees — used for non-balanced data structures such as quadtrees and radix trees
BRIN
- Divides data into ordered blocks
- Keeps min and max values
- Searches only blocks that may have a match
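A sketch of BRIN's pruning in Python: keep only a (min, max) summary per block of rows, and read a block only when the search value could fall inside its range (the block layout is illustrative):

```python
# Each block summarises a contiguous run of rows by its min and max value.
blocks = [
    {"min": 1,   "max": 100, "rows": [5, 40, 99]},
    {"min": 101, "max": 200, "rows": [150, 180]},
    {"min": 201, "max": 300, "rows": [250]},
]

def find(value):
    """Skip any block whose [min, max] range cannot contain the value."""
    hits, blocks_scanned = [], 0
    for b in blocks:
        if b["min"] <= value <= b["max"]:
            blocks_scanned += 1
            hits += [r for r in b["rows"] if r == value]
    return hits, blocks_scanned

print(find(180))  # ([180], 1) -- only one of three blocks was read
```

The summary is tiny compared to a B-tree, which is why BRIN suits very large tables whose values correlate with physical row order.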
Joins
Declarative Joins statements → Procedural plans
Nested Loop Joins
- Naive approach — Outer loop iterates over one table, the driver table
- Inner loop iterates over other table, the join table
- In general, this method has low performance, and it worsens when the tables do not fit in memory.
- In PostgreSQL, forcing a nested loop join involves the following settings:
`set enable_nestloop=true;`
`set enable_hashjoin=false;`
`set enable_mergejoin=false;`
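The naive algorithm above, sketched in Python with hypothetical staff and company_regions rows:

```python
staff = [("Alice", 1), ("Bob", 2), ("Cara", 1)]          # (name, region_id)
company_regions = [(1, "Northwest"), (2, "Southwest")]   # (region_id, region_name)

def nested_loop_join(driver, join_table):
    """Compare every driver row with every join-table row: O(n * m) comparisons."""
    for name, region_id in driver:           # outer loop: driver table
        for rid, region_name in join_table:  # inner loop: join table
            if region_id == rid:
                yield (name, region_name)

print(list(nested_loop_join(staff, company_regions)))
# [('Alice', 'Northwest'), ('Bob', 'Southwest'), ('Cara', 'Northwest')]
```

Every driver row triggers a full pass over the join table, which is exactly the n × m cost that makes this plan expensive on large inputs.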
Hash Joins
- Function for mapping arbitrary length data to a value that can act as an index into an array — f(x) = y
- Build a hash table — the smaller of the two tables involved in the join is used
Probe Hash Table
- Step through large table
- Compute the hash value of the primary or foreign key value
- Look up corresponding value in hash table
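The build and probe phases sketched in Python, reusing the hypothetical staff and company_regions tables from the nested loop example:

```python
staff = [("Alice", 1), ("Bob", 2), ("Cara", 1)]          # (name, region_id)
company_regions = [(1, "Northwest"), (2, "Southwest")]   # (region_id, region_name)

def hash_join(small, large):
    # Build phase: hash the smaller table once into an in-memory table.
    table = {rid: region_name for rid, region_name in small}
    # Probe phase: one hash lookup per row of the larger table.
    for name, region_id in large:
        if region_id in table:
            yield (name, table[region_id])

print(list(hash_join(company_regions, staff)))
# [('Alice', 'Northwest'), ('Bob', 'Southwest'), ('Cara', 'Northwest')]
```

Each table is read once, so the cost is roughly linear in the two table sizes instead of their product — provided the hash table fits in memory.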
Example — Let's run a query while forcing the use of a hash join
- Row 4 indicates that the hash table is built on the company_regions table
- Row 3 indicates a sequential scan on the staff table, computing the hash value that is compared in row 2.
- The total cost of this join is just under 24 computational units
Merge Joins — Sort Merge Joins
- Start by sorting the input tables.
- Takes advantage of the ordering to reduce the number of rows checked
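The sort-then-merge algorithm sketched in Python with the same hypothetical tables; for simplicity this version assumes the join keys on the right side are unique:

```python
staff = [("Alice", 1), ("Bob", 2), ("Cara", 1)]          # (name, region_id)
company_regions = [(2, "Southwest"), (1, "Northwest")]   # (region_id, region_name)

def merge_join(left, right):
    # Sort both inputs on the join key first.
    left = sorted(left, key=lambda r: r[1])
    right = sorted(right, key=lambda r: r[0])
    out, j = [], 0
    for name, key in left:
        while j < len(right) and right[j][0] < key:
            j += 1  # keys behind the cursor can never match again, so skip forever
        if j < len(right) and right[j][0] == key:
            out.append((name, right[j][1]))
    return out

print(merge_join(staff, company_regions))
# [('Alice', 'Northwest'), ('Cara', 'Northwest'), ('Bob', 'Southwest')]
```

After sorting, each input is traversed once: the cursors only move forward, because sort order guarantees no match exists behind them.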
Example — Let's consider an example for merge joins
We'll consider the staff table and the company_regions table and force the query planner to use a merge join by setting these directives:
`set enable_nestloop=false;`
`set enable_hashjoin=false;`
`set enable_mergejoin=true;`
- Rows 3–6 show the sorting of the staff table, while rows 7–8 show the sorting of the company_regions table. Rows 1 and 2 show the merge part of the join, using the condition s.region_id = cr.region_id to match rows
- The cost of this merge join ( 115 computational units ) is significantly higher than the hash join ( 24 computational units )
- The query planner chooses the optimal plan automatically
Partitioning
Horizontal Partitioning
- Split tables by rows into partitions
- Each partition is treated as a table
Benefits
- We can limit scans to a small number of partitions using statistics on the data being queried
- Indexes for each partition are smaller, which leads to more efficient query execution
- DML queries are more efficient and faster. Bulk data operations benefit because partitions are small and entire partitions can be dropped at once.
Vertical Partitioning
- Vertical partitioning separates the columns of a large table into multiple tables.
- During data modelling, implement multiple tables and assign different column groups to each.
- Keep frequently queried columns together. This also requires that the same primary key is used across partitions.
Benefits
- More rows fit in a single data block, because each row is narrower.
- Indexes can be built per partition
- Because columns are separated and logically grouped, less data is read for each query, which can reduce I/O
- Columnar data storage strategies provide similar benefits — refer to Aerospike's columnar data paradigms for more information.
Example for Horizontal Partitioning — Internet of Things use case ( Range Partitioning )
- Partitioning helps for time based reports
- Comparative queries, for example, same time last year
List Partitioning
Used when data logically groups into subgroups and queries are often made within one subgroup, but the data is not time-based enough to warrant range partitioning by time.
- Partition Key — Determines which partition is used for the data ( prod_category )
- Partition Bounds — The list of values belonging to a partition
- Constraints — Each partition can have its own indexes, constraints and defaults
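A sketch of list-partition routing in Python: each partition's bounds are an explicit list of prod_category values (partition names and categories are hypothetical):

```python
# Partition bounds: each partition owns an explicit list of partition-key values.
partition_bounds = {
    "products_electronics": ["Laptop", "Phone"],
    "products_grocery": ["Produce", "Dairy"],
}

def route(prod_category):
    """Return the partition whose bounds contain the partition key value."""
    for partition, bounds in partition_bounds.items():
        if prod_category in bounds:
            return partition
    raise ValueError(f"no partition accepts {prod_category!r}")

print(route("Dairy"))  # products_grocery
```

A query filtered on prod_category can then be answered from a single partition, and that partition's local indexes and constraints apply only to its own values.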