Why Covering Indexes in Postgres Are Incredibly Helpful
The PostgreSQL 11 release is nearly here (maybe in the next couple of weeks?!), and while a lot of the focus will be on the improvements to the overall performance of the system (and rightly so!), it's important to notice some features that, when used appropriately, will provide noticeable performance improvements to your applications.
One example of such a feature is the introduction of "covering indexes" for B-tree indexes. A covering index allows a user to perform an index-only scan if the select list in the query matches the columns that are included in the index. You can specify the additional columns for the index using the "INCLUDE" keyword, e.g.
CREATE INDEX a_b_idx ON x (a,b) INCLUDE (c);
Theoretically, this can reduce the amount of I/O your query needs in order to retrieve information (traditionally, I/O is the biggest bottleneck on database systems). Additionally, the data types included in a covering index do not need to be B-tree indexable; you can add a column of any data type to the INCLUDE part of a CREATE INDEX statement.
However, you still need to be careful how you deploy covering indexes: each column you add to the index takes up space on disk, and there is still a cost to maintaining the index, for example, on row updates.
Understanding these trade-offs, you can still apply covering indexes in ways that significantly help your applications.
A Simple Example: Tracking Coffee Shop Visits
Let's pretend we're building a location-aware tracking application to track which coffee shops we are visiting and what times of day we are drinking coffee. This can be accomplished using a table such as:
CREATE TABLE visits (
id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
visited_at timestamptz NOT NULL,
geocode point NOT NULL
);
Notice the use of the "point" data type. This stores the geocoordinate of the coffee shop that we visited, which ultimately could be used to find the shop on a map. If this were a more robust application, we should probably use one of the data types offered by PostGIS, but for the sake of this exercise we can rely on the point type.
Let's say we've managed to track coffee purchases since 2004, and let's also say that we drink coffee around three times per day. In other words, the data set looks roughly like this:
SELECT * FROM visits ORDER BY visited_at LIMIT 10;
id | visited_at | geocode
------+-------------------------------+------------------------
1 | 2004-01-01 07:53:31.751522-05 | (40.619171,-73.989913)
2 | 2004-01-01 09:18:07.670386-05 | (40.760875,-73.979729)
3 | 2004-01-01 13:35:40.055012-05 | (40.761401,-73.967862)
4 | 2004-01-02 05:13:38.413874-05 | (40.755043,-73.986415)
5 | 2004-01-02 11:54:25.981625-05 | (40.753176,-73.974975)
6 | 2004-01-02 16:46:08.406623-05 | (40.577865,-73.961132)
7 | 2004-01-03 07:33:38.423921-05 | (40.762228,-73.986567)
8 | 2004-01-03 11:26:13.409296-05 | (40.710912,-74.009844)
9 | 2004-01-03 15:47:06.933456-05 | (40.708458,-74.007531)
10 | 2004-01-04 05:55:04.428365-05 | (40.704552,-74.008893)
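If you would like to follow along at home, here is one way to synthesize a comparable data set with generate_series. This is a sketch of my own, not part of the original setup; the hours, minutes, and coordinates are randomized assumptions, so your rows will differ:
-- Roughly three coffee visits per day since 2004; times and locations are invented.
INSERT INTO visits (visited_at, geocode)
SELECT d + make_interval(hours => h, mins => (random() * 59)::int),
       point(40.57 + random() * 0.2, -74.01 + random() * 0.05)
FROM generate_series('2004-01-01'::timestamptz, '2018-09-30'::timestamptz, '1 day') AS d,
     unnest(ARRAY[7, 11, 16]) AS h;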
Let's say we want to look up the times and locations I drank coffee on October 1, 2012 ordered from earliest to latest. We could run a query that looks like this:
SELECT visited_at, geocode
FROM visits
WHERE visited_at BETWEEN '2012-10-01' AND '2012-10-02'
ORDER BY visited_at;
(Note: Technically, BETWEEN ... AND is inclusive on both ends, i.e. it translates to '2012-10-01 0:00' <= visited_at <= '2012-10-02 0:00', but we can safely assume in this exercise that we are not drinking coffee at midnight on the dot.)
which returns something like:
visited_at | geocode
-------------------------------+------------------------
2012-10-01 07:58:23.440596-04 | (40.722536,-73.997959)
2012-10-01 11:47:47.744383-04 | (40.755352,-73.983972)
2012-10-01 15:44:10.670968-04 | (40.718045,-74.007222)
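As an aside, if we did worry about midnight coffee, a half-open range would sidestep the inclusive upper bound of BETWEEN entirely:
SELECT visited_at, geocode
FROM visits
WHERE visited_at >= '2012-10-01' AND visited_at < '2012-10-02'
ORDER BY visited_at;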
Let's see the query plan. For the purposes of this exercise, we are going to use EXPLAIN ANALYZE to not only see the query plan, but also see how the database actually executes it. Because proper benchmarking is a very large topic, please treat the query times as directional for the purposes of this demonstration.
As such, when executing the following:
EXPLAIN ANALYZE SELECT visited_at, geocode
FROM visits
WHERE visited_at BETWEEN '2012-10-01' AND '2012-10-02'
ORDER BY visited_at;
we should see a query plan similar to:
Sort (cost=360.74..360.75 rows=4 width=28) (actual time=2.330..2.330 rows=3 loops=1)
Sort Key: visited_at
Sort Method: quicksort Memory: 25kB
-> Seq Scan on visits (cost=0.00..360.69 rows=4 width=28) (actual time=1.237..2.318 rows=3 loops=1)
Filter: ((visited_at >= '2012-10-01 00:00:00-04'::timestamp with time zone) AND (visited_at <= '2012-10-02 00:00:00-04'::timestamp with time zone))
Rows Removed by Filter: 16110
Planning Time: 0.120 ms
Execution Time: 2.352 ms
This should make sense: at present, the only index on the visits table is the unique index from our PRIMARY KEY definition. In order to find all the coffee shops I visited on Oct 1, 2012, it will need to scan every single row in the table (a "sequential scan" or "seq scan" in the query plan).
Adding a UNIQUE Index
Let's try adding an index. We know that a person can only be in one place at one time, so it should be safe to add a UNIQUE index on (visited_at, geocode).
There is a problem though: as of PostgreSQL 11, the point data type does not have an equality operator defined, so when we try to run:
CREATE UNIQUE INDEX visits_visited_at_geocode_idx ON visits (visited_at, geocode);
We receive the following error:
ERROR: data type point has no default operator class for access method "btree"
HINT: You must specify an operator class for the index or define a default operator class for the data type.
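If you are curious which access methods can index point, the system catalogs spell it out. This is a quick diagnostic query of my own; you will find operator classes for gist and spgist, but none for btree:
SELECT am.amname, opc.opcname
FROM pg_opclass opc
JOIN pg_am am ON am.oid = opc.opcmethod
WHERE opc.opcintype = 'point'::regtype;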
Given this table is only tracking my own visits to coffee shops, we can add a unique index on just the visited_at column:
CREATE UNIQUE INDEX visits_visited_at_idx ON visits (visited_at);
We should also run ANALYZE visits; to update the statistics tables for the query planner.
Let's try again to execute the query to look up all of the coffee shops I visited on Oct 1, 2012. When we run:
EXPLAIN ANALYZE SELECT visited_at, geocode
FROM visits
WHERE visited_at BETWEEN '2012-10-01' AND '2012-10-02'
ORDER BY visited_at;
We should see a query plan similar to:
Index Scan using visits_visited_at_idx on visits (cost=0.29..8.36 rows=4 width=24) (actual time=0.062..0.065 rows=3 loops=1)
Index Cond: ((visited_at >= '2012-10-01 00:00:00-04'::timestamp with time zone) AND (visited_at <= '2012-10-02 00:00:00-04'::timestamp with time zone))
Planning Time: 0.262 ms
Execution Time: 0.095 ms
Great - the query is able to take advantage of the index we just created and directionally appears to be faster. The query performs an "index scan" and searches through the index for the matching rows. It then goes to the table to retrieve any selected columns that are not included in the index, in this case, geocode.
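If you want to see those extra table (heap) reads directly, EXPLAIN's BUFFERS option reports block-level I/O alongside the timing (your buffer counts will vary from run to run):
EXPLAIN (ANALYZE, BUFFERS) SELECT visited_at, geocode
FROM visits
WHERE visited_at BETWEEN '2012-10-01' AND '2012-10-02'
ORDER BY visited_at;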
This query plan is definitely better than our previous one that did not utilize an index -- but can we do any better?
Introducing the Covering Index
As mentioned earlier, a covering index can include columns in the index that may not be indexable by a B-tree index as well as potentially reduce the amount of I/O when retrieving data. Let's include the geocode column in our UNIQUE index on visited_at. First, let's drop the original index:
DROP INDEX visits_visited_at_idx;
Now, let's create a new UNIQUE index on visited_at that covers geocode:
CREATE UNIQUE INDEX visits_visited_at_geocode_idx ON visits (visited_at) INCLUDE (geocode);
This time, I am going to run VACUUM ANALYZE visits; to not only update the statistics for the query planner, but also to update the visibility map and help ensure that PostgreSQL can perform an index-only scan.
EXPLAIN ANALYZE SELECT visited_at, geocode
FROM visits
WHERE visited_at BETWEEN '2012-10-01' AND '2012-10-02'
ORDER BY visited_at;
I should see a query plan similar to:
Index Only Scan using visits_visited_at_geocode_idx on visits (cost=0.29..13.61 rows=4 width=24) (actual time=0.039..0.042 rows=3 loops=1)
Index Cond: ((visited_at >= '2012-10-01 00:00:00-04'::timestamp with time zone) AND (visited_at <= '2012-10-02 00:00:00-04'::timestamp with time zone))
Heap Fetches: 0
Planning Time: 0.208 ms
Execution Time: 0.060 ms
Excellent, we were able to perform an index-only scan! This means that all the data was loaded directly from the index - PostgreSQL did not have to load anything from the table! Directionally, the query appears to execute slightly faster, too.
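Keep the trade-off in mind, though: the INCLUDEd column is physically stored in the index, so it costs disk space. A quick way to check what we are paying (your sizes will vary by system):
SELECT pg_size_pretty(pg_relation_size('visits_visited_at_geocode_idx')) AS index_size,
       pg_size_pretty(pg_relation_size('visits')) AS table_size;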
Given my data set is pretty small (~16,000 rows), we won't necessarily see much benefit from using covering indexes, so let's take a look at a slightly larger data set.
A Larger Application: More Users, More Coffee Shop Visits
Let's say we modify our application to allow multiple users to check in whenever they order coffee at their favorite establishment. In other words, we're adding a social component to our typical location-aware check-in application.
To store the data for our coffee shop visiting application, we can use a table similar to this:
CREATE TABLE visits (
id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
visitor uuid NOT NULL,
visited_at timestamptz NOT NULL,
geocode point NOT NULL
);
(Note: let's also say we are collecting the data from devices that each report a UUID.)
Let's also say that there are 250 users of this application, and each user drinks coffee around three times a day and has been doing so since 2004. That is over 4,000,000 data points!
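If you are following along, the earlier generate_series sketch scales up by crossing in 250 generated visitors. Again, this is my own invention; note that on PostgreSQL 11 gen_random_uuid() requires the pgcrypto extension, and you would substitute visitor UUIDs from your own generated data into the queries below:
CREATE EXTENSION IF NOT EXISTS pgcrypto; -- provides gen_random_uuid() on PostgreSQL 11

INSERT INTO visits (visitor, visited_at, geocode)
SELECT u.visitor,
       d + make_interval(hours => h, mins => (random() * 59)::int),
       point(40.57 + random() * 0.2, -74.01 + random() * 0.05)
FROM (SELECT gen_random_uuid() AS visitor FROM generate_series(1, 250)) AS u,
     generate_series('2004-01-01'::timestamptz, '2018-09-30'::timestamptz, '1 day') AS d,
     unnest(ARRAY[7, 11, 16]) AS h;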
Now let's say I want to see where five of my friends were drinking coffee from Oct 1, 2012 to Oct 5, 2012. We could execute a query like this:
SELECT visitor, visited_at, geocode
FROM visits
WHERE
visited_at BETWEEN '2012-10-01' AND '2012-10-05' AND
visitor IN (
'f806ab28-4087-4f23-aab3-6facb044d1fb',
'21f9a17c-b4bc-4e06-bbfd-2703f0f5bd47',
'e881e88a-902a-4513-94b1-bb2b797753a4',
'bcc2eb96-9be9-42d8-977f-79087debf30e',
'905cd23a-3f74-4728-a1d3-ecde205a7da1'
)
ORDER BY visitor, visited_at;
Without any indexing (and with parallel query turned on), if we run the following:
EXPLAIN ANALYZE SELECT visitor, visited_at, geocode
FROM visits
WHERE
visited_at BETWEEN '2012-10-01' AND '2012-10-05' AND
visitor IN (
'f806ab28-4087-4f23-aab3-6facb044d1fb',
'21f9a17c-b4bc-4e06-bbfd-2703f0f5bd47',
'e881e88a-902a-4513-94b1-bb2b797753a4',
'bcc2eb96-9be9-42d8-977f-79087debf30e',
'905cd23a-3f74-4728-a1d3-ecde205a7da1'
)
ORDER BY visitor, visited_at;
We might see a query plan like:
Gather Merge (cost=60048.31..60055.50 rows=60 width=40) (actual time=146.559..156.014 rows=60 loops=1)
Workers Planned: 4
Workers Launched: 4
-> Sort (cost=59048.25..59048.29 rows=15 width=40) (actual time=141.108..141.117 rows=12 loops=5)
Sort Key: visitor, visited_at
Sort Method: quicksort Memory: 26kB
Worker 0: Sort Method: quicksort Memory: 25kB
Worker 1: Sort Method: quicksort Memory: 25kB
Worker 2: Sort Method: quicksort Memory: 26kB
Worker 3: Sort Method: quicksort Memory: 25kB
-> Parallel Seq Scan on visits (cost=0.00..59047.96 rows=15 width=40) (actual time=82.108..140.903 rows=12 loops=5)
Filter: ((visited_at >= '2012-10-01 00:00:00-04'::timestamp with time zone) AND (visited_at <= '2012-10-05 00:00:00-04'::timestamp with time zone) AND (visitor = ANY ('{f806ab28-4087-4f23-aab3-6facb044d1fb,21f9a17c-b4bc-4e06-bbfd-2703f0f5bd47,e881e88a-902a-4513-94b1-bb2b797753a4,bcc2eb96-9be9-42d8-977f-79087debf30e,905cd23a-3f74-4728-a1d3-ecde205a7da1}'::uuid[])))
Rows Removed by Filter: 805638
Planning Time: 0.115 ms
Execution Time: 156.047 ms
Though PostgreSQL is able to perform a parallel sequential scan for this query, it is still a sequential scan, and we may be able to get a better speedup by utilizing an index.
Let's try creating a UNIQUE index over visitor and visited_at (we know from before that it will error if we also add geocode as a key column).
CREATE UNIQUE INDEX visits_visitor_visited_at_idx ON visits (visitor, visited_at);
As we did earlier, let's run "ANALYZE visits;"
If we run:
EXPLAIN ANALYZE SELECT visitor, visited_at, geocode
FROM visits
WHERE
visited_at BETWEEN '2012-10-01' AND '2012-10-05' AND
visitor IN (
'f806ab28-4087-4f23-aab3-6facb044d1fb',
'21f9a17c-b4bc-4e06-bbfd-2703f0f5bd47',
'e881e88a-902a-4513-94b1-bb2b797753a4',
'bcc2eb96-9be9-42d8-977f-79087debf30e',
'905cd23a-3f74-4728-a1d3-ecde205a7da1'
)
ORDER BY visitor, visited_at;
We should see a query plan like:
Sort (cost=262.65..262.81 rows=61 width=40) (actual time=0.175..0.183 rows=60 loops=1)
Sort Key: visitor, visited_at
Sort Method: quicksort Memory: 29kB
-> Bitmap Heap Scan on visits (cost=22.92..260.85 rows=61 width=40) (actual time=0.070..0.129 rows=60 loops=1)
Recheck Cond: ((visitor = ANY ('{f806ab28-4087-4f23-aab3-6facb044d1fb,21f9a17c-b4bc-4e06-bbfd-2703f0f5bd47,e881e88a-902a-4513-94b1-bb2b797753a4,bcc2eb96-9be9-42d8-977f-79087debf30e,905cd23a-3f74-4728-a1d3-ecde205a7da1}'::uuid[])) AND (visited_at >= '2012-10-01 00:00:00-04'::timestamp with time zone) AND (visited_at <= '2012-10-05 00:00:00-04'::timestamp with time zone))
Heap Blocks: exact=28
-> Bitmap Index Scan on visits_visitor_visited_at_idx (cost=0.00..22.90 rows=61 width=0) (actual time=0.056..0.056 rows=60 loops=1)
Index Cond: ((visitor = ANY ('{f806ab28-4087-4f23-aab3-6facb044d1fb,21f9a17c-b4bc-4e06-bbfd-2703f0f5bd47,e881e88a-902a-4513-94b1-bb2b797753a4,bcc2eb96-9be9-42d8-977f-79087debf30e,905cd23a-3f74-4728-a1d3-ecde205a7da1}'::uuid[])) AND (visited_at >= '2012-10-01 00:00:00-04'::timestamp with time zone) AND (visited_at <= '2012-10-05 00:00:00-04'::timestamp with time zone))
Planning Time: 0.157 ms
Execution Time: 0.223 ms
Overall, that query plan is pretty good. It is able to utilize the index we created, and directionally it is much faster than the previous query. However, the query still needs to pull data (specifically, the geocode) from the table. Would it be possible to utilize a covering index to perform an index-only scan?
To test this, let's drop the current index and add a covering index:
DROP INDEX visits_visitor_visited_at_idx;
CREATE UNIQUE INDEX visits_visitor_visited_at_geocode_idx ON visits (visitor, visited_at) INCLUDE (geocode);
VACUUM ANALYZE visits;
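(As an aside: on a busy production system, dropping an index before its replacement exists would leave queries unindexed in the interim. Had we been doing this on a live system, we might instead have built the new index first without blocking writes:)
CREATE UNIQUE INDEX CONCURRENTLY visits_visitor_visited_at_geocode_idx
    ON visits (visitor, visited_at) INCLUDE (geocode);
DROP INDEX CONCURRENTLY visits_visitor_visited_at_idx;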
Now when I run:
EXPLAIN ANALYZE SELECT visitor, visited_at, geocode
FROM visits
WHERE
visited_at BETWEEN '2012-10-01' AND '2012-10-05' AND
visitor IN (
'f806ab28-4087-4f23-aab3-6facb044d1fb',
'21f9a17c-b4bc-4e06-bbfd-2703f0f5bd47',
'e881e88a-902a-4513-94b1-bb2b797753a4',
'bcc2eb96-9be9-42d8-977f-79087debf30e',
'905cd23a-3f74-4728-a1d3-ecde205a7da1'
)
ORDER BY visitor, visited_at;
I should see a query plan like:
Index Only Scan using visits_visitor_visited_at_geocode_idx on visits (cost=0.56..24.25 rows=66 width=40) (actual time=0.026..0.076 rows=60 loops=1)
Index Cond: ((visitor = ANY ('{f806ab28-4087-4f23-aab3-6facb044d1fb,21f9a17c-b4bc-4e06-bbfd-2703f0f5bd47,e881e88a-902a-4513-94b1-bb2b797753a4,bcc2eb96-9be9-42d8-977f-79087debf30e,905cd23a-3f74-4728-a1d3-ecde205a7da1}'::uuid[])) AND (visited_at >= '2012-10-01 00:00:00-04'::timestamp with time zone) AND (visited_at <= '2012-10-05 00:00:00-04'::timestamp with time zone))
Heap Fetches: 0
Planning Time: 0.151 ms
Execution Time: 0.104 ms
Success - we are able to perform an index-only scan, and directionally, the query executes more quickly too!
Covering Indexes: Incredibly Helpful, But Know Your Data
As the above example shows, in the correct scenarios, covering indexes can help speed up how your applications interface with PostgreSQL. However, as with any indexing technique, you need to understand how people and programs interact with your database in order to make the correct index selection, and you must absolutely test your architectural choices.
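One practical way to test those choices is to watch whether an index is actually being used once it is deployed. The cumulative statistics views make this easy (scan counts accumulate since the last stats reset, so treat them as directional):
SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'visits';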