Postgres Indexes, Selectivity, and Statistics

Paul Ramsey

The sheer cleverness of relational databases is often discounted because we so frequently use them for very simple data management tasks.

Serialize an object into a row, store with unique key. yawwwn

Search for unique key, deserialize row into an object. yawwwwwwn

The real power of relational databases is juggling "relations" (aka tables) in large numbers and figuring out on-the-fly the most effective way to filter out rows and find an answer.

diagram of query plan from payment to append to sort to merge left join

PostgreSQL has an undeniably clever query planning system that auto-tunes based on the data in the system. It samples tables to gain statistics about the distribution of data, and uses those statistics to choose the order of joins and filters applied to the data for the most efficient query execution.

Even more amazing, the query planning system is modular enough to integrate user-defined data types, like the geometry and geography types in PostGIS. So complex queries involving spatial filters are also correctly planned on-the-fly.

Statistics Targets

Mostly, this just works magically. The system has a rough idea of the selectivity of any given filter or join condition, and can use that estimate to choose the most efficient order of operations to execute multi-table and multi-filter queries over complex collections of relational tables.

But what about when things go wrong?

a cat yawning meme with the words "more statistics"

By default, a PostgreSQL database ships with a default_statistics_target of 100. This means that, to populate the statistics for a column, the database will draw a sample of 300 * default_statistics_target = 30000 rows.

The system will then use that data to populate the "common value list" and column histogram with roughly default_statistics_target entries.
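
If you want to see or experiment with the cluster-wide target, something like the following works. Note that changing it globally is done in postgresql.conf or via ALTER SYSTEM, and a new value only matters once ANALYZE runs again.

-- Check the current cluster-wide sampling target (100 by default).
SHOW default_statistics_target;

-- Raise it for the current session only, as an experiment.
SET default_statistics_target = 200;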

A large table of keys arranged in a Pareto distribution can make things hard for the statistics system. There may be more common keys than can easily fit into the "common value list", so joins, for example, get poorly planned.

a steep asymptote showing rapidly diminishing values approaching one

Fortunately, there are some table-specific knobs you can turn.

The statistics target on each table and column can be individually changed, so that more data are sampled, and more granular statistics are gathered.

ALTER TABLE people ALTER COLUMN age SET STATISTICS 200;
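
The new target only takes effect the next time statistics are gathered, so follow it with a fresh ANALYZE of the same (hypothetical) table:

ANALYZE people;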

Gathering more statistics is usually sufficient to correct planning behavior. However, in special cases (like when a query uses multiple filters on strongly correlated columns) it might be necessary to pull out "extended statistics" to gather even more data for planning.
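
As a sketch of what that looks like, extended statistics objects are declared with CREATE STATISTICS and picked up on the next ANALYZE. The city and postal_code columns here are hypothetical stand-ins for a strongly correlated pair:

-- Track the functional dependency between two correlated columns.
CREATE STATISTICS people_city_postal (dependencies) ON city, postal_code FROM people;
ANALYZE people;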

Indexes Make (Some) Things Faster

The thing about magic is that underneath there is usually some complex machinery to make it all work transparently.

Pretty much every database user understands that there's a thing called an "index" and that adding an "index" to a column will make searches on that column faster.

For most uses, this rule-of-thumb is correct: indexes make random access of a small number of items faster, at the expense of making inserts and updates to a table slightly slower.
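
For example, a single-column B-tree index on the hypothetical people table used in the queries below would be created like this:

CREATE INDEX people_age_idx ON people (age);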

What happens, though, if you are accessing a large number of items from a table?

xkcd cartoon showing indexes being constrained to one fractile

An index is a tree. Finding one item by traversing the tree is fast; that's an "index scan". But finding a lot of items means running through the tree over and over and over. Index scanning all the records in a table is slow.

An efficient index scan query will have a narrow filter:

SELECT * FROM people WHERE age = 4;

Conversely, reading a table from front to back can be quite fast, if you are interested in most of the records. This is called a "sequence scan".

An efficient sequence scan query will have a wide filter:

SELECT * FROM people WHERE age BETWEEN 45 AND 95;
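
If you want to check which strategy the planner actually picks, EXPLAIN prints the plan without executing the query. Assuming the people table and age index sketched above, the narrow filter will typically report an Index Scan and the wide one a Seq Scan, though the exact choice depends on the data and the statistics:

EXPLAIN SELECT * FROM people WHERE age = 4;

EXPLAIN SELECT * FROM people WHERE age BETWEEN 45 AND 95;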

For any given filter, a relational database will figure out whether the filter is most efficiently run as an index scan or a sequence scan. But how?

Selectivity

The "selectivity" of a filter is a measure of what proportion of the records will be returned by the filter. A filter that returns only 1% of the records is "very selective", and vice versa.

An ordinary filter on numbers or strings or dates might define an equality or a range as a filter. A spatial filter on geographic objects defines a bounding box as a filter. A filter is just a true/false test that every object in the column can be subjected to.
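
In PostGIS, for example, a bounding-box filter is usually written with the && operator against an envelope. The parcels table and the coordinates here are hypothetical:

SELECT * FROM parcels
  WHERE geom && ST_MakeEnvelope(-123.5, 48.0, -123.0, 48.5, 4326);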

Here are two spatial filters. Which one do you suppose is more selective?

two boxes with one smaller than the other

The red one, right? Just like our SQL example above on ages, the smaller rectangle must return fewer records. Except not necessarily. Suppose our data table looked like this.

a dot plot with most dots inside the smaller box

So the blue box is selective and should use an index scan, and the red one is not selective and should use a sequence scan.

But wait, how does the database know which box is selective, without running a query and actually applying the filters to the data?

Statistics

When the ANALYZE command is run, the database will gather "statistics" for the columns of the analyzed tables. (The "autovacuum" system that runs in the background of every PostgreSQL cluster also "autoanalyzes", gathering statistics at the same time it does table maintenance.)

For data types like integers, real numbers, text and dates the system will gather a "common value list" of values that show up frequently in the table, and a histogram of the frequency of values in different ranges.
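
Those statistics are visible in the pg_stats system view. For the hypothetical people table, a query like this shows the common value list and the histogram bucket boundaries gathered for the age column:

SELECT most_common_vals, most_common_freqs, histogram_bounds
  FROM pg_stats
  WHERE tablename = 'people' AND attname = 'age';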

For spatial data types, the system builds a spatial histogram of frequencies of occurrence. So, for our example data set, something like this.

a dot plot laid over a grid with count of number of dots in each square

Instead of having to consider a potentially unconstrained collection of maybe millions of spatial objects, the database now just has to sum up a few cells in the histogram to get an estimate of how many features a given query filter will return.

the previous box diagram overlaying the grid overlaying the dot plot
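
That estimate is exactly the rows figure that shows up in a query plan. Using the hypothetical parcels table from earlier, EXPLAIN will show the planner's guess for a given box, derived from the spatial histogram rather than from reading the table:

EXPLAIN SELECT * FROM parcels
  WHERE geom && ST_MakeEnvelope(-123.5, 48.0, -123.0, 48.5, 4326);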

The spatial statistics analyzer in PostGIS is actually a little more sophisticated than that: before filling out the histogram it adjusts the cell widths for each dimension, to try and catch more resolution in places where the data is more dense.

dot plot over a grid

All these statistics do have a cost. Increasing statistics targets will make the ANALYZE command run a little slower, as it samples more data. It will also make the planning stage of queries slower, as the larger collection of statistics needs to be read through and compared to the filters and joins in the SQL. However, for complex BI queries, more statistics are almost always worthwhile to get a better plan.

Conclusion

  • PostgreSQL has an undeniably clever query planning system which auto-tunes based on the data in the system.
  • PostGIS plugs right into that clever system and provides spatial selectivity estimates to ensure good plans for spatial queries too.
  • For even more complex situations, such as selectivity estimates on highly correlated columns, PostgreSQL also includes an "extended statistics" system to capture those cases.
Written by

Paul Ramsey

June 23, 2022