Postgres Indexes, Selectivity, and Statistics
The sheer cleverness of relational databases is often discounted because we so frequently use them for very simple data management tasks.
Serialize an object into a row, store with unique key. yawwwn
Search for unique key, deserialize row into an object. yawwwwwwn
The real power of relational databases is juggling "relations" (aka tables) in large numbers and figuring out on-the-fly the most effective way to filter out rows and find an answer.
PostgreSQL has an undeniably clever query planning system that auto-tunes based on the data in the system. It samples tables to gain statistics about the distribution of data, and uses those statistics to choose the order of joins and filters applied to the data for the most efficient query execution.
Even more amazing, the query planning system is modular enough to integrate
user-defined data types, like the
geography types in PostGIS.
So complex queries involving spatial filters are also correctly planned
Mostly, this just works magically. The system has a rough idea of the selectivity of any given filter or join condition, and can use that estimate to choose the most efficient order of operations to execute multi-table and multi-filter queries over complex collections of relational tables.
But what about when things go wrong?
By default, a PostgreSQL data ships with a
This means that to populate the statistics for a column the database will draw a
300 * default_statistics_target = 30000 rows.
The system will then use that data to populate the "common value list" and
column histogram with roughly
A large table of keys arranged in a Pareto distribution can make things hard for the statistics system. There may be more common keys than can easily fit into the "common value list", so joins get poorly planned for example.
Fortunately, there are some table specific knobs you can turn.
The statistics target on each table and column can be individually changed, so that more data are sampled, and more granular statistics are gathered.
ALTER TABLE people ALTER COLUMN age SET STATISTICS 200;
Gathering more statistics is usually sufficient to correct planning behavior. However, in special cases (like when a query uses multiple filters on strongly correlated columns) it might be necessary to pull out "extended statistics" to gather even more data for planning.
Indexes Make (Some) Things Faster
The thing about magic is that underneath there is usually some complex machinery to make it all work transparently.
Pretty much every database user understands that there's a thing called an "index" and that adding an "index" to a column will make searches on that column faster.
For most uses, this rule-of-thumb is correct: indexes make random access of a small number of items faster, at the expense of making inserts and updates to a table slightly slower.
What happens, though, if you are accessing a large number of items from a table?
An index is a tree. It's fast to find one item traversing through a tree, it's called an "index scan". But finding a lot of items involves running through the tree over and over and over. It is slow to index scan all the records in a tree.
An efficient index scan query will have a narrow filter:
SELECT * FROM people WHERE age = 4;
Conversely, reading a table from from front to back can be quite fast, if you are interested in most of the records. This is called a "sequence scan".
An efficient sequence scan query will have a wide filter:
SELECT * FROM people WHERE age BETWEEN 45 AND 95;
For any given filter, a relational database will figure out whether the filter is most efficiently run as an index scan or a sequence scan. But how?
The "selectivity" of a filter is a measure of what proportion of the records will be returned by the filter. A filter that returns only 1% of the records is "very selective", and vice versa.
An ordinary filter on numbers or strings or dates might define an equality or a range as a filter. A spatial filter on geographic objects defines a bounding box as a filter. A filter is just a true/false test that every object in the column can be subjected to.
Here's two spatial filters, which one do you suppose is more selective?
The red one, right? Just like our SQL example above on ages, the smaller rectangle must return fewer records. Except not necessarily. Suppose our data table looked like this.
So the blue box is selective and should use an index scan and the red one should use a sequence scan.
But wait, how does the database know which box is selective, without running a query and actually applying the filters to the data?
ANALYZE command is run, the database will gather "statistics" for
every column that has an index on it. (The "autovacuum" system that runs in the
background of every PostgreSQL cluster is also an "autoanalyze", gathering
statistics at the same time it does table maintenance.)
For data types like integers, real numbers, text and dates the system will gather a "common value list" of values that show up frequently in the table, and a histogram of the frequency of values in different ranges.
For spatial data types, the system builds a spatial histogram of frequencies of occurrance. So, for our example data set, something like this.
Instead of being a potentially unconstrained collection of maybe millions of spatial objects, the database now just has to make a quick summary of a few cells in a histogram to get an estimate of how many features a given query filter will return.
The spatial statistics analyzer in PostGIS is actually a little more sophisticated than that: before filling out the histogram it adjusts the cell widths for each dimension, to try and catch more resolution in places where the data is more dense.
All these statistics do have a cost. Increasing statistics targets will make the
ANALYZE command run a little slower, as it samples more data. It will also
make the planning stage of queries slower, as the larger collection of
statistics need to be read through and compared to the filters and joins in the
SQL. However, for complex BI queries, more statistics are almost always worth
while to get a better plan for a complex query.
- PostgreSQL has an undeniably clever query planning system which auto-tunes based on the data in the system.
- PostGIS plugs right into that clever system and provides spatial selectivity estimates to ensure good plans for spatial queries too.
- For even more complex situations, such as selectivity estimates on highly correlated columns, PostgreSQL also includes an "extended statistics" system to capture those cases.
June 23, 2022 •More by this author