PostGIS Performance: Data Sampling

Paul Ramsey


One of the temptations database users face, when presented with a huge table of interesting data, is to run queries that interrogate every record. Got a billion measurements? What’s the average of that?!

One way to find out is to just calculate the average.

SELECT avg(value) FROM mytable;

For a billion records, that could take a while!

Fortunately, the “Law of Large Numbers” is here to bail us out: the average of a sample approaches the average of the population as the sample size grows. And amazingly, the sample does not even have to be particularly large to be quite close.

Here’s a table of 10M values, randomly generated from a normal distribution. We know the average is zero. What will a quick 1% sample tell us it is?

CREATE TABLE normal AS
  SELECT random_normal(0,1) AS "values"
    FROM generate_series(1,10000000);

We can take a sample using a sort, or using the random() function, but both of those techniques first scan the whole table, which is exactly what we want to avoid.
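
For comparison, here is roughly what the sort-based version of a random sample looks like (a sketch, not from the original post): it returns a genuinely random 1% of the rows, but only after scanning and sorting all ten million of them.

-- Random sample via sort: correct, but reads and sorts the whole
-- table before the first sampled row comes back
SELECT avg("values")
  FROM (SELECT "values"
          FROM normal
         ORDER BY random()
         LIMIT 100000) AS sample;  -- keep roughly 1% of the 10M rows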

Instead, we can use the PostgreSQL TABLESAMPLE feature to get a quick sample of the pages in the table and an estimate of the average.

SELECT avg("values")
  FROM normal TABLESAMPLE SYSTEM (1);

I get an answer – 0.0031, very close to the population average – and it takes just 43 milliseconds.
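
If you want to check how big that 1% page sample actually was, you can count the sampled rows in the same query; this is just a quick sanity check, not something from the original post. Adding REPEATABLE (seed) after the TABLESAMPLE clause makes the sample reproducible from run to run.

-- Quick check: how many rows did the 1% page sample pull in,
-- and what average did they produce?
SELECT count(*) AS sampled_rows,
       avg("values") AS estimated_avg
  FROM normal TABLESAMPLE SYSTEM (1);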

Can this work with spatial? For the right data, it can. Imagine you had a table that had one point in it for every person in Canada (36 million of them) and you wanted to find out how many people lived in Toronto (or this red circle around Toronto).
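
The yyz table here is presumably just a single circular polygon around Toronto. The original post does not show how it was built, but a sketch might look like this; the center point, the 30 km radius, and the lon/lat (SRID 4326) storage of census_people.geom are all assumptions.

-- Hypothetical construction of the yyz circle: a buffer around a point
-- near downtown Toronto (center and radius are illustrative assumptions)
CREATE TABLE yyz AS
  SELECT ST_Buffer(
           ST_SetSRID(ST_MakePoint(-79.38, 43.65), 4326)::geography,
           30000   -- radius in meters
         )::geometry AS geom;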

SELECT count(*)
  FROM census_people
  JOIN yyz
    ON ST_Intersects(yyz.geom, census_people.geom);

The answer is 5,010,266, and it takes 7.2 seconds to return. What if we take a 10% sample?

SELECT count(*)
  FROM census_people TABLESAMPLE SYSTEM (10)
  JOIN yyz
    ON ST_Intersects(yyz.geom, census_people.geom);

The sample is 10%, and the answer comes back as 508,292 (near one tenth of our actual measurement) in 2.2 seconds. What about a 1% sample?

SELECT count(*)
  FROM census_people TABLESAMPLE SYSTEM (1)
  JOIN yyz
    ON ST_Intersects(yyz.geom, census_people.geom);

The sample is 1%, and the answer comes back as 50,379 (near one hundredth of our actual measurement) in 0.2 seconds. Still a good estimate!
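
To turn either sampled count into an estimate of the true total, scale it by the inverse of the sampling percentage; for the 1% sample that means multiplying by 100 (a sketch, not from the original post).

-- Scale the 1% sample count back up to an estimate of the full count
SELECT count(*) * 100 AS estimated_people
  FROM census_people TABLESAMPLE SYSTEM (1)
  JOIN yyz
    ON ST_Intersects(yyz.geom, census_people.geom);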

Is this black magic? No. The TABLESAMPLE SYSTEM mode gets its speed by reading a random subset of the table’s pages and returning every row on those pages. In our last example, it randomly chose 1% of the pages. Here’s what that looks like in Toronto.

See in particular how blotchy the data are in the suburban areas outside the circle. The data in the table are not randomly distributed among the pages; they came from the census data in order and were loaded into the database in order. So for any given database page, the rows in that page tend to be geographically near one another.

This works for this example because the amount of data is high, and the area we are summarizing is a large proportion of the total data – a seventh of the Canadian population lives in that circle.

If we were summarizing a smaller area, the results would not have been so good.

TABLESAMPLE SYSTEM is a powerful tool, but you have to be sure that any given page holds a random selection of the data you are sampling for. Our random normal example worked perfectly, because the data were perfectly random. A sample of time series data would not work well for sampling time windows (the data were probably stored in order of arrival) but might work for sampling some other value.
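
One related option, not used in the examples above, is TABLESAMPLE BERNOULLI, which selects individual rows at random instead of whole pages. It stays unbiased even when the rows on a page are clustered together, but it still visits every page, so it gives up most of the speed advantage of SYSTEM.

-- BERNOULLI keeps each row with 1% probability, regardless of which page
-- it lives on, so page-level clustering does not bias the sample.
-- The trade-off: every page is still read, so it is much slower than SYSTEM.
SELECT count(*) * 100 AS estimated_people
  FROM census_people TABLESAMPLE BERNOULLI (1)
  JOIN yyz
    ON ST_Intersects(yyz.geom, census_people.geom);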