PostGIS Performance: Improve Bounding Boxes with Decompose and Subdivide

Paul Ramsey


In the third installment of the PostGIS Performance series, I wanted to talk about performance around bounding boxes.

Geometry data is different from most column types you find in a relational database. The objects in a geometry column can differ wildly both in how much of the data domain they cover and in how much physical space they take up on disk.

The polygons in the Natural Earth “admin0” dataset range from the 1.2 hectare Vatican City to the 1.6 billion hectare Russia, and from the 4-point polygon defining Serranilla Bank to the 68 thousand points in the polygons defining Canada.

-- The five objects with the most vertices
SELECT ST_NPoints(geom) AS npoints, name
FROM admin0
ORDER BY 1 DESC LIMIT 5;

-- The five objects with the largest geodetic area
SELECT ST_Area(geom::geography) AS area, name
FROM admin0
ORDER BY 1 DESC LIMIT 5;

As you can imagine, polygons this different will have different performance characteristics:

  • Physically large objects take longer to work with: longer to pull off the disk, to scan, to calculate with, as the ST_MemSize query below shows.
  • Geographically large objects overlap more of the other objects, which reduces the effectiveness of your indexes.
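One way to look at the physical side directly is ST_MemSize, which reports the number of bytes a geometry occupies, here alongside the vertex count:

SELECT name,
  ST_MemSize(geom) AS bytes,
  ST_NPoints(geom) AS npoints
FROM admin0
ORDER BY 2 DESC LIMIT 5;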

Your spatial indexes are “R-tree” indexes, where each object is represented by a bounding box.

The bounding boxes can overlap, and it is possible for some boxes to cover a lot of the dataset.

For example, here is the bounding box of France.

What?! How is that France? Well, France is more than just the European parts: it also includes the island of Réunion, in the southern Indian Ocean, and the island of Guadeloupe, in the Caribbean. Taken together they result in this very large bounding box.
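You can compute that box yourself with Box2D (this assumes, as in the queries above, that the country name lives in the name column):

SELECT Box2D(geom)
FROM admin0
WHERE name = 'France';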

Such a large box makes a poor addition to the spatial index of all the objects in “admin0”. I could be searching with a query key in the middle of the Atlantic, and the index would still be telling me “maybe it is in France?”.
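You can see this box-only behavior directly with the && operator, which compares nothing but bounding boxes. A point at roughly 30°W, 30°N (my pick; any open-Atlantic spot inside the box will do) still returns France as a candidate, along with any other country whose box happens to cover that spot:

-- Box-only comparison: open ocean, yet France comes back
SELECT name
FROM admin0
WHERE geom && ST_Point(-30, 30, 4326);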

For this testing, I have made a synthetic dataset of one million normally distributed random points covering the whole world.

-- random_normal(mean, stddev) requires PostgreSQL 16+;
-- the three-argument ST_Point(x, y, srid) requires PostGIS 3.2+
CREATE TABLE random_normal AS
  SELECT id,
    ST_Point(
      random_normal(0, 180),
      random_normal(0, 80),
      4326) AS geom
  FROM generate_series(0, 1000000) AS id;


CREATE INDEX random_normal_geom_x ON random_normal USING GIST (geom);

The un-altered bounds of “admin0”, the bounds that will be used to run the spatial join, look like this. Lots of overlap, lots of places where the bounds cover areas the polygons do not.

The baseline time to do a spatial join using the un-altered “admin0” data is 9 seconds.

SELECT Count(*), admin0.name
  FROM admin0 JOIN random_normal
    ON ST_Intersects(random_normal.geom, admin0.geom)
  GROUP BY admin0.name;
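The post does not show how the timings were taken; psql's \timing, or an EXPLAIN (ANALYZE) prefix on the query, are the usual ways to reproduce numbers like these:

-- In psql, toggle client-side timing:
\timing on

-- Or get per-node timings for the join above:
EXPLAIN (ANALYZE, BUFFERS)
SELECT Count(*), admin0.name
  FROM admin0 JOIN random_normal
    ON ST_Intersects(random_normal.geom, admin0.geom)
  GROUP BY admin0.name;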

What if, instead of joining against the raw “admin0” – which includes weird cases like France and a Canada with hundreds of islands – we first decompose every object into the individual polygons that make it up, using ST_Dump?

The decomposed objects cover far less ocean, and much more accurately represent the polygons they are proxying for. And the time – including the cost of decomposing the objects – to do a full join on the 1M points falls to 3.8 seconds.

WITH polys AS (
  SELECT (ST_Dump(geom)).geom AS geom, name
  FROM admin0
)
SELECT Count(*), polys.name
FROM polys JOIN random_normal
  ON ST_Intersects(random_normal.geom, polys.geom)
GROUP BY polys.name;
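Since ST_Dump returns one row per component polygon, the join now runs against many more, much smaller rows. A quick count shows how much the decomposition multiplies the row count (the exact figure depends on your Natural Earth version):

SELECT Count(*) AS polygon_count
FROM (SELECT (ST_Dump(geom)).geom FROM admin0) AS parts;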

There is still a lot of ocean being queried here, and some of the polygons are not just spatially large, they also include a lot of vertices. What if we make the polygons smaller yet by chopping them up with ST_Subdivide?

These bounds are almost perfect: they cover very little of the ocean, and they also cap the number of vertices in any one polygon at 128 (the limit passed to ST_Subdivide below), which bounds its memory size as well. And the performance, even including the very expensive subdivision step, gets faster yet.

WITH polys AS (
  SELECT ST_Subdivide(geom, 128) AS geom, name
  FROM admin0
)
SELECT Count(*), polys.name
FROM polys JOIN random_normal
  ON ST_Intersects(random_normal.geom, polys.geom)
GROUP BY polys.name;

The final query takes just 1.8 seconds, twice as fast as the decomposed join, and five times faster than the naive spatial join. For smaller collections of points, the naive approach can work as fast as the subdivision, but for this 1M point test set the overhead of doing the subdivision is far less than the gains from using the more effective bounds.

Investing computation into creating better, smaller, and simpler geometries pays off significantly for large datasets by making the spatial index much more effective.
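One natural extension, not shown above but standard practice: if you run joins like this repeatedly, pay the subdivision cost once by materializing the pieces into their own indexed table. The table and index names below are my own invention:

-- Subdivide once, index once, reuse many times
CREATE TABLE admin0_subdivided AS
  SELECT ST_Subdivide(geom, 128) AS geom, name
  FROM admin0;

CREATE INDEX admin0_subdivided_geom_x
  ON admin0_subdivided USING GIST (geom);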



Need more PostGIS?
Join us this year on November 20 for PostGIS Day 2025, a free, virtual, community event about open source geospatial!