PostGIS vs GPU: Performance and Spatial Joins

Paul Ramsey

5 min read
Crunchy Bridge for Analytics
Interested in Spatial analytics? You can now connect Postgres and PostGIS to CSV, JSON, Parquet / GeoParquet, Iceberg, and more with Crunchy Bridge for Analytics.

Every once in a while, a post shows up online about someone using GPUs for common spatial analysis tasks, and I get a burst of techno-enthusiasm. Maybe this is truly the new way!

This week it was a post on GPU-assisted spatial joins that caught my eye. In summary, the author took a 9M record set of parking infractions data from Philadelphia and joined it to a 150 record set of Philadelphia neighborhoods. The process involved building up a little execution engine in Python. It was pretty manual but certainly fast.

philly02

I wondered: how bad would execution on a conventional environment like PostGIS be, in comparison?

Follow along, if you like!

Server Setup

I grabbed an 8-core cloud server with PostgreSQL 14 and PostGIS 3 from Crunchy Bridge DBaaS, like so:

bridge01

Data Download

Then I pulled down the data.

#
# Download Philadelphia parking infraction data
#
curl "https://phl.carto.com/api/v2/sql?filename=parking_violations&format=csv&skipfields=cartodb_id,the_geom,the_geom_webmercator&q=SELECT%20*%20FROM%20parking_violations%20WHERE%20issue_datetime%20%3E=%20%272012-01-01%27%20AND%20issue_datetime%20%3C%20%272017-12-31%27" > phl_parking.csv

#
# Download Philadelphia neighborhoods
#
wget https://github.com/azavea/geo-data/raw/master/Neighborhoods_Philadelphia/Neighborhoods_Philadelphia.zip

#
# Convert neighborhoods file to SQL
#
unzip Neighborhoods_Philadelphia.zip
shp2pgsql -s 102729 -D Neighborhoods_Philadelphia phl_hoods > phl_hoods.sql

# Connect to database
psql -d postgres

Data Loading and Preparation

Then I loaded the CSV and SQL data files into the database.

-- Set up PostGIS
CREATE EXTENSION postgis;

-- Read in the neighborhoods SQL file
\i phl_hoods.sql

-- Reproject the neighborhoods to EPSG:4326 to match the
-- parking data
ALTER TABLE phl_hoods
  ALTER COLUMN geom
  TYPE Geometry(multipolygon, 4326)
  USING st_transform(geom, 4326);

-- Create parking infractions table
CREATE TABLE phl_parking (
    anon_ticket_number integer,
    issue_datetime timestamptz,
    state text,
    anon_plate_id integer,
    division text,
    location text,
    violation_desc text,
    fine float8,
    issuing_agency text,
    lat float8,
    lon float8,
    gps boolean,
    zip_code text
    );

-- Read in the parking data
\copy phl_parking FROM 'phl_parking.csv' WITH (FORMAT csv, HEADER true);

This is the first step that takes any time. Reading in the 9M records from CSV takes about 29 seconds.

Because the parking data lacks a geometry column, I create a second table that does have a geometry column, and then index that.

CREATE TABLE phl_parking_geom AS
  SELECT anon_ticket_number,
    ST_SetSRID(ST_MakePoint(lon, lat), 4326) AS geom
  FROM phl_parking ;

ANALYZE phl_parking_geom;

Making a second copy while creating a geometry column takes about 24 seconds.

Finally, to carry out a spatial join, I will need a spatial index on the parking points. In this case I follow my advice about when to use different spatial indexes and build a "spgist" index on the geometry.

CREATE INDEX phl_parking_geom_spgist_x
  ON phl_parking_geom USING spgist (geom);

This is the longest process, and takes about 60 seconds.

Running the Query

The spatial join query is a simple inner join using ST_Intersects as the join condition.

SELECT h.name, count(*)
  FROM phl_hoods h
  JOIN phl_parking_geom p
    ON ST_Intersects(h.geom, p.geom)
  GROUP BY h.name;

Before running it, though, I took a look at the EXPLAIN output for the query, which is this.

HashAggregate  (cost=4031774.83..4031776.41 rows=158 width=20)
   Group Key: h.name
   ->  Nested Loop  (cost=0.42..4024339.19 rows=1487128 width=12)
         ->  Seq Scan on phl_hoods h  (cost=0.00..30.58 rows=158 width=44)
         ->  Index Scan using phl_parking_geom_spgist_x on phl_parking_geom p
             (cost=0.42..25460.90 rows=941 width=32)
               Index Cond: (geom && h.geom)
               Filter: st_intersects(h.geom, geom)

This is all very good, a nested loop on the smaller neighborhood table against the large indexed parking table, except for one thing: I spun up an 8 core server and my plan has no parallelism!

What is up?

SHOW max_worker_processes;            -- 8
SHOW max_parallel_workers;            -- 8
SHOW max_parallel_workers_per_gather; -- 2
SHOW min_parallel_table_scan_size;    -- 8MB

Aha! That minimum table size for parallel scans seems large compared to our neighborhoods table.

select pg_relation_size('phl_hoods');

-- 237568

Yep! So, first we will set the min_parallel_table_scan_size to 1kb, and then increase the max_parallel_workers_per_gather to 8 and see what happens.

SET max_parallel_workers_per_gather = 8;
SET min_parallel_table_scan_size = '1kB';

The EXPLAIN output for the spatial join now parallelized, but unfortunately only to 4 workers.

Finalize GroupAggregate  (cost=1319424.81..1319505.22 rows=158 width=20)
   Group Key: h.name
   ->  Gather Merge  (cost=1319424.81..1319500.48 rows=632 width=20)
         Workers Planned: 4
         ->  Sort  (cost=1318424.75..1318425.14 rows=158 width=20)
               Sort Key: h.name
               ->  Partial HashAggregate
                   (cost=1318417.40..1318418.98 rows=158 width=20)
                     Group Key: h.name
                     ->  Nested Loop
                         (cost=0.42..1018842.71 rows=59914937 width=12)
                           ->  Parallel Seq Scan on phl_hoods h
                               (cost=0.00..29.39 rows=40 width=1548)
                           ->  Index Scan using phl_parking_geom_spgist_x
                               on phl_parking_geom p
                               (cost=0.42..25460.92 rows=941 width=32)
                                 Index Cond: (geom && h.geom)
                                 Filter: st_intersects(h.geom, geom)

Now we still get a nested loop join on neighborhoods, but this time the planner recognizes that it can scan the table in parallel. Since the spatial join is fundamentally CPU bound and not I/O bound this is a much better plan choice.

SELECT h.name, count(*)
  FROM phl_hoods h
  JOIN phl_parking_geom p
    ON ST_Intersects(h.geom, p.geom)
  GROUP BY h.name;

The final query running with 4 workers takes 24 seconds to execute the join of the 9 million parking infractions with the 150 neighborhoods.

philly01

This is pretty good!

Is it faster than the GPU though? No, the GPU post says his custom python/GPU solution took just 8 seconds to execute. Still, the differences in environment are important:

  • The data in PostgreSQL is fully editable in place, by multiple users and applications.
  • The execution plan in PostgreSQL is automatically optimized. If the next query involves neighborhood names and plate numbers, it will be just as fast and involve no extra custom code.
  • The PostGIS engine is capable of answering 100s of other spatial questions about the database.
  • With a web service like pg_tileserv or pg_featureserv, the data are immediately publishable and remote queryable.

Conclusion

Basically, it is very hard to beat a bespoke performance solution with a general purpose tool. Yet, PostgreSQL/PostGIS comes within "good enough" range of a high end GPU solution, so that counts as a "win" to me.

Avatar for Paul Ramsey

Written by

Paul Ramsey

March 10, 2022 More by this author