PostGIS meets DuckDB: Crunchy Bridge for Analytics goes Spatial
Crunchy Data is excited to announce the next major feature release for Crunchy Bridge for Analytics: Geospatial Analytics.
We have developed a variety of features to connect Postgres and PostGIS to S3 and public web servers to make spatial data access easier than ever.
This release includes:
- Creating an analytics table directly from a geospatial data set by providing only the URL, for ad-hoc queries and data transformations.
- Creating a regular PostGIS table directly from a URL.
- Automatic mapping of geospatial columns into PostGIS geometry type.
- Support for GeoParquet, GeoJSON, Shapefile (zip), Geopackage, WKT in CSV, and more.
- Delegate PostGIS functions and operators to DuckDB spatial for fast queries on GeoParquet.
Together, these make Crunchy Bridge for Analytics an easy-to-use and powerful platform for working with geospatial data.
Query almost any geospatial data set with one easy command
PostGIS is the most popular and versatile geospatial data processing tool available, and the underlying GEOS library powers most other geospatial applications. Crunchy has a long history in PostGIS and geospatial, and we’re lucky to count geospatial legends Paul Ramsey and Martin Davis (see: PostGIS, GEOS, JTS, pg_featureserv, pg_tileserv, and more) among our colleagues.
Crunchy Bridge for Analytics enhances PostgreSQL with the ability to run fast analytical queries on data files in S3 and public web servers, with queries accelerated using DuckDB and caching on local NVMe drives. DuckDB also has a spatial extension built on top of GEOS and inspired by PostGIS.
It was natural for us to look for ways in which we can take advantage of the capabilities offered by Bridge for Analytics for geospatial use cases. We soon realized that one of the challenges of geospatial data is the wide variety of formats and data sources, and the relative difficulty of getting them into PostgreSQL.
By leveraging the capabilities built into Bridge for Analytics, we’ve managed to simplify the experience of accessing any geospatial data set via s3 or https in PostgreSQL down to a very simple create foreign table
command:
-- Create a table from the overture buildings data set,
-- auto-infers columns, caches GeoParquet files in the background
create foreign table ov_buildings ()
server crunchy_lake_analytics
options (path 's3://overturemaps-us-west-2/release/2024-08-20.0/theme=buildings/type=*/*.parquet');
-- Immediately start querying the >2 billion row data set,
-- uses range requests until files get cached
select (names).primary as building, st_area(geometry, true) as surface_m2
from ov_buildings
where (names).primary is not null
and (bbox).xmin <= 7.2275
and (bbox).xmax >= 3.3583
and (bbox).ymin <= 53.6316
and (bbox).ymax >= 50.7504
order by st_area(geometry) desc limit 1;
┌─────────────────────────┬───────────────────┐
│ building │ surface_m2 │
├─────────────────────────┼───────────────────┤
│ Bloemenveiling Aalsmeer │ 449485.2894157285 │
└─────────────────────────┴───────────────────┘
(1 row)
Time: 10169.907 ms (00:10.170)
Queries on GeoParquet are significantly accelerated by DuckDB, and files will get automatically cached in the background. For instance, the ~600GB Overture data set can be fully cached on larger analytics clusters, which makes analytics tables a practical tool for building applications with Overture.
Support for geospatial formats is not limited to GeoParquet. You can directly create a table from Shapefile (in zip), GeoJSON, Geopackage, Geodatabase, KML, and many other file formats supported by the GDAL library, and you can use public URLs to get data directly from the source.
-- Load US state boundaries from a compressed TIGER/Line Shapefile
create foreign table state ()
server crunchy_lake_analytics
options (format 'gdal', path 'https://www2.census.gov/geo/tiger/TIGER2023/STATE/tl_2023_us_state.zip');
-- Inspect auto-inferred schema
\d state
Foreign table "public.state"
┌──────────┬──────────┬───────────┬──────────┬─────────┬─────────────┐
│ Column │ Type │ Collation │ Nullable │ Default │ FDW options │
├──────────┼──────────┼───────────┼──────────┼─────────┼─────────────┤
│ region │ text │ │ │ │ │
│ division │ text │ │ │ │ │
...
│ geom │ geometry │ │ │ │ │
└──────────┴──────────┴───────────┴──────────┴─────────┴─────────────┘
Server: crunchy_lake_analytics
FDW options: (path 'https://www2.census.gov/geo/tiger/TIGER2023/STATE/tl_2023_us_state.zip');
-- What are the biggest states?
select name, st_area(geom, true)/1000000 area_in_km2
from state
order by 2 desc limit 10;
┌────────────┬───────────────────┐
│ name │ area_in_km2 │
├────────────┼───────────────────┤
│ Alaska │ 1724364.048632004 │
│ Texas │ 695668.3746231933 │
│ California │ 423965.0992563212 │
│ Montana │ 380840.4022201886 │
│ New Mexico │ 314925.0846268172 │
│ Arizona │ 295220.1394989747 │
│ Nevada │ 286376.9475553515 │
│ Colorado │ 269604.5427509235 │
│ Oregon │ 254799.4066699504 │
│ Wyoming │ 253326.2430649384 │
└────────────┴───────────────────┘
(10 rows)
Time: 802.659 ms
Queries on GDAL data sets are currently slower than on GeoParquet, but the files will be immediately cached on disk when creating the table, so they are only downloaded once. On very rare occasions when the server is replaced, or after the file was evicted from cache, the file is automatically re-downloaded on demand.
There are several existing tools for loading data into PostGIS, though they are relatively laborious, and usually involve downloading large files to your computer and subsequently re-uploading the output. The ogr_fdw extension by Paul is probably the most versatile geospatial data access option available for PostgreSQL, though it will re-request remote data files for every query and is hence more suitable for accessing remote databases and web services with filter pushdown.
Building geospatial data pipelines with PostGIS
Once you’ve created an analytics table, you can start building a data transformation pipeline to get the data into the shape you want via (materialized) views.
For instance, a very simple pipeline might look like:
-- Create an analytics table for ad-hoc queries and transformations
create foreign table state ()
server crunchy_lake_analytics
options (path 'https://www2.census.gov/geo/tiger/TIGER2023/STATE/tl_2023_us_state.zip');
-- Create a materialized view for rendering a simple bar chart with sub-millisecond query time
create materialized view states_by_size as
select stusps, name, st_area(geom, true)/1000000 area_in_km2 from state;
You can also combine multiple data sets with spatial joins and compose views:
-- National Forest System boundaries (Shapefile)
create foreign table forests ()
server crunchy_lake_analytics
options (path 'https://data.fs.usda.gov/geodata/edw/edw_resources/shp/S_USA.AdministrativeForest.zip');
-- Fire occurence points in the US (Shapefile)
create foreign table fires ()
server crunchy_lake_analytics
options (path 'https://data.fs.usda.gov/geodata/edw/edw_resources/shp/S_USA.MTBS_FIRE_OCCURRENCE_PT.zip');
-- Only consider fires in national forests in 2022
create view nfs_fires_in_2022 as
select fires.*, forests.adminfores
from forests, fires
where st_within(fires.geom, forests.geom)
and date_trunc('year', ig_date) = '2022-01-01';
-- Find the forests which had fires in 2022
create view forests_with_fires_in_2022 as
select *
from forests
where adminfores in (
select adminfores from nfs_fires_in_2022
);
Finally, we also made it very straight-forward to create a regular heap table with a PostGIS geometry column directly from a public geospatial data set by setting the load_from
option in a create table
command.
-- Create a regular table with an index from a Shapefile zip (note: WITH uses = syntax)
create table forests ()
with (load_from = 'https://data.fs.usda.gov/geodata/edw/edw_resources/shp/S_USA.AdministrativeForest.zip');
-- Add a spatial index
create index on forests using gist (geom);
-- You can also load data into an existing table using COPY, assuming the schemas match
copy forests from 'https://data.fs.usda.gov/geodata/edw/edw_resources/shp/S_USA.AdministrativeForest.zip';
You can see we have a lot of options here, so some general guidance:
create foreign table + create view
- for ad-hoc queries and transformations on current datacreate foreign table + create materialized view + create index
- for repeated selective queries on data sets that occasionally need to be refreshedcreate table with load_from + create index
- loading the table data directly into PostGIS as a one-off
Overall, our aim is to give you a powerful toolbox for geospatial data, while also simplifying common scenarios down to very simple operations (create foreign table, create view, start rendering).
Connecting QGIS to Crunchy Bridge for Analytics
Since Crunchy Bridge for Analytics is just PostgreSQL, you can directly create a connection to your Analytics cluster from QGIS and add your (foreign) tables and views as layers, which means you can very quickly go from geospatial data set to visualization.
For example, the 4 commands for creating the forest views from the previous section can give you a map of national forests which had fires in 2022 and where those fires occurred:
Note that QGIS by default requires that the first column of the table is unique. This is quite often the case, but when it’s not you may need to create a view to reorder the columns or add a unique value.
PostGIS combined with DuckDB spatial
Under the covers, Crunchy Bridge for Analytics takes advantage of DuckDB spatial. It is an awesome DuckDB extension, though it is still in an early stage of development. We map PostGIS functions and operators to DuckDB spatial functions where possible to accelerate analytical queries, and otherwise pull geometries into PostGIS, such that any query works as expected.
By default, geometry values in analytics tables have SRID set to 0/unspecified. You can set the SRID using st_setsrid
as usual to make functions such as st_distance
return the right units, but that will happen in PostgreSQL and transferring the geometries from DuckDB to PostgreSQL might slow down some queries. On the other hand, you can easily transfer the data set into a regular table or materialized view with an index if needed.
For queries on (Geo)Parquet, the speedup from DuckDB can be quite significant, so it may be worth avoiding SRIDs. You can check explain verbose
to see which part of the query is delegated to DuckDB.
Get started with Geospatial Analytics and tell us your thoughts!
We believe this initial geospatial analytics release helps to bridge the gap of going from raw geospatial data files into a structured/indexed PostGIS table. These new features can help bootstrap many geospatial applications.
We’re excited to share this new feature with customers and get feedback and continue to build out the next generation of spatial analytics.
Geospatial analytics is available today on Crunchy Bridge, and it only takes a few minutes to get started. See our spatial analytics documentation for additional details.