Crunchy Bridge Adds Iceberg to Postgres & Powerful Analytics Features

Marco Slot


In April we launched Crunchy Bridge for Analytics, which is a managed PostgreSQL option that enables fast and seamless querying of your data lake. Our initial release was focused on building a rock solid foundation for high performance analytics in PostgreSQL. We have since been hard at work turning it into a comprehensive analytics solution.

Our goals in building Crunchy Bridge for Analytics are to:

  • Make it very easy to query data files (incl. Parquet/CSV/JSON/Iceberg) in object stores like S3 from PostgreSQL, as well as easy data import/export.
  • Offer best-in-class analytics performance, for example by integrating DuckDB into PostgreSQL, using local storage as a cache, and applying advanced query planning techniques.
  • Enable all the versatility of PostgreSQL features, extensions, and tools to build sophisticated analytics pipelines.

In the spirit of those goals, the July 2024 release (v1.1) available today brings the following improvements:

  • Querying Iceberg tables
  • Full SQL “pushdown” making window functions, CTEs, subqueries, and complex joins even faster
  • Parquet nested type support for structs, arrays and maps
  • Automatic S3 region selection
  • Write-through caching: new files generated using COPY are simultaneously written to a local NVMe drive and Amazon S3
  • Listing files in S3
  • Reading data from Hugging Face URLs and open data sets

Let’s look at each feature in more detail.

Querying Iceberg from PostgreSQL

Most analytics data lives in object stores like Amazon S3, organized into directories of files in Parquet/CSV/JSON, or other file formats. Individual data sets are typically partitioned across many files in a directory. Over time more files get added, removed, or replaced. Crunchy Bridge for Analytics lets you query across many files using wildcard URLs like s3://mybucket/prefix/*.parquet, by listing all the files at the start of the query.

Querying a list of files does have some downsides compared to tables in a relational database. It’s not possible to atomically add/remove/replace several files at once on an object storage system like Amazon S3, so when files are changed, concurrent queries might see strange intermediate states. A directory of files also does not have a well-defined schema or version history that a query engine can rely on, which can lead to data model anomalies.

The Iceberg table format offers a solution to these problems by adding metadata files to S3 that precisely describe the schema of the table, which data files are currently part of the table, and some statistics on each file. The top-level .metadata.json file describes the current version of the table and references all other metadata files. Iceberg is quickly becoming a popular industry standard for managing tabular data in data lakes.

We made it very easy to query or import Iceberg tables in your S3 buckets by using the top-level metadata file as the path, with an optional format 'iceberg':

-- Prepare an analytics table for querying Iceberg using DuckDB
create foreign table myberg ()
server crunchy_lake_analytics
options (path 's3://cdwtestdatasets/tpch_iceberg/scale_10/public/lineitem/metadata/00000-bb061089-9ee5-44f5-a1ae-e74696a049d4.metadata.json');

select count(*) from myberg;
┌──────────┐
│  count   │
├──────────┤
│ 59986052 │
└──────────┘
(1 row)

Time: 57.289 ms

-- or, load Iceberg contents into a new heap table
create table lineitem_heap()
with (load_from = 's3://cdwtestdatasets/tpch_iceberg/scale_10/public/lineitem/metadata/00000-bb061089-9ee5-44f5-a1ae-e74696a049d4.metadata.json');

-- or, load Iceberg contents into an existing table
copy lineitem_heap from 's3://cdwtestdatasets/tpch_iceberg/scale_10/public/lineitem/metadata/00000-bb061089-9ee5-44f5-a1ae-e74696a049d4.metadata.json' with (format 'iceberg');

Iceberg tables can consist of a large number of data and metadata files. On Crunchy Bridge for Analytics, all of those files are automatically cached in the background on local NVMe drives to increase performance and lower your data transfer costs.

-- list the files in cache
select path, file_size from crunchy_file_cache.list() where path like '%lineitem%' order by path;
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬───────────┐
│                                                                   path                                                                   │ file_size │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────────┤
│ s3://cdwtestdatasets/tpch_iceberg/scale_10/public/lineitem/data/00000-44-aa7679d1-0245-4e1f-b69c-c1a257550e4e-00001.parquet              │  74363325 │
│ s3://cdwtestdatasets/tpch_iceberg/scale_10/public/lineitem/data/00001-45-aa7679d1-0245-4e1f-b69c-c1a257550e4e-00001.parquet              │  77538785 │
...
│ s3://cdwtestdatasets/tpch_iceberg/scale_10/public/lineitem/data/00019-63-aa7679d1-0245-4e1f-b69c-c1a257550e4e-00001.parquet              │  77545466 │
│ s3://cdwtestdatasets/tpch_iceberg/scale_10/public/lineitem/data/00020-64-aa7679d1-0245-4e1f-b69c-c1a257550e4e-00001.parquet              │  68565553 │
│ s3://cdwtestdatasets/tpch_iceberg/scale_10/public/lineitem/metadata/00000-bb061089-9ee5-44f5-a1ae-e74696a049d4.metadata.json             │      3475 │
│ s3://cdwtestdatasets/tpch_iceberg/scale_10/public/lineitem/metadata/3b51fcc4-1a5f-4b31-b526-196b70d23408-m0.avro                         │      9922 │
│ s3://cdwtestdatasets/tpch_iceberg/scale_10/public/lineitem/metadata/snap-6194936951242463563-1-3b51fcc4-1a5f-4b31-b526-196b70d23408.avro │      4286 │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴───────────┘
(24 rows)

Currently, we only support querying a specific snapshot of the Iceberg table. You can use tools like pyiceberg to get the latest URL from an Iceberg catalog and run:

alter foreign table myiceberg
options (set path 's3://myiceberg/metadata/v2.metadata.json');

Or you can recreate the foreign table if the schema changed.
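
For example, a minimal sketch of recreating the table to pick up a new schema, assuming a hypothetical myiceberg table and metadata path:

-- hypothetical sketch: recreate the foreign table after a schema change
drop foreign table myiceberg;
create foreign table myiceberg ()
server crunchy_lake_analytics
options (path 's3://myiceberg/metadata/v3.metadata.json');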

Full SQL pushdown

Another major improvement is that complex queries involving only analytics tables and supported functions can now be fully “pushed down” (delegated) to DuckDB for maximum acceleration.

We already pushed down filters, projections, aggregates, and sorting, but can now also accelerate window functions, grouping sets, CTEs, subqueries, and complex joins.

-- 1) make my data in S3 queryable
create foreign table user_actions ()
server crunchy_lake_analytics
options (path 's3://mybucket/user_actions/*.parquet');

-- 2) create a view that gives the 30 most active users and their activity
create view most_active_users as
with top_users as (
  select user_id, count(*)
  from user_actions
  where event_time >= now() - interval '30 days'
  group by 1 order by 2 desc limit 30
)
select user_id, count, last_event
from top_users
left join lateral (
  select event_payload as last_event
  from user_actions
  where user_id = top_users.user_id
  order by event_time desc limit 1
) last_events on (true);

-- show which part of the query gets handled by DuckDB (all of it!)
explain (verbose, analyze) select * from most_active_users;

When a part of the query (e.g. a custom function or a join with a regular PostgreSQL table) cannot be pushed down into DuckDB, the existing pushdown method is used, which is often still very fast.
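
As a sketch of such a hybrid query, assuming a hypothetical regular heap table user_profiles alongside the user_actions analytics table:

-- hypothetical: user_profiles is a regular Postgres heap table
create table user_profiles (user_id bigint primary key, country text);

-- the scan and filter on user_actions can still be accelerated by DuckDB,
-- while the join with the heap table is handled by PostgreSQL
select p.country, count(*)
from user_actions a
join user_profiles p on p.user_id = a.user_id
where a.event_time >= now() - interval '30 days'
group by 1 order by 2 desc;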

My favorite part of building these types of integrations is the synergy with other PostgreSQL features and extensions. You can replace complex data pipelines with a few simple SQL commands. For instance, you can create a materialized view on data in S3 (incl. Parquet, Iceberg) and refresh it periodically in the background:

-- 3) create a materialized view from the regular view
create materialized view most_active_users_materialized
as select * from most_active_users;

-- 4) index the materialized view to make it refresh concurrently
create unique index on most_active_users_materialized (user_id);

-- 5) refresh the materialized view in the background every hour
select cron.schedule('@hourly', 'refresh materialized view concurrently most_active_users_materialized');

You can now build a very low latency dashboard that queries the materialized view, and it will be automatically kept up-to-date when new files show up in S3. There’s no additional tooling, and all we had to do was spin up an Analytics cluster and run 5 PostgreSQL commands.

Full Parquet nested type support: Structs and maps

Postgres is renowned for its extensible type system, though there are certain areas where it has fallen behind other database systems. In particular, Postgres does not have support for generic struct and map types. Instead, it relies on pre-defined composite types or type-specific map types, such as text→text (hstore) or text→jsonb (jsonb). In theory you could support arbitrary types by serializing every value to text, but that is neither efficient nor ergonomic.

Parquet files can have deeply nested structs, arrays and maps. To be able to query any Parquet file, we needed to support nested types as well. The solution we found is to dynamically generate struct and map types for the type definitions in the Parquet file when creating a foreign table:

create foreign table overture ()
server crunchy_lake_analytics
options (path 's3://overturemaps-us-west-2/release/2024-04-16-beta.0/theme=places/type=*/*', format 'parquet');

-- inspect table columns with auto-generated types
\d overture
                                                  Foreign table "public.overture"
┌─────────────┬─────────────────────────────────────────────────────────────────────┬───────────┬──────────┬─────────┬─────────────┐
│   Column    │                                Type                                 │ Collation │ Nullable │ Default │ FDW options │
├─────────────┼─────────────────────────────────────────────────────────────────────┼───────────┼──────────┼─────────┼─────────────┤
│ id          │ text                                                                │           │          │         │             │
│ geometry    │ bytea                                                               │           │          │         │             │
│ bbox        │ crunchy_struct.xmin_xmax_ymin_ymax_35464140                         │           │          │         │             │
│ version     │ integer                                                             │           │          │         │             │
│ update_time │ text                                                                │           │          │         │             │
│ sources     │ crunchy_struct.property_dataset_record_id_confidence_acf6e375[]     │           │          │         │             │
│ names       │ crunchy_struct.primary_common_rules_8f109018                        │           │          │         │             │
│ categories  │ crunchy_struct.main_alternate_58e0a237                              │           │          │         │             │
│ confidence  │ double precision                                                    │           │          │         │             │
│ websites    │ text[]                                                              │           │          │         │             │
│ socials     │ text[]                                                              │           │          │         │             │
│ emails      │ text[]                                                              │           │          │         │             │
│ phones      │ text[]                                                              │           │          │         │             │
│ brand       │ crunchy_struct.wikidata_names_852ae8aa                              │           │          │         │             │
│ addresses   │ crunchy_struct.freeform_locality_postcode_region_country__4d9a9db[] │           │          │         │             │
│ theme       │ text                                                                │           │          │         │             │
│ type        │ text                                                                │           │          │         │             │
└─────────────┴─────────────────────────────────────────────────────────────────────┴───────────┴──────────┴─────────┴─────────────┘
Server: crunchy_lake_analytics
FDW options: (path 's3://overturemaps-us-west-2/release/2024-04-16-beta.0/theme=places/type=*/*', format 'parquet')

-- inspect auto-generated type
\d crunchy_struct.xmin_xmax_ymin_ymax_35464140
Composite type "crunchy_struct.xmin_xmax_ymin_ymax_35464140"
┌────────┬──────┬───────────┬──────────┬─────────┐
│ Column │ Type │ Collation │ Nullable │ Default │
├────────┼──────┼───────────┼──────────┼─────────┤
│ xmin   │ real │           │          │         │
│ xmax   │ real │           │          │         │
│ ymin   │ real │           │          │         │
│ ymax   │ real │           │          │         │
└────────┴──────┴───────────┴──────────┴─────────┘

-- can query the fields of the auto-generated types directly
SELECT id, geometry
FROM overture
WHERE (bbox).xmin > -1 AND (bbox).xmax < 1 AND (bbox).ymin > -1 AND (bbox).ymax < 1;

Since all the auto-generated nested types are composed of composite types and arrays, you can use standard PostgreSQL syntax to access the fields.

Maps are represented as arrays of (key, value) composite types. Additional auto-generated functions provide access to the entries: crunchy_map.extract(map, key) or map -> key to get a single value, and crunchy_map.entries(map) to get all entries.
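
As a sketch, assuming a hypothetical foreign table page_views whose tags column got an auto-generated map type:

-- hypothetical: look up a single value by key
select id, tags -> 'browser' as browser
from page_views;

-- hypothetical: expand the map into an array of (key, value) entries
select id, crunchy_map.entries(tags) as tag_entries
from page_views;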

Automatic S3 region selection

Amazon S3 can be seen as a global file system. Files in S3 are addressed through URLs of the form s3://mybucket/prefix/data.parquet, and in principle the URL and credentials are all you need to read any file that you have access to.

Unfortunately, S3 has a few quirks resulting from its 18+ year history. The bucket names and credentials are global, but the data files are stored in a specific region. Requests need to be made against a specific region, but we cannot tell which region from the URL. By default, Crunchy Bridge makes requests against the AWS region in which your PostgreSQL server is hosted. To query files from a bucket in a different S3 region we originally had to add query arguments like ?s3_region=us-west-2.

With the new release, the process is fully automatic. Crunchy Bridge for Analytics automatically discovers the actual S3 region for each bucket, so there is no need to specify s3_region anymore. You may incur higher AWS data transfer expenses when loading data from a different region. Fortunately, the local cache helps prevent many repeated reads.
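
For example, a foreign table over a bucket in another region now needs no extra options (hypothetical bucket name):

-- hypothetical: the bucket lives in us-west-2 while the server runs elsewhere;
-- the bucket region is discovered automatically, no s3_region needed
create foreign table west_coast_events ()
server crunchy_lake_analytics
options (path 's3://my-us-west-2-bucket/events/*.parquet');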

We also heard from customers that they use S3 Express One Zone. Amazon avoided making the same mistake again and added the region and AZ information to the bucket (“directory”) names. If you use a URL like s3://mydirectory--use1-az4--x-s3/data.parquet then we’ll directly use the right region and endpoint based on the URL.

Write-through caching

You can use your Analytics clusters purely for querying files in Amazon S3, but Analytics clusters are also regular managed PostgreSQL servers—with backups, HA, VPC peering, etc.—which you can use for typical PostgreSQL workloads and hybrid scenarios.

In my previous blog, I discussed how you can set up a time-partitioned table using pg_cron and pg_partman, sync the partitions into Parquet in S3 using COPY .. TO 's3://.../20240703.parquet', and then run fast analytical queries on the historical data set through an analytics table. That way, you get a combination of high-performance insertion of time series data, fast lookups of recent data, and fast analytics on historical data.

There was one small caveat: after writing a partition to Parquet in S3, you needed to manually invalidate the cache, and analytical queries might get a bit slower until the new file was downloaded into the NVMe cache.

We have now enhanced our caching logic with write-through caching, meaning we simultaneously write new files generated using COPY .. TO to a local NVMe drive and Amazon S3. That way, you immediately get the best performance. When replacing an existing file, such as when resyncing an updated partition, the cached copy is atomically updated on completion to avoid an inconsistent cache.
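
As a sketch of the partition sync described above (hypothetical table and bucket names), the exported file is immediately served from the local cache:

-- hypothetical: export one day of data; the Parquet file is written to the
-- local NVMe cache and to Amazon S3 at the same time
copy (
  select * from measurements
  where event_time >= '2024-07-03' and event_time < '2024-07-04'
) to 's3://mybucket/measurements/20240703.parquet';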

Listing files in S3

Bridge for Analytics makes it easy to query a directory of files using wildcards (* to match files in the same directory, ** to match files in any subdirectory). It can be useful to know the exact list of files that will be queried, for which we added a new utility function:

-- show all the files under a wildcard
SELECT * FROM crunchy_lake.list_files('s3://mybucket/events/*.parquet');

Listing files in a SQL query also helps in building advanced data pipelines. You could use pg_cron to periodically check for new files, load and process them, and write the result to another destination.
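
A minimal sketch of such a pipeline, assuming list_files returns a path column and pg_cron is available (hypothetical names; check the docs for the actual signature):

-- hypothetical: remember which files have been seen already
create table processed_files (path text primary key);

-- check for new files every hour using pg_cron
select cron.schedule('@hourly', $$
  insert into processed_files (path)
  select path from crunchy_lake.list_files('s3://mybucket/events/*.parquet')
  on conflict (path) do nothing
$$);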

Hugging Face public data sets

Hugging Face is a popular platform for sharing machine learning models and training data. DuckDB added a Hugging Face file system to query files directly using the hf:// prefix. We added it as well, giving you direct access to a large number of public data sets from PostgreSQL.

create table spotify ()
with (load_from = 'hf://datasets/maharshipandya/spotify-tracks-dataset/dataset.csv');

select count(*) from spotify where duration_ms >= 300000;
┌───────┐
│ count │
├───────┤
│ 16375 │
└───────┘
(1 row)

The Hugging Face URLs currently do not use caching. If you access a data set frequently, we recommend moving the data to S3 or loading it into a Postgres table.
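
For example, a hedged one-liner to move the imported data to S3 as Parquet (hypothetical bucket name), where subsequent reads benefit from caching:

-- hypothetical: export the Hugging Face data set to S3 as a Parquet file
copy spotify to 's3://mybucket/spotify/dataset.parquet';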

Get started with Crunchy Bridge for Analytics

Crunchy Bridge for Analytics is quickly becoming the most powerful PostgreSQL-based analytics solution available. By integrating DuckDB through extension APIs, we can offer state-of-the-art analytics performance in PostgreSQL. By running it on Crunchy Bridge we can offer a fully managed experience in the same AWS region as your data, with caching for additional acceleration.

If you want to give it a spin, follow the Bridge for Analytics docs for more information.

July 9, 2024