Crunchy Bridge for Analytics: Your Data Lake in PostgreSQL

Marco Slot

A lot of the world’s data lives in data lakes, huge collections of data files in object stores like Amazon S3. There are many tools for querying data lakes, but none are as versatile and have as wide an ecosystem as PostgreSQL. So, what if you could use PostgreSQL to easily query your data lake with state-of-the-art analytics performance?

Today we’re announcing Crunchy Bridge for Analytics, a new offering in Crunchy Bridge that lets you query and interact with your data lake using PostgreSQL commands via extensions, with a vectorized, parallel query engine.

With Bridge for Analytics you can easily set up tables that point directly to Parquet, CSV, or JSON files in object storage, without having to specify which columns are in the file(s), and run very fast analytical queries.

Moreover, Bridge for Analytics comes with powerful data import and export capabilities, enabling you to easily create regular or temporary tables from files in object storage, load additional data, or export tables and query results back into object storage. And of course, you have all the existing benefits of Crunchy Bridge, an enterprise-grade managed PostgreSQL service, including saved queries, built-in connection pooling, VPC, container apps, and much more.

Let’s dive in!

Querying files in your data lake

We wanted to make querying your data lake simple, fast, and well-integrated into PostgreSQL. We found foreign tables are ultimately the most suitable infrastructure for that. Bridge for Analytics gives you a simple interface for creating foreign tables from a wide variety of data files.

To get started with Bridge for Analytics, first set up an analytics instance in Bridge, add your credentials, and connect using psql or your favorite PostgreSQL client. When you’re connected to your instance, you can create a foreign table that points to files in your data lake using server crunchy_lake_analytics:

-- create a table from a Parquet file, column definitions can be empty
create foreign table hits ()
server crunchy_lake_analytics
options (path 's3://mybucket/hits.parquet');

Notice that specifying the column names and types is unnecessary! For Parquet files, we’ll infer the schema directly from the file metadata if you leave the column definitions empty. For CSV and JSON, we’ll make an informed guess based on the file structure, which might take a bit longer.
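The same empty column list works for CSV and JSON. Here is a minimal sketch, assuming a CSV file at a hypothetical path:

-- create a table from a CSV file; column names and types are inferred
-- from the file contents (hypothetical path)
create foreign table page_views ()
server crunchy_lake_analytics
options (path 's3://mybucket/page_views.csv');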

Once your foreign table is created, you can immediately start querying your data and take advantage of lightning-fast analytics:

\d hits
┌───────────────────────┬──────────────────────────┬───────────┬──────────┬─────────┐
│        Column         │           Type           │ Collation │ Nullable │ Default │
├───────────────────────┼──────────────────────────┼───────────┼──────────┼─────────┤
│ watchid               │ bigint                   │           │          │         │
│ javaenable            │ smallint                 │           │          │         │
│ title                 │ text                     │           │          │         │
│ goodevent             │ smallint                 │           │          │         │
...

-- count ~100M rows
select count(*) from hits;
┌──────────┐
│  count   │
├──────────┤
│ 99997497 │
└──────────┘
(1 row)

Time: 55.530 ms

You can also use a wildcard in the path (e.g. s3://mybucket/hits/*.parquet) to query a list of files.
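For instance, a single table spanning many Parquet files might look like this sketch (hypothetical path, assuming the files share the same schema):

-- query all Parquet files under a prefix as a single table
create foreign table hits_all ()
server crunchy_lake_analytics
options (path 's3://mybucket/hits/*.parquet');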

Bridge for Analytics takes advantage of range requests to speed up queries on Parquet files. In the background, files will also be automatically cached on NVMe drives for improved performance. Once the download completes, the queries will get even faster.

Example:

-- Run a query on a ~100M row Parquet file in S3
select AdvEngineID, count(*) from hits where AdvEngineID <> 0
group by 1 order by 2 desc limit 5;
┌─────────────┬────────┐
│ advengineid │ count  │
├─────────────┼────────┤
│           2 │ 404602 │
│          27 │ 113167 │
│          13 │  45631 │
│          45 │  38960 │
│          44 │   9730 │
└─────────────┴────────┘
(5 rows)

Time: 317.460 ms

-- Add a file to the cache, or wait for background caching to be done.
select * from crunchy_file_cache.add('s3://mybucket/hits.parquet');

-- Run a query on a ~100M row Parquet file in cache
select AdvEngineID, count(*) from hits where AdvEngineID <> 0
group by 1 order by 2 desc limit 5;
┌─────────────┬────────┐
│ advengineid │ count  │
├─────────────┼────────┤
│           2 │ 404602 │
│          27 │ 113167 │
│          13 │  45631 │
│          45 │  38960 │
│          44 │   9730 │
└─────────────┴────────┘
(5 rows)

Time: 90.109 ms

Queries use a combination of the PostgreSQL executor and an analytical query engine. That way, all SQL queries are supported—including joins with regular PostgreSQL tables—but queries that cannot yet push down into the analytics engine may be slower. You’re most likely to see performance benefits when using Parquet due to its columnar format and compression.
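For example, a join between the foreign hits table and an ordinary heap table just works, with PostgreSQL handling whatever the analytics engine cannot push down (the advertisers table below is hypothetical):

-- join the analytics table with a regular PostgreSQL table
select a.name, count(*) as hit_count
from hits h
join advertisers a on a.id = h.advengineid
group by a.name
order by hit_count desc
limit 5;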

Data lake import & export

Apart from querying your data lake directly, Bridge for Analytics also lets you easily import data from your data lake into regular PostgreSQL tables, or export tables and query results back into your data lake.

For easy import/export, we modified the copy and create table commands (via extensions) to accept URLs and to support new formats. You can use copy .. to/from 's3://…' and specify format (csv, parquet, json) and compression (none, gzip, zstd, snappy). By default, the format and compression are inferred from the file extension. You can also load files directly in your create table statements using the load_from option, or definition_from if you only want the column definitions.

For instance, here are some data import examples:

-- Create a temporary table from compressed JSON
create temp table log_input ()
with (load_from = 's3://mybucket/logs/20240101.json', compression = 'zstd');

-- Alternatively, only infer columns and load data separately using the copy command
create temp table log_input ()
with (definition_from = 's3://mybucket/logs/20240101.json', compression = 'zstd');
copy log_input from 's3://mybucket/logs/20240101.json' WITH (compression 'zstd');
copy log_input from 's3://mybucket/logs/20240102.json' WITH (compression 'zstd');

-- Clean the input and insert into a heap table
insert into log_errors
select event_time, code, message from log_input where level = 'ERROR';

You can also use the COPY .. TO command to export tables and query results to object storage:

-- Export query result to a Parquet file, compressed using snappy by default
copy (
  select date_trunc('minute', event_time), code, count(*)
  from log_errors where event_time between '2024-01-01' and '2024-01-02'
  group by 1, 2
) to 's3://mybucket/summaries/log_errors/20240101.parquet';

Finally, we made sure you can easily import/export the new COPY formats from the client, for instance using the \copy meta-command in psql.

$ psql
-- Import a compressed newline-delimited JSON file from local disk
\copy data from '/tmp/data.json.gz' with (format 'json', compression 'gzip')

-- Export a Parquet file to local disk
\copy data to '/tmp/data.parquet' with (format 'parquet')

-- Note: always specify format & compression when using \copy in psql, because the
-- local file extension is not visible to the server.

These enhancements to copy and create table make it easy to set up back-and-forth integrations between PostgreSQL and the other applications that access your data lake.

Writing to analytics tables

While first-class support for transactions on analytics tables is still under development, you may have realized that it is possible to use the copy .. to command in Bridge for Analytics to write to a path that is included in an analytics table.

For instance:

-- create a table pointing to a URL with a wildcard
create foreign table events_historical (
    event_time timestamptz,
    user_id bigint,
    event_type text,
    message text
)
server crunchy_lake_analytics options (path 's3://mybucket/events_historical/*.parquet');

-- write data from a heap table to a file that falls under the wildcard
copy (select * from events where event_time between '2024-01-01' and '2024-01-02')
to 's3://mybucket/events_historical/202401.parquet';

-- data is now visible in the analytics table
select count(*) from events_historical;
┌───────────┐
│   count   │
├───────────┤
│ 148024181 │
└───────────┘
(1 row)

Using this approach, you can feed data from regular PostgreSQL tables into your analytics table. For example, you could create a partitioned table of events for fast inserts and updates on your analytics server, and use pg_cron to periodically rotate partitions that are no longer receiving new inserts into a separate crunchy_lake_analytics table, for fast analytics and cheap long-term storage.
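As a rough sketch of that pattern (the table names, path, and schedule below are hypothetical), you could wrap the export in a small function and schedule it with pg_cron:

-- export one month of events to a per-month Parquet file under the
-- analytics table's wildcard path (names and path are examples)
create function export_month(month_start date) returns void
language plpgsql as $$
begin
  execute format(
    $cmd$copy (select * from events where event_time >= %L and event_time < %L) to %L$cmd$,
    month_start,
    month_start + interval '1 month',
    format('s3://mybucket/events_historical/%s.parquet', to_char(month_start, 'YYYYMM')));
end
$$;

-- on the 1st of each month at 03:00, export the month that just ended
select cron.schedule('rotate-events', '0 3 1 * *',
  $$select export_month((date_trunc('month', now()) - interval '1 month')::date)$$);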

It is worth noting a few limitations around writes that we plan to address over time:

  • Caching is currently done after a read, not on write, so the first few analytical queries after writing data may be slower.
  • Once a file is cached, it is not automatically refreshed if it changes, but you can use cache control functions to force a refresh.
  • Exporting data is not transactional; rolling back will not delete the file. You may find it helpful to do some bookkeeping on which partitions were successfully exported (see the sketch after this list).
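
A minimal version of that bookkeeping (table and partition names are hypothetical) could look like this:

-- track which partitions have been exported successfully
create table exported_partitions (
  partition_name text primary key,
  exported_at    timestamptz not null default now()
);

-- record the export only after the copy .. to command succeeds
insert into exported_partitions (partition_name) values ('events_202401');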

Running ClickBench on analytics and regular tables

To give you a sense of the performance of Bridge for Analytics, we used ClickBench data and queries to compare running analytical queries on a crunchy_lake_analytics table and on a regular heap table, on the same machine.

ClickBench involves a set of 43 queries on a wide table with 105 columns and ~100M rows. Each database system imports the data in its own way and may adjust the query set. We used the original hits.parquet file and the postgresql queries provided by ClickBench.

The column types in the Parquet file do not exactly match the types that the PostgreSQL queries expect, so we created a view to convert to the expected schema.

create foreign table hits_parquet ()
server crunchy_lake_analytics
options (path 's3://mybucket/hits.parquet');

-- convert EventDate and EventTime to proper types using a view
-- See https://gist.github.com/marcoslot/9e3c21ddf95c93e20a6d3ad1d9193842
create view hits as select ... from hits_parquet;

We then ran the query set with and without caching on NVMe, and on heap tables. We summed the individual query times to obtain the overall runtime.

[Chart: ClickBench total runtime on a Crunchy Bridge analytics instance (32 vCPUs)]

As you can see, Bridge for Analytics outperforms vanilla PostgreSQL by over 20x in this benchmark. The reason for this huge difference is primarily that PostgreSQL’s regular storage format (heap tables) and executor are optimized for operational workloads, not analytics. Fortunately, PostgreSQL’s extensibility means we can make it good at anything!

Using Bridge for Analytics with saved queries

Bridge for Analytics is still just PostgreSQL and Crunchy Bridge. That means that you can use any of the Crunchy Bridge features including saved queries—queries you can write (with AI), save, organize, and run via the Bridge dashboard. Query results are stored and can be shared with your team, or publicly as a webpage or as CSV/JSON.

[Screenshot: Crunchy Bridge for Analytics saved query]

We use saved queries all the time at Crunchy Data for internal business metrics. It’s a very low friction way to share insights from analytical query results.

Overview of your new PostgreSQL superpowers

We packed a lot of new superpowers into Bridge for Analytics. In case you’re losing track, let’s review them one more time with example syntax:

  • Create an analytics table from Parquet/CSV/JSON files in object storage: CREATE FOREIGN TABLE data () SERVER crunchy_lake_analytics OPTIONS (path 's3://mybucket/data/*.parquet');

  • Create a regular table from a file and immediately load the data: CREATE TABLE data () WITH (load_from = 's3://mybucket/data.csv.gz');

  • Create a regular table whose columns are based on a file: CREATE TABLE data () WITH (definition_from = 's3://mybucket/data.json');

  • Load data into a regular table: COPY data FROM 's3://mybucket/data.parquet';

  • Export a table to a file: COPY data TO 's3://mybucket/data.csv.zst' WITH (header);

  • Save a query result in a file: COPY (SELECT * FROM data JOIN dim USING (id)) TO 's3://mybucket/joined.json.gz';

  • Import/export local files in Parquet and JSON format: \copy data TO 'data.parquet' WITH (format 'parquet')

You can use each command with Parquet (uncompressed/gzip/zstd/snappy), CSV (uncompressed/gzip/zstd), and newline-delimited JSON (uncompressed/gzip/zstd), and of course every feature in PostgreSQL 16 and every extension on Crunchy Bridge.

Our goal is to give you a Swiss army knife for interacting with your data lake. By further combining Bridge for Analytics features with existing PostgreSQL features and extensions, like pg_cron and postgres_fdw, you can build sophisticated analytics pipelines entirely in PostgreSQL.

Get started with Crunchy Bridge for Analytics

It only takes a few minutes to register for Crunchy Bridge and create your first analytics instance. You can configure your AWS credentials via the dashboard and start creating analytics tables or importing data from your data lake. Check out the Crunchy Bridge for Analytics docs for more details.

We believe Crunchy Bridge for Analytics can help bridge the gap between your operational and analytical workloads, and thereby significantly simplify your stack and reduce costs. We look forward to going on this journey with you. Feel free to contact us with any questions, issues, or suggestions.

April 30, 2024