Postgres Materialized Views from Parquet in S3 with Zero ETL

Marco Slot


Data pipelines for IoT applications often span multiple systems. First, raw data is gathered in object storage, then several transformations happen in analytics systems, and finally results are written into transactional databases to be accessed by low-latency dashboards. While a lot of interesting engineering goes into these systems, things are much simpler if you can do everything in Postgres.

Crunchy Bridge for Analytics is a managed PostgreSQL offering that integrates DuckDB into PostgreSQL for fast analytics. It gives you an easy way to query a set of Parquet files in S3 with high performance using Postgres via a foreign table. You can then also use many other Postgres features, such as stored procedures, pg_cron, and materialized views.

In this blog post, we’ll look at how you can set up a sophisticated data pipeline in PostgreSQL in a few easy steps.

Accessing Parquet in S3 using PostgreSQL

After creating an Analytics cluster and setting up your AWS credentials, you can start querying Parquet/CSV/JSON files in your S3 account by creating a foreign table. If you leave the column list empty, the schema of the table is automatically inferred from the Parquet files:

-- create an analytics table for all the Parquet files that match the wildcard
create foreign table events ()
server crunchy_lake_analytics
options (path 's3://mybucket/events/*.parquet');

-- get the number of purchase events per day for the last 30 days
select date_trunc('day', event_time), count(*)
from events
where event_time >= date_trunc('day', now() - interval '30 days')
and event_type = 'purchase'
group by 1 order by 1 desc;

By delegating part of the query to DuckDB, we can answer it relatively quickly, and only the parts of the Parquet files that the query needs are fetched from S3. In the background, the Parquet files also get cached on a local NVMe drive, which speeds up later queries.
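If you are curious how much of a query is handled by the foreign table, you can inspect the plan with EXPLAIN. This is a minimal sketch; the exact plan output depends on your data, your Postgres version, and the Analytics setup.

-- inspect the plan to see which parts of the query are handled by the foreign table
explain (verbose)
select date_trunc('day', event_time), count(*)
from events
where event_type = 'purchase'
group by 1;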

Creating an indexed materialized view from Parquet files

Being able to query Parquet files quickly is great, but a query can still take a few seconds with all your CPUs working hard. If many users are looking at a dashboard simultaneously, resulting in many concurrent queries, things tend to slow down.

A solution is to create a materialized view that pre-aggregates the data, and we can do that easily in PostgreSQL in two steps:

-- 1) Set up a materialized view for all of the dimensions and aggregates we care about.
--    The heavy lifting is done by DuckDB.
create materialized view events_per_day as
select date_trunc('day', event_time) as day, event_type, count(*)
from events
group by 1, 2;

-- 2) Always create a unique index on materialized views
--    for fast lookups and to enable concurrent refresh
create unique index on events_per_day (day, event_type);

After setting up the view, we can get the same query result in a few milliseconds by doing an index lookup on the materialized view:

select day, count
from events_per_day
where event_type = 'purchase' and day >= date_trunc('day', now() - interval '30 days')
order by 1;

A materialized view with an index can easily support tens of thousands of requests per second, even on a small server. You can create multiple views to power commonly viewed charts and tables in your BI dashboard.
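As a hypothetical sketch, a second view could back another widget in the same dashboard. The device_type column here is an assumption about the event schema, not something defined earlier in this post:

-- hypothetical second view: total purchases per device type
-- (assumes the events table has a device_type column)
create materialized view purchases_per_device as
select device_type, count(*)
from events
where event_type = 'purchase'
group by 1;

-- unique index for fast lookups and to enable concurrent refresh
create unique index on purchases_per_device (device_type);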

Automatically refreshing the materialized view

Of course, if we’re dealing with a stream of IoT data, then new Parquet files will show up periodically, which means our view is no longer up-to-date.

Fortunately, we can easily refresh the materialized view in PostgreSQL. We use the “concurrently” option to avoid blocking reads.

-- refresh the view daily at 3am UTC
select cron.schedule('0 3 * * *', $$
    refresh materialized view concurrently events_per_day;
$$);

If you’re unfamiliar with the cron syntax, I recommend using the Crunchy Scheduler UI in the Crunchy Bridge dashboard.
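You can also check what has been scheduled directly from SQL, since pg_cron keeps its jobs in the cron.job table:

-- list the jobs that pg_cron has scheduled
select jobid, schedule, command from cron.job;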

What’s really neat is that we won’t repeatedly download the same Parquet files when refreshing the materialized view. Instead, Crunchy Bridge caches the files that it already processed, such that the next run only needs to pull in new files.

PostgreSQL for advanced data pipelines

A PostgreSQL server is a unique place where many different DBMS features, extensions, and tools can be seamlessly combined to tackle really hard problems.

It is worth mentioning that Crunchy Bridge for Analytics also has excellent support for exporting tables and query results back into S3 to publish them in other applications. For instance, you could export views directly into a JSON file, such that you can build your dashboard as a static site.

copy (
  select * from events_per_day where event_type = 'purchase'
)
to 's3://mybucket/dashboard/purchase_data.json';
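Since the export is just a SQL statement, you could also schedule it with pg_cron to run shortly after the view refresh. This is only a sketch; the schedule and output path are illustrative:

-- hypothetical: re-export the dashboard data daily, shortly after the 3am refresh
select cron.schedule('30 3 * * *', $$
  copy (
    select * from events_per_day where event_type = 'purchase'
  )
  to 's3://mybucket/dashboard/purchase_data.json';
$$);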

You’ll find that by combining the tools that PostgreSQL and Crunchy Bridge give you, you can start replacing many other systems.

Check out Crunchy Bridge for Analytics docs on how to get started.

September 4, 2024