How We Fused DuckDB into Postgres with Crunchy Bridge for Analytics
Last month we launched Crunchy Bridge for Analytics, a new managed PostgreSQL offering that lets you query your data lake directly from PostgreSQL. Since then, we have had quite a few exciting conversations with customers handling large amounts of data in PostgreSQL. A common question is of course: How does it work?
In this post, I wanted to shed some light on the internals. Crunchy Bridge for Analytics abstracts the query engine to offer fast analytics on data in Amazon S3 from within PostgreSQL. In principle, it can support multiple query engines, and it likely will in the future, but the current query engine is DuckDB.
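To make that concrete: from the user's perspective, you create a foreign table pointing at files in S3 and query it with plain SQL. The sketch below is illustrative; the table, bucket, and option names are hypothetical, and the exact syntax may differ:

```sql
-- create a foreign table backed by Parquet files in S3
-- (server and option names here are illustrative; the column list
-- is omitted since the schema can be derived from the files)
CREATE FOREIGN TABLE taxi_rides ()
SERVER crunchy_lake_analytics
OPTIONS (path 's3://my-bucket/taxi/*.parquet');

-- query it like any other PostgreSQL table; the heavy lifting is
-- delegated to the embedded analytics engine where possible
SELECT count(*) FROM taxi_rides;
```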
A bit of history: Distributed SQL pushdown in Citus
Before joining Crunchy Data at the start of the year, my colleague Önder and I spent much of the last decade working on the Citus extension for PostgreSQL. It was the first PostgreSQL extension that achieved substantial commercial success, and it remains the most mature distributed PostgreSQL solution available today.
A distributed database system like Citus divides tables into shards, which are assigned to different servers. In Citus, individual tables can be distributed, and their shards are regular PostgreSQL tables spread across a cluster of PostgreSQL servers. A key feature of Citus is that it uses PostgreSQL hooks to transparently break down a query plan into parts that can be “pushed down” to shards and parts that execute on the coordinating node. The parts of the query plan that can be pushed down are converted back to SQL statements and sent to the shards in parallel.
Example of pushing down & parallelizing part of a SQL query plan in a Citus cluster
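To make the Citus side concrete, here is a minimal sketch; the table and column names are made up, while `create_distributed_table` is the actual Citus API:

```sql
-- shard a regular table across the cluster by tenant_id
SELECT create_distributed_table('page_views', 'tenant_id');

-- this aggregate can be pushed down: each shard computes a partial
-- count in parallel, and the coordinator merges the results
SELECT tenant_id, count(*)
FROM page_views
GROUP BY tenant_id;
```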
The use of SQL as the protocol between database nodes is powerful. It means that Citus integrates cleanly with most PostgreSQL features and extensions and provides similar performance characteristics for single-node queries. Of course, where the rubber meets the road is correctly integrating with prepared statements, partitioning, PL/pgSQL, INSERT..SELECT, cursors, search_path, savepoints, stored procedures, concurrent schema changes, triggers, cancellations, etc. After all, hacking on PostgreSQL is really hard. Luckily, both the software and the engineers have become battle-hardened.
Citus also has a columnar access method, which we introduced for efficient compression of old partitions in time-partitioned tables (see the sketch below). The PostgreSQL query engine itself is very much row-oriented, so while you get speedups from reduced I/O and faster projections, it is far removed from state-of-the-art analytics engines. Moreover, the majority of the world’s data is consolidating into object stores like Amazon S3, and analytics tools need to adjust to the new reality.
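For reference, opting a table into columnar storage is a one-line change; the table definition below is hypothetical:

```sql
-- store a cold partition in compressed, column-oriented form
-- ("columnar" is the table access method added by the Citus extension)
CREATE TABLE events_2023_01 (
    event_time timestamptz NOT NULL,
    user_id    bigint,
    payload    jsonb
) USING columnar;
```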
SQL pushdown into an OLAP engine: Enter DuckDB!
Over the last couple of years, it has become apparent that a new wave of embedded query engines is disrupting the analytics space. The wave is led by DuckDB.
DuckDB was developed by Hannes Mühleisen and Mark Raasveldt at CWI, the research institute where vectorized execution was invented, and is one of the fastest OLAP tools available. It is now being developed by DuckDB Labs with many community contributions. The original goal of DuckDB was to make it easier to use modern OLAP techniques and fast SQL in typical data science workflows (e.g. Python scripts with data frames), with excellent support for querying Parquet files and data in object storage. As a side effect, it became an embeddable query engine that is used in various data systems.
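That heritage shows in day-to-day use: a few lines of DuckDB SQL can query Parquet files sitting in S3 directly. The bucket and path below are hypothetical:

```sql
-- inside DuckDB: enable S3/HTTP access via the httpfs extension,
-- then query Parquet files in place
INSTALL httpfs;
LOAD httpfs;

SELECT count(*)
FROM read_parquet('s3://my-bucket/events/*.parquet');
```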
The SQL syntax in DuckDB is very similar to PostgreSQL. That's not a coincidence. Our former Citus colleague Lukas Fittl (now running pganalyze) created the very useful pg_query library which is derived from the SQL parser in PostgreSQL. This library is the basis of the DuckDB parser.
What that means is that we can integrate DuckDB using an approach similar to query pushdown in Citus: we recognize the parts of the query plan that can be pushed down into DuckDB for vectorized, parallel execution, and construct the appropriate SQL queries to pass to DuckDB. We again use a combination of PostgreSQL hooks to achieve that for filters, aggregates, and joins, as well as more complex query structures. In some cases the full query can be pushed down; in others we merge different subplans together.
Example of pushing down part of a SQL query plan into DuckDB for parallelization & vectorized execution in Crunchy Bridge for Analytics
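As a rough sketch of what such a pushdown can look like, consider the query below against a foreign table over Parquet files; the table, path, and rewritten SQL are illustrative, not the literal output of our planner:

```sql
-- query as written by the user against PostgreSQL
SELECT passenger_count, avg(fare_amount)
FROM taxi_rides
WHERE pickup_date >= '2024-01-01'
GROUP BY passenger_count;

-- the scan, filter, and aggregate can all be pushed down, so DuckDB
-- receives a query roughly like:
--   SELECT passenger_count, avg(fare_amount)
--   FROM read_parquet('s3://my-bucket/taxi/*.parquet')
--   WHERE pickup_date >= DATE '2024-01-01'
--   GROUP BY passenger_count;
```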
Of course, there is a lot more to it. The type system, functions, operators, and SQL constructs do not fully match, so we rewrite queries in subtle ways; for instance, PostgreSQL's numeric supports much higher precision than DuckDB's DECIMAL, which is capped at 38 digits. We also paid a lot of attention to correctly handling memory allocation, threads, cancellations, errors, locking, etc. Our goal is to provide you with a reliable, 100% PostgreSQL experience with significant performance improvements where possible.
Managed PostgreSQL with fast data lake analytics
Crunchy Bridge is a managed PostgreSQL service that runs on top of the major cloud providers. This means that a Crunchy Bridge for Analytics instance can run in the same AWS region as your S3 buckets, offering fast two-way access. Moreover, we take advantage of fast NVMe drives attached to the machine, which serve as an ultra-high-bandwidth cache.
Integrating DuckDB into PostgreSQL on the Crunchy Bridge platform means you get one of the most efficient ways of running analytical queries on your data lake with just a few clicks, along with the convenience of deep integration into PostgreSQL.
We have often found that companies with large amounts of data and high-performance analytics needs also have a growing set of requirements for how their database systems are managed. It's important for a managed database service to provide VPC integration, customer-managed keys, high availability (both in theory and in practice), disaster recovery, logging integrations, etc. We spent many years building Crunchy Bridge into a mature, robust, and performant managed service, based on lessons from building similar services at Heroku, Citus Data, and Azure.
Composing and extending PostgreSQL with DuckDB
Databases are increasingly becoming compositions of various components. It’s a trend that will undoubtedly move faster in analytics/OLAP than in the tightly-coupled, latency-sensitive OLTP side of database systems. However, PostgreSQL’s extensibility means it is not strictly an OLTP database anymore, but can be augmented with other query engines through extensions and auxiliary components.
Overall, we are very happy about choosing DuckDB as the query engine for our new analytics offering and would like to thank the DuckDB team for quick fixes on a small number of bugs we found. We find DuckDB is generally faster than DataFusion and more comprehensive than chDB, impressive as they are, and overall the best match for PostgreSQL.