Announcing pg_parquet v0.4.0: Google Cloud Storage, HTTPS storage, and more
Aykut Bozkurt
What began as a hobby Rust project to explore the PostgreSQL extension ecosystem and the Parquet file format has grown into a handy component for folks integrating Postgres and Parquet into their data architecture. Today, we’re excited to release version 0.4 of pg_parquet.
This release includes:
- COPY TO/FROM Google Cloud Storage
- COPY TO/FROM http(s) stores
- COPY TO/FROM stdin/stdout with (FORMAT PARQUET)
- Support for Parquet UUID, JSON, and JSONB types
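To make the new storage targets concrete, here is a minimal sketch; the bucket, URL, and table names are placeholders, and the gs:// and https:// URI schemes follow the project's documentation:

-- Export to Google Cloud Storage (placeholder bucket name)
COPY (SELECT * FROM mytable) TO 'gs://mybucket/data.parquet' WITH (format 'parquet');

-- Import from an https store (placeholder URL)
COPY mytable FROM 'https://example.com/data.parquet' WITH (format 'parquet');

-- Stream Parquet over stdout, e.g. from the shell:
--   psql -c "COPY mytable TO STDOUT WITH (format parquet)" > data.parquet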
If you're unfamiliar with pg_parquet: it makes it easy to export and import Parquet files directly within Postgres, without relying on third-party tools. It's a data-movement tool, not a query engine. A common pattern is to export data with pg_parquet into your data lake, where it can be processed by engines such as Snowflake, ClickHouse, or Redshift, or, if you want something Postgres native, Crunchy Data Warehouse.
What is Parquet?
Heard about Parquet but not sure what it is? Parquet is an open standard file format that is self-documenting for data types and comes with columnar compression. It is a flat file: a point-in-time snapshot of the data you're working with, or a subset of your tables. If you're looking to host a full database on cloud storage, consider Apache Iceberg, which adds a metadata layer and catalog on top of Parquet. For simply moving data around, pg_parquet integrates Postgres and Parquet with a simple SQL handshake.
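Because Parquet files carry their own schema, you can inspect one without leaving SQL. A minimal sketch, assuming pg_parquet's inspection helpers (parquet.schema and parquet.metadata, per the project README) and a hypothetical file path:

-- Show the column names and types embedded in a Parquet file (hypothetical path)
SELECT * FROM parquet.schema('/tmp/data.parquet');

-- Show row group and column-level metadata for the same file
SELECT * FROM parquet.metadata('/tmp/data.parquet');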
Working with pg_parquet
pg_parquet hooks into Postgres to provide support for moving data into and out of cloud storage via the Postgres COPY command. Work with COPY just like you normally would.
-- Copy a Postgres query result into a Parquet file
COPY (SELECT * FROM mytable) TO '/tmp/data.parquet' WITH (format 'parquet');

-- Copy a Postgres query result into Parquet in S3
COPY (SELECT * FROM mytable) TO 's3://mybucket/data.parquet' WITH (format 'parquet');

-- Load data from Parquet in S3 to Postgres
COPY mytable FROM 's3://mybucket/data.parquet' WITH (format 'parquet');
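The new type support rounds out these examples. Here is a minimal sketch of round-tripping the newly supported types; the table, columns, and path are hypothetical, and the mapping of uuid and jsonb to the Parquet UUID and JSON logical types is as described in this release:

-- Hypothetical table using the newly supported types
CREATE TABLE events (id uuid, payload jsonb);

-- Export: uuid and jsonb columns are written as Parquet UUID and JSON logical types
COPY events TO '/tmp/events.parquet' WITH (format 'parquet');

-- Import the same file back into Postgres
COPY events FROM '/tmp/events.parquet' WITH (format 'parquet');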
Conclusion
With version 0.4, pg_parquet continues to simplify moving data between Postgres and Parquet. Whether you're archiving data, populating a lakehouse, or bridging systems for analytics, pg_parquet covers a wide range of use cases. Now that it supports the major public cloud object stores and a broad set of data types, it's ready to be integrated into modern data workflows. And because it builds on Postgres COPY, pg_parquet stays lightweight, performant, and Postgres native.
We’re excited to see how the community puts this release to use and look forward to what’s next. Contributions and feedback are always welcome on GitHub.