Archive Postgres Partitions to Iceberg
Postgres comes with built-in partitioning, and you can layer in pg_partman for additional help with partition maintenance. Partitioning works quite well for keeping only a limited window of data around and for improving performance when your primary workload queries a small, time-series-focused subset of the data. Oftentimes, when implementing partitioning you keep only a portion of your data and drop older partitions as they age out to manage costs.
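If you do use pg_partman, the maintenance side might look roughly like the sketch below once the partitioned page_hits table from the next section exists. This assumes pg_partman 5.x’s create_parent call and part_config retention settings; check the docs for your installed version. The rest of this post creates and drops partitions by hand instead.
-- rough pg_partman sketch (5.x-style named arguments; adjust for your version)
SELECT partman.create_parent(
    p_parent_table := 'public.page_hits',
    p_control      := 'ingest_time',
    p_interval     := '1 day'
);
-- optionally have pg_partman drop partitions older than 30 days during maintenance
UPDATE partman.part_config
SET retention = '30 days',
    retention_keep_table = false
WHERE parent_table = 'public.page_hits';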
But what if we could move old partitions seamlessly to Iceberg, retaining all of our data forever while only keeping recent partitions in Postgres? Could we have the perfect world of a full long-term copy in Iceberg, easily queryable from a warehouse, with Postgres still functioning as the operational database holding the most recent 30 days of data?
With the latest replication support in Crunchy Data Warehouse this works seamlessly, so let’s dig in.
First, let’s set up our partitioning
If you’d like to follow along at home, here’s some code to set up a sample set of partitioned data resembling a web analytics data set.
CREATE TABLE page_hits (
    id SERIAL,
    site_id INT NOT NULL,
    ingest_time TIMESTAMPTZ NOT NULL,
    url TEXT NOT NULL,
    request_country TEXT,
    ip_address INET,
    status_code INT,
    response_time_msec INT,
    -- the partition key (ingest_time) must be part of any primary key on a partitioned table
    PRIMARY KEY (id, ingest_time)
) PARTITION BY RANGE (ingest_time);
This anonymous code block will create a set of daily partitions covering the last 30 days:
DO $$
DECLARE
    d DATE;
BEGIN
    FOR d IN SELECT generate_series(DATE '2025-04-20', DATE '2025-05-19', INTERVAL '1 day') LOOP
        EXECUTE format($f$
            CREATE TABLE IF NOT EXISTS page_hits_%s PARTITION OF page_hits
            FOR VALUES FROM ('%s') TO ('%s');
        $f$, to_char(d, 'YYYY_MM_DD'), d, d + INTERVAL '1 day');
    END LOOP;
END $$;
Your database should look something like this:
List of relations
Schema | Name | Type | Owner
--------+-------------------------+-------------------+-------------------
public | page_hits | partitioned table | postgres
public | page_hits_2025_04_20 | table | postgres
public | page_hits_2025_04_21 | table | postgres
...
public | page_hits_2025_05_18 | table | postgres
public | page_hits_2025_05_19 | table | postgres
public | page_hits_id_seq | sequence | postgres
Now we can generate some sample data. In this case we’re going to generate 1,000 rows for each daily partition:
DO $$
DECLARE
    d DATE;
BEGIN
    FOR d IN
        SELECT generate_series(DATE '2025-04-20', DATE '2025-05-19', '1 day'::INTERVAL)
    LOOP
        INSERT INTO page_hits (site_id, ingest_time, url, request_country, ip_address, status_code, response_time_msec)
        SELECT
            (random() * 30)::INT,
            d + (i || ' seconds')::INTERVAL,
            'http://example.com/' || substr(md5(random()::text), 1, 12),
            (ARRAY['China', 'India', 'Indonesia', 'USA', 'Brazil'])[1 + (random() * 4)::INT],
            inet '10.0.0.0' + (random() * 1000000)::INT,
            (ARRAY[200, 200, 200, 404, 500])[1 + (random() * 4)::INT],
            (random() * 300)::INT
        FROM generate_series(1, 1000) AS s(i);
    END LOOP;
END $$;
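A quick sanity check before we wire up replication: 30 daily partitions at 1,000 rows each should give us 30,000 rows in the parent table.
-- should return 30000 (30 days x 1,000 rows)
SELECT count(*) FROM page_hits;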
Now that we have some data in our Postgres setup, let’s connect to Crunchy Data Warehouse and get it replicated over.
Set up replication to Iceberg
Within the publication setup you want to specify publishing via the root partition with publish_via_partition_root = true. This keeps the partitions in Postgres but does not partition the Iceberg table, since Iceberg has its own organization of data files.
CREATE PUBLICATION hits_to_iceberg
FOR TABLE page_hits
WITH (publish_via_partition_root = true);
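If you want to double-check what the publication covers, the built-in pg_publication_tables view lists it (depending on your Postgres version it may show the root table or the individual partitions):
-- list tables covered by the publication
SELECT * FROM pg_publication_tables WHERE pubname = 'hits_to_iceberg';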
Set up the replication user
-- create a new user
CREATE USER replication_user WITH REPLICATION PASSWORD '****';
-- grant appropriate permissions
GRANT SELECT ON ALL TABLES IN SCHEMA public TO replication_user;
And on the warehouse end, subscribe to the originating data. Since we’ve specified create_tables_using = 'iceberg', this data will be stored in Iceberg.
CREATE SUBSCRIPTION http_to_iceberg
CONNECTION 'postgres://replication_user:****@p.qzyqhjdg3fhejocnta3zvleomq.db.postgresbridge.com:5432/postgres?sslmode=require'
PUBLICATION hits_to_iceberg
WITH (create_tables_using = 'iceberg', streaming, binary, failover);
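Once the subscription is created, the initial table sync copies the existing partition data into Iceberg. To confirm replication is flowing you can use the standard Postgres monitoring views on the warehouse side; nothing here is Crunchy-specific:
-- is the apply worker connected and receiving WAL?
SELECT subname, received_lsn, last_msg_receipt_time
FROM pg_stat_subscription;
-- per-table sync state ('r' = ready)
SELECT srrelid::regclass, srsubstate
FROM pg_subscription_rel;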
And here’s the Iceberg table.
List of relations
Schema | Name | Type | Owner
--------+-------------------------+---------------+-------------------
public | page_hits | foreign table | postgres
Now query data stored in Iceberg from Postgres
Here we can see the daily traffic insights for each country, breaking down the number of hits, success rate, average response time, and most common status code:
SELECT
    date_trunc('day', ingest_time) AS day,
    request_country,
    COUNT(*) AS total_hits,
    ROUND(100.0 * SUM(CASE WHEN status_code = 200 THEN 1 ELSE 0 END) / COUNT(*), 2) AS success_rate_percent,
    ROUND(AVG(response_time_msec), 2) AS avg_response_time_msec,
    MODE() WITHIN GROUP (ORDER BY status_code) AS most_common_status
FROM
    page_hits
GROUP BY
    day, request_country
ORDER BY
    day, request_country;
day | request_country | total_hits | success_rate_percent | avg_response_time_msec | most_common_status
------------------------+-----------------+------------+----------------------+------------------------+--------------------
2025-04-20 00:00:00+00 | Brazil | 128 | 68.75 | 146.83 | 200
2025-04-20 00:00:00+00 | China | 138 | 65.94 | 145.67 | 200
2025-04-20 00:00:00+00 | India | 245 | 64.90000000000001 | 153.8 | 200
2025-04-20 00:00:00+00 | Indonesia | 230 | 64.34999999999999 | 151.43 | 200
Now drop the older Postgres partition
Since the data is replicated and a copy lives in Iceberg, we can drop older partitions to free up storage and memory on our main operational Postgres database.
--drop partition
DROP TABLE page_hits_2025_04_20;
-- show missing partition in the table list
List of relations
Schema | Name | Type | Owner
--------+-------------------------+-------------------+-------------------
public | page_hits | partitioned table | postgres
public | page_hits_2025_04_21 | table | postgres
public | page_hits_2025_04_22 | table | postgres
-- query iceberg, data is still there
day | request_country | total_hits | success_rate_percent | avg_response_time_msec | most_common_status
------------------------+-----------------+------------+----------------------+------------------------+--------------------
2025-04-20 00:00:00+00 | Brazil | 128 | 68.75 | 146.83 | 200
2025-04-20 00:00:00+00 | China | 138 | 65.94 | 145.67 | 200
2025-04-20 00:00:00+00 | India | 245 | 64.90000000000001 | 153.8 | 200
2025-04-20 00:00:00+00 | Indonesia | 230 | 64.34999999999999 | 151.43 | 200
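In practice you’ll want to drop old partitions on a schedule rather than by hand. Here’s a minimal sketch (not part of the original walkthrough) that drops any page_hits partition older than 30 days, relying on the page_hits_YYYY_MM_DD naming convention used above; pg_partman’s retention settings or a pg_cron job can accomplish the same thing.
DO $$
DECLARE
    part TEXT;
BEGIN
    FOR part IN
        SELECT c.relname
        FROM pg_inherits i
        JOIN pg_class c ON c.oid = i.inhrelid
        JOIN pg_class p ON p.oid = i.inhparent
        WHERE p.relname = 'page_hits'
          -- parse the date out of the page_hits_YYYY_MM_DD suffix
          AND to_date(right(c.relname, 10), 'YYYY_MM_DD') < now() - INTERVAL '30 days'
    LOOP
        EXECUTE format('DROP TABLE %I', part);
    END LOOP;
END $$;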
Summary
Here’s the recipe for simple Postgres archiving with long-term, cost-effective data retention:
1 - Partition your high-throughput data - this is ideal for performance and management anyway.
2 - Replicate your data to Iceberg for easy reporting and long-term archiving.
3 - Drop Postgres partitions at the interval that fits your retention window.
4 - Continue to query archived data from Postgres.