Archive Postgres Partitions to Iceberg

Craig Kerstiens


Postgres comes with built-in partitioning, and you can layer in pg_partman for additional help with partition maintenance. Partitioning works quite well when your primary workload queries a small, time-series-focused subset of data: it makes it easy to retain a limited window of data and improves performance. Often, when implementing partitioning you keep only a portion of your data and drop older partitions as they age out, for cost management.
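If you go the pg_partman route, here's a rough sketch of what that setup could look like for the page_hits table we create below. This post sticks to plain declarative partitioning, and the parameter names assume pg_partman 5.x, so treat it as illustrative rather than a drop-in step:

-- Illustrative sketch: have pg_partman create and maintain daily partitions.
-- Assumes pg_partman 5.x and that the partitioned parent table already exists.
CREATE SCHEMA IF NOT EXISTS partman;
CREATE EXTENSION IF NOT EXISTS pg_partman WITH SCHEMA partman;

SELECT partman.create_parent(
    p_parent_table := 'public.page_hits',
    p_control      := 'ingest_time',
    p_interval     := '1 day'
);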

But what if we could move old partitions seamlessly to Iceberg, retaining all our data forever while keeping only recent partitions in Postgres? Could we have a perfect setup: a full, long-term copy in Iceberg that's easily queryable from a warehouse, with Postgres still functioning as the operational database holding the most recent 30 days of data?

With the latest replication support in Crunchy Data Warehouse this works seamlessly. Let's dig in.

First, let's set up our partitioning

If you’d like to follow along at home, here’s some code to set up a sample set of partitioned data resembling a web analytics data set.

CREATE TABLE page_hits (
    id SERIAL,
    site_id INT NOT NULL,
    ingest_time TIMESTAMPTZ NOT NULL,
    url TEXT NOT NULL,
    request_country TEXT,
    ip_address INET,
    status_code INT,
    response_time_msec INT,
    PRIMARY KEY (id, ingest_time)
) PARTITION BY RANGE (ingest_time);

This DO block will create a daily partition for each of the last 30 days.

DO $$
DECLARE
  d DATE;
BEGIN
  FOR d IN SELECT generate_series(DATE '2025-04-20', DATE '2025-05-19', INTERVAL '1 day') LOOP
    EXECUTE format($f$
      CREATE TABLE IF NOT EXISTS page_hits_%s PARTITION OF page_hits
      FOR VALUES FROM ('%s') TO ('%s');
    $f$, to_char(d, 'YYYY_MM_DD'), d, d + INTERVAL '1 day');
  END LOOP;
END $$;

Your database should look something like this:

                            List of relations
 Schema |          Name           |       Type        |       Owner
--------+-------------------------+-------------------+-------------------
 public | page_hits               | partitioned table | postgres
 public | page_hits_2025_04_20    | table             | postgres
 public | page_hits_2025_04_21    | table             | postgres
...
 public | page_hits_2025_05_18    | table             | postgres
 public | page_hits_2025_05_19    | table             | postgres
 public | page_hits_id_seq        | sequence          | postgres

Now we can generate some sample data. In this case we're going to generate 1,000 rows per day, which will land in each day's partition:

DO $$
DECLARE
  d DATE;
BEGIN
  FOR d IN
    SELECT generate_series(DATE '2025-04-20', DATE '2025-05-19', '1 day'::INTERVAL)
  LOOP
    INSERT INTO page_hits (site_id, ingest_time, url, request_country, ip_address, status_code, response_time_msec)
    SELECT
        (RANDOM() * 30)::INT,
        d + (i || ' seconds')::INTERVAL,
        'http://example.com/' || substr(md5(random()::text), 1, 12),
        (ARRAY['China', 'India', 'Indonesia', 'USA', 'Brazil'])[1 + (random() * 4)::INT],
        inet '10.0.0.0' + (random() * 1000000)::INT,
        (ARRAY[200, 200, 200, 404, 500])[1 + (random() * 4)::INT],
        (random() * 300)::INT
    FROM generate_series(1, 1000) AS s(i);
  END LOOP;
END $$;

Now that we have some data in Postgres, let's connect to our Crunchy Data Warehouse and get it replicated over.

Set up replication to Iceberg

When creating the publication, specify publish_via_partition_root = true so changes are published via the root partitioned table. This keeps the partitions in Postgres but does not partition the Iceberg table, since Iceberg has its own organization of data files.

CREATE PUBLICATION hits_to_iceberg
FOR TABLE page_hits
WITH (publish_via_partition_root = true);

Set up the replication user

-- create a new user
CREATE USER replication_user WITH REPLICATION PASSWORD '****';

-- grant appropriate permissions
GRANT SELECT ON ALL TABLES IN SCHEMA public TO replication_user;

And on the warehouse end, subscribe to the originating data. Since we've specified create_tables_using = 'iceberg', this data will be stored in Iceberg.

CREATE SUBSCRIPTION hits_to_iceberg
CONNECTION 'postgres://replication_user:****@p.qzyqhjdg3fhejocnta3zvleomq.db.postgresbridge.com:5432/postgres?sslmode=require'
PUBLICATION hits_to_iceberg
WITH (create_tables_using = 'iceberg', streaming, binary, failover);
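Not part of the original setup, but if you want to sanity-check that replication is flowing, the standard logical replication views apply on both ends:

-- On the warehouse (subscriber): is the subscription worker receiving changes?
SELECT subname, received_lsn, last_msg_receipt_time
FROM pg_stat_subscription;

-- On the source Postgres (publisher): is the replication slot active?
SELECT slot_name, active, confirmed_flush_lsn
FROM pg_replication_slots;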

And here’s the Iceberg table.

                          List of relations
 Schema |          Name           |     Type      |       Owner
--------+-------------------------+---------------+-------------------
 public | page_hits               | foreign table | postgres

Now query data stored in Iceberg from Postgres

Here we can see the daily traffic insights for each country, breaking down the number of hits, success rate, average response time, and top error codes:

SELECT
  date_trunc('day', ingest_time) AS day,
  request_country,
  COUNT(*) AS total_hits,
  ROUND(100.0 * SUM(CASE WHEN status_code = 200 THEN 1 ELSE 0 END) / COUNT(*), 2) AS success_rate_percent,
  ROUND(AVG(response_time_msec), 2) AS avg_response_time_msec,
  MODE() WITHIN GROUP (ORDER BY status_code) AS most_common_status
FROM
  page_hits
GROUP BY
  day, request_country
ORDER BY
  day, request_country;

          day           | request_country | total_hits | success_rate_percent | avg_response_time_msec | most_common_status
------------------------+-----------------+------------+----------------------+------------------------+--------------------
 2025-04-20 00:00:00+00 | Brazil          |        128 |                68.75 |                 146.83 |                200
 2025-04-20 00:00:00+00 | China           |        138 |                65.94 |                 145.67 |                200
 2025-04-20 00:00:00+00 | India           |        245 |    64.90000000000001 |                  153.8 |                200
 2025-04-20 00:00:00+00 | Indonesia       |        230 |    64.34999999999999 |                 151.43 |                200

Now drop the older Postgres partition

Since data is replicated and a copy is in Iceberg, we can drop partitions at a specific time to free up storage and memory on our main operational Postgres database.

-- drop the partition
DROP TABLE page_hits_2025_04_20;

-- show the missing partition in the table list
                            List of relations
 Schema |          Name           |       Type        |       Owner
--------+-------------------------+-------------------+-------------------
 public | page_hits               | partitioned table | postgres
 public | page_hits_2025_04_21    | table             | postgres
 public | page_hits_2025_04_22    | table             | postgres

-- query iceberg, data is still there
          day           | request_country | total_hits | success_rate_percent | avg_response_time_msec | most_common_status
------------------------+-----------------+------------+----------------------+------------------------+--------------------
 2025-04-20 00:00:00+00 | Brazil          |        128 |                68.75 |                 146.83 |                200
 2025-04-20 00:00:00+00 | China           |        138 |                65.94 |                 145.67 |                200
 2025-04-20 00:00:00+00 | India           |        245 |    64.90000000000001 |                  153.8 |                200
 2025-04-20 00:00:00+00 | Indonesia       |        230 |    64.34999999999999 |                 151.43 |                200
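You probably don't want to drop partitions by hand forever. Here's one way to automate it, assuming pg_cron is available on the source database; the job name and schedule are placeholders, and the retention math should be adjusted to your window:

-- Sketch: every day at 03:00 UTC, drop the partition that has aged past 30 days.
-- Assumes the pg_cron extension is installed on the source database.
SELECT cron.schedule(
  'drop-aged-page-hits-partition',
  '0 3 * * *',
  $$
    DO $job$
    BEGIN
      EXECUTE format(
        'DROP TABLE IF EXISTS page_hits_%s',
        to_char(now() - INTERVAL '31 days', 'YYYY_MM_DD')
      );
    END
    $job$;
  $$
);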

Summary

Here’s the recipe for simple Postgres archiving with long-term, cost-effective data retention:

1 - Partition your high-throughput data - this is ideal for performance and management anyway.

2 - Replicate your data to Iceberg for easy reporting and long term archiving.

3 - Drop partitions at the ideal interval.

4 - Continue to query archived data from Postgres.