Snap Clones with the Postgres Operator on Kubernetes
Getting frequent copies of data for development and testing purposes is a very common use case among our enterprise customers. Some customers request fresh data copies for each sprint or development cycle. This growing demand for copies can put a strain on IT budgets, both in the storage consumed and in the hours spent performing database refreshes.
A common process is to build or refresh environments using database restores. This approach has some challenges. First, the data must be moved from the backup location to newly provisioned storage, and moving large amounts of data around is time consuming and expensive. An alternative is to take storage snapshots of the Postgres data volumes; the risk with this approach is that a snapshot taken of a live data volume could be inconsistent, corrupted, or unusable.
Customers using Crunchy Postgres for Kubernetes can combine storage snapshots with pgBackRest delta restores for some of their non-production cloning use cases. This satisfies the need for frequent data copies while minimizing storage through thin provisioning and reducing risk. In this post, I will outline the steps to set up snap clones and thin provisioning. I've included example code snippets for cluster creation, configuring PVC storage, creating backup restore jobs, taking new snapshots, and creating a thin provisioned Postgres cluster.
Requirements
The following are requirements for using this solution:
- Crunchy Data Postgres Operator 5.1+
- pgBackRest version 2.38+
- PostgreSQL version 13+
- The archive copy option (archive-copy: "y") must be set in the PostgreSQL manifest under spec.backups.pgbackrest.global.
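To verify the pgBackRest version running in an existing cluster, one quick check (assuming the dedicated repo host pod naming used later in this post) is:
# Print the pgBackRest version from the dedicated repo host pod
kubectl exec accountdb-repo-host-0 -c pgbackrest -- pgbackrest version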
Part 1: Create PostgreSQL Cluster
Assuming that the Crunchy Postgres Operator has already been deployed, the first step is to deploy a PostgreSQL cluster that will serve as the primary database (the clone source). In this example, the PostgreSQL cluster is called accountdb, and the following yaml may be saved to a file and applied to have the Operator create the cluster.
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: accountdb
spec:
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres:ubi8-14.2-1
  openshift: false
  port: 5432
  postgresVersion: 14
  instances:
    - name: 'pgc'
      replicas: 1
      dataVolumeClaimSpec:
        accessModes:
          - 'ReadWriteOnce'
        resources:
          requests:
            storage: 20Gi
  backups:
    pgbackrest:
      manual:
        repoName: repo1
        options:
          - --type=incr
      global:
        archive-copy: 'y'
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:ubi8-2.38-0
      repos:
        - name: repo1
          volume:
            volumeClaimSpec:
              accessModes:
                - 'ReadWriteOnce'
              resources:
                requests:
                  storage: 20Gi
  patroni:
    dynamicConfiguration:
      postgresql:
        parameters:
          max_parallel_workers: 2
          max_worker_processes: 2
          shared_buffers: 256MB
          work_mem: 5MB
          archive_timeout: 120
The above was saved to a file named pg-accountdb.yaml and applied via kubectl.
kubectl apply -f pg-accountdb.yaml
Optionally, use pgbench to generate data in the database to increase its size. In the PVC outputs later in this post, you will see that about 3GB of data was generated.
pgbench -i -s 160
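If pgbench is not available locally, one option is to run it inside the primary database container itself. A sketch, assuming the crunchy-postgres image ships the pgbench client and using the same label selector that appears later in this post:
# Initialize pgbench tables at scale 160 (~16 million accounts rows)
# inside the primary Postgres container
kubectl exec -it -c database $(kubectl get pod \
  --selector="postgres-operator.crunchydata.com/cluster=accountdb,postgres-operator.crunchydata.com/role=master" -o name) \
  -- pgbench -i -s 160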
Part 2: Create PVC and Set Retention
Create a PVC that will serve as the staging storage. Applying the manifest below accomplishes this task. The PVC name includes the PostgreSQL cluster name to help identify it.
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: accountdb-clonestage
spec:
  storageClassName: freenas-nfs-csi
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
Since the delta restores will be maintained by a job that can come and go, the next step is to set the reclaim policy of the PVC's persistent volume to Retain.
kubectl patch pv $(kubectl get pvc accountdb-clonestage --template="{{ .spec.volumeName }}") -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
The result will be a configuration that looks similar to the following:
$ kubectl get pv,pvc
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
persistentvolume/pvc-508a64e3-4144-4047-8412-e0a3ee9c776b 20Gi RWO Retain Bound finance/accountdb-clonestage freenas-nfs-csi 9m12s
...
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
persistentvolumeclaim/accountdb-clonestage Bound pvc-508a64e3-4144-4047-8412-e0a3ee9c776b 20Gi RWO freenas-nfs-csi 9m13s
...
Part 3: Create pgBackRest Full Restore Job
The first execution of the restore will be against the empty staging PVC. If the PVC is not empty, the restore will fail. Note that the version of pgBackRest must match the repo pod version. Also, Postgres 14 is used in this example and several of the mounts/directories include this version number in their names; for other versions, be sure to adjust the directory and mount names.
The job could be created as a CronJob to run on a regular schedule (see the CronJob sketch at the end of this Part). For this post, a standard Job is used and executed manually to show how coordination with snapshots works. To create the job, the following is applied. With the PVC completely empty, the delta job should trigger a full restore.
kind: Job
apiVersion: batch/v1
metadata:
  name: accountdb-pgbackrest-deltarestore
spec:
  parallelism: 1
  completions: 1
  backoffLimit: 1
  template:
    metadata:
      labels:
        job-name: accountdb-pgbackrest-deltarestore
    spec:
      restartPolicy: Never
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30
      containers:
        - resources: {}
          terminationMessagePath: /dev/termination-log
          name: pgbackrest
          command:
            - bash
            - -ceu
            - --
            - |-
              /opt/crunchy/bin/pgbackrest-restore.sh
          env:
            - name: COMMAND
              value: restore
            - name: PGBACKREST_DELTA
              value: 'y'
            - name: PGBACKREST_PG1_PATH
              value: /pgdata/pg14
            - name: PGBACKREST_STANZA
              value: db
            - name: BACKREST_CUSTOM_OPTS
              value: '--type=none --repo=1 --delta --link-map=pg_wal=/pgdata/pg14_wal'
            - name: SELECTOR
              value: postgres-operator.crunchydata.com/cluster=accountdb,postgres-operator.crunchydata.com/pgbackrest=,postgres-operator.crunchydata.com/pgbackrest-dedicated=
          securityContext:
            privileged: false
            runAsNonRoot: true
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - mountPath: /etc/pgbackrest/conf.d
              name: pgbackrest-config
              readOnly: true
            - mountPath: /tmp
              name: tmp
            - name: pgdata
              mountPath: /pgdata
          terminationMessagePolicy: File
          image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:ubi8-2.38-0
      securityContext:
        runAsNonRoot: true
        fsGroup: 26
      serviceAccount: accountdb-pgbackrest
      serviceAccountName: accountdb-pgbackrest
      volumes:
        - emptyDir:
            sizeLimit: 16Mi
          name: tmp
        - name: pgdata
          persistentVolumeClaim:
            claimName: accountdb-clonestage
        - name: pgbackrest-config
          projected:
            defaultMode: 420
            sources:
              - configMap:
                  name: accountdb-pgbackrest-config
                  items:
                    - key: pgbackrest_instance.conf
                      path: pgbackrest_instance.conf
                    - key: config-hash
                      path: config-hash
              - secret:
                  items:
                    - key: pgbackrest.ca-roots
                      path: ~postgres-operator/tls-ca.crt
                    - key: pgbackrest-client.crt
                      path: ~postgres-operator/client-tls.crt
                    - key: pgbackrest-client.key
                      mode: 384
                      path: ~postgres-operator/client-tls.key
                  name: accountdb-pgbackrest
      dnsPolicy: ClusterFirst
The job only needs to run again when a new backup has been taken since the last delta restore. With the options above, pgBackRest performs a checksum-based delta restore: it scans all of the files to calculate checksums and verifies them against the files in the backup repository, which can be resource intensive. There is also an option to have pgBackRest perform a timestamp-based comparison, which is less demanding on resources.
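If a regular schedule is preferred over manual execution, the same pod template can be wrapped in a CronJob. A minimal sketch with an illustrative schedule; the elided pod spec would carry the same containers, env, volumes, and service account as the Job above:
kind: CronJob
apiVersion: batch/v1
metadata:
  name: accountdb-pgbackrest-deltarestore
spec:
  # Illustrative schedule: run the delta restore daily at 01:00
  schedule: '0 1 * * *'
  # Never start a new restore while a previous one is still running
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      backoffLimit: 1
      template:
        spec:
          restartPolicy: Never
          serviceAccountName: accountdb-pgbackrest
          # containers, env, volumeMounts, securityContext, and volumes
          # identical to the Job manifest above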
Part 4: Snapshot Storage
Storage vendors vary in how they perform snapshots, as well as in how snapshots can be used while still being preserved. Several Kubernetes options exist for snapshotting persistent volumes as well. For this example, TrueNAS is used to perform the snapshots. The process on NetApp or Oracle ZFS-SA will be very similar, but the exact tasks (API endpoints, order, etc.) will vary based on the vendor's solution.
curl -X POST http://jaxtruenas/api/v2.0/zfs/snapshot \
-H 'Authorization: Bearer 1-hICddUlGbjnqsvQajuhYeTRKhdEPuJJJW7o5LaevcHBV3zjJqgti2uQaE0Li3ulW' \
-H 'Content-Type: application/json' \
-d '{"dataset": "k8s/nfs/vols/pvc-508a64e3-4144-4047-8412-e0a3ee9c776b", "name": "accountdb-snap-20220506-1451"}'
A new snapshot called accountdb-snap-20220506-1451 was created. This snapshot represents a virtual full copy of the accountdb database that is current as of the latest backup. Because pgBackRest includes the necessary WAL files (archive-copy), the snapshot has everything needed to recover the database to a consistent state and bring it online.
When creating a Postgres clone from the snapshot, the default behavior is to recover to a consistent state and open the database. Optionally, one could point the clone at the source database's pgBackRest repo and specify a point in time for recovery. To avoid exposing the source database's pgBackRest repository to corruption, this is not recommended.
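For reference, a point-in-time restore in pgBackRest is expressed with options along these lines (illustrative target timestamp; these could be placed in the job's BACKREST_CUSTOM_OPTS):
# Illustrative pgBackRest point-in-time recovery options (not recommended
# against the source repo, per the caution above)
--type=time --target='2022-05-06 16:15:00+00' --target-action=promote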
Behind the scenes, TrueNAS is using ZFS. Looking at the ZFS snapshot, we see the following:
truenas> zfs list -t snapshot
NAME USED AVAIL REFER MOUNTPOINT
k8s/nfs/vols/pvc-508a64e3-4144-4047-8412-e0a3ee9c776b@accountdb-snap-20220506-1451 0B - 3.06G -
Note that the snapshot has a size of 0 since no blocks have changed since it was taken; it simply references the source volume's data.
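As an aside, on CSI drivers that support volume snapshots, the same operation can be expressed with a Kubernetes-native VolumeSnapshot. A minimal sketch; the VolumeSnapshotClass name is an assumption for your environment:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: accountdb-snap-20220506-1451
spec:
  # Assumed class name; must match a VolumeSnapshotClass provided by
  # your CSI driver
  volumeSnapshotClassName: freenas-nfs-csi-snapclass
  source:
    # Snapshot the staging PVC created in Part 2
    persistentVolumeClaimName: accountdb-clonestage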
Part 5: Generate 100MB of Change (+/-)
Next, introduce some change to the database and execute an incremental backup. The following can be executed via psql in the accountdb database.
create table employee (eid varchar(10), last_name varchar(100), first_name varchar(100), department varchar(4));
insert into employee (
eid, last_name, first_name, department
)
select
left(md5(i::text), 10),
md5(random()::text),
md5(random()::text),
left(md5(random()::text), 4)
from generate_series(1, 1000000) s(i);
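To sanity check how much change this produced, the table's total size can be queried from the same psql session (exact numbers will vary):
-- total size of the employee table, including indexes and TOAST
select pg_size_pretty(pg_total_relation_size('employee'));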
In the manifest used to create the accountdb Postgres cluster, the preferences for a manual backup were provided (spec.backups.pgbackrest.manual). To trigger the Operator to perform the incremental backup, all that is needed is to annotate the postgrescluster custom resource.
kubectl annotate postgrescluster accountdb postgres-operator.crunchydata.com/pgbackrest-backup="$( date '+%F_%H:%M:%S' )"
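The Operator reacts by launching a backup job, whose progress can be watched using the cluster label seen elsewhere in this post:
# Watch the Operator-created backup job until it completes
kubectl get jobs --selector=postgres-operator.crunchydata.com/cluster=accountdb --watch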
With the manual incremental backup now complete, the following backups exist in the pgBackRest repository.
$ kubectl exec accountdb-repo-host-0 -c pgbackrest -- pgbackrest info
stanza: db
    status: ok
    cipher: none

    db (current)
        wal archive min/max (14): 000000010000000B0000001D/000000060000000B0000003E

        full backup: 20220506-172122F
            timestamp start/stop: 2022-05-06 17:21:22 / 2022-05-06 17:29:07
            wal start/stop: 000000010000000B00000024 / 000000010000000B00000024
            database size: 23.4GB, database backup size: 23.4GB
            repo1: backup set size: 1.3GB, backup size: 1.3GB

        incr backup: 20220506-172122F_20220506-194425I
            timestamp start/stop: 2022-05-06 19:44:25 / 2022-05-06 19:54:21
            wal start/stop: 000000060000000B0000003E / 000000060000000B0000003E
            database size: 23.5GB, database backup size: 131.8MB
            repo1: backup set size: 1.3GB, backup size: 52MB
            backup reference list: 20220506-172122F
Part 6: Update Staging Clone
To bring the staging PVC current, execute the pgBackRest delta restore job again. To perform the delta restore the following kubectl
command is used to trigger Kubernetes to restart our job created earlier.
kubectl get job accountdb-pgbackrest-deltarestore -o json |
jq 'del(.spec.selector)' |
jq 'del(.spec.template.metadata.labels)' |
kubectl replace --force -f -
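Alternatively, if the Job manifest was saved to a file (say accountdb-deltarestore.yaml, an assumed filename), deleting and re-applying it accomplishes the same thing:
kubectl delete job accountdb-pgbackrest-deltarestore
kubectl apply -f accountdb-deltarestore.yaml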
Part 7: Create New Snapshot
Currently there is only one snapshot, accountdb-snap-20220506-1451, which now contains the fully restored database from the full backup. The next step is to take a new snapshot, which will represent a virtual full copy current to the latest backup set in the pgBackRest repository.
curl -X POST http://jaxtruenas/api/v2.0/zfs/snapshot \
-H 'Authorization: Bearer 1-hICddUlGbjnqsvQajuhYeTRKhdEPuJJJW7o5LaevcHBV3zjJqgti2uQaE0Li3ulW' \
-H 'Content-Type: application/json' \
-d '{"dataset": "k8s/nfs/vols/pvc-508a64e3-4144-4047-8412-e0a3ee9c776b", "name": "accountdb-snap-20220506-1615"}'
Looking at the ZFS volumes in TrueNAS, we can see what is taking place behind the scenes and how much storage is being used to provide these virtual full copies of the database.
Below is the original volume that contains the very latest data. This is the volume behind the Persistent Volume Claim (PVC) presented to pgBackRest for the delta restore. It is consuming 3.15G, the size of the full database.
truenas> zfs list | grep 776
NAME USED AVAIL REFER MOUNTPOINT
k8s/nfs/vols/pvc-508a64e3-4144-4047-8412-e0a3ee9c776b 3.15G 16.9G 3.15G /mnt/k8s/nfs/vols/pvc-508a64e3-4144-4047-8412-e0a3ee9c776b
Looking at the snapshots, our initial snapshot (accountdb-snap-20220506-1451), which initially used 0 bytes, is now consuming 1.27M, while the latest snapshot (accountdb-snap-20220506-1615) is consuming 0 bytes. How can this be? The latest snapshot points to the current live volume, and since the volume has not changed since that snapshot was taken, there are no changed blocks to store. For the original snapshot, blocks were changed by the last delta restore, and the prior versions of those changed blocks are now stored under the snapshot. We modified 100MB+, so the compression and deduplication done by TrueNAS is saving a lot of space.
truenas> zfs list -t snapshot
NAME USED AVAIL REFER MOUNTPOINT
k8s/nfs/vols/pvc-508a64e3-4144-4047-8412-e0a3ee9c776b@accountdb-snap-20220506-1451 1.27M - 3.06G -
k8s/nfs/vols/pvc-508a64e3-4144-4047-8412-e0a3ee9c776b@accountdb-snap-20220506-1615 0B - 3.15G -
Part 8: Clone Initial Snapshot and Create PVC
To preserve the snapshot, a thin clone of it is created. Keep in mind that procedures vary between storage solutions, but for TrueNAS a simple REST API call accomplishes this task. Since this thin clone will be used for development, it will be named accountclone1-snap1.
curl -X POST http://jaxtruenas/api/v2.0/zfs/snapshot/clone \
-H 'Authorization: Bearer 1-hICddUlGbjnqsvQajuhYeTRKhdEPuJJJW7o5LaevcHBV3zjJqgti2uQaE0Li3ulW' \
-H 'Content-Type: application/json' \
-d '{"snapshot": "k8s/nfs/vols/pvc-508a64e3-4144-4047-8412-e0a3ee9c776b@accountdb-snap-20220506-1451", "dataset_dst": "k8s/nfs/vols/accountclone1-snap1"}'
Looking at the ZFS volumes from within TrueNAS, we see the new thinly provisioned clone (a writable clone of the snapshot) consuming only 8K:
truenas> zfs list
NAME USED AVAIL REFER MOUNTPOINT
k8s/nfs/vols/accountclone1-snap1 8K 166G 3.06G /mnt/k8s/nfs/vols/accountclone1-snap1
One more step on the TrueNAS side is to enable NFS sharing for this clone.
curl -X POST http://jaxtruenas/api/v2.0/sharing/nfs \
-H 'Authorization: Bearer 1-hICddUlGbjnqsvQajuhYeTRKhdEPuJJJW7o5LaevcHBV3zjJqgti2uQaE0Li3ulW' \
-H 'Content-Type: application/json' \
-d '{ "paths": ["/mnt/k8s/nfs/vols/accountclone1-snap1"],
"comment": "accountdb snapshot for accountclone1-snap1",
"hosts": [],
"alldirs": false,
"ro": false,
"quiet": false,
"maproot_user": "root",
"maproot_group": "wheel",
"mapall_user": "",
"mapall_group": "",
"security": [],
"enabled": true,
"networks": []
}'
Now the clone is ready to be presented to Kubernetes. The content of the manifest will vary based on the storage class used in the Kubernetes environment. The following yaml is applied to create a persistent volume (PV) called accountclone1-snap1.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: accountclone1-snap1
spec:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 20Gi
  csi:
    driver: org.democratic-csi.nfs
    fsType: nfs
    volumeAttributes:
      node_attach_driver: nfs
      provisioner_driver: freenas-nfs
      server: 192.168.2.50
      share: /mnt/k8s/nfs/vols/accountclone1-snap1
    volumeHandle: accountclone1-snap1
  mountOptions:
    - noatime
    - nfsvers=4
  persistentVolumeReclaimPolicy: Retain
  storageClassName: freenas-nfs-csi
  volumeMode: Filesystem
Part 9: Create Thin Provisioned Postgres Cluster (accountclone1)
With the persistent volume in place, everything is ready to submit the following manifest to create our thinly provisioned development database (accountclone1). Notice the reference to the PV created in the previous step under spec.instances.dataVolumeClaimSpec.volumeName. This tells the Crunchy Operator to use the existing PV when creating the PVC for the new cluster.
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: accountclone1
spec:
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres:ubi8-14.2-1
  openshift: false
  port: 5432
  postgresVersion: 14
  instances:
    - name: 'pgc'
      replicas: 1
      dataVolumeClaimSpec:
        storageClassName: freenas-nfs-csi
        volumeName: accountclone1-snap1
        accessModes:
          - 'ReadWriteOnce'
        resources:
          requests:
            storage: 20Gi
  backups:
    pgbackrest:
      global:
        archive-copy: 'y'
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:ubi8-2.38-0
      repos:
        - name: repo1
          volume:
            volumeClaimSpec:
              accessModes:
                - 'ReadWriteOnce'
              resources:
                requests:
                  storage: 20Gi
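Before the pods come up, it can be confirmed that the Operator bound its new data PVC to the pre-created PV (a quick check using the cluster label):
# Verify the data PVC is bound to the accountclone1-snap1 PV
kubectl get pvc --selector=postgres-operator.crunchydata.com/cluster=accountclone1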
After instance recovery completes, our thinly provisioned Postgres development cluster is up and running.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
accountclone1-pgc-nhsj-0 3/3 Running 0 3m30s
...
If all is well, and you generated data with pgbench, we should have a pgbench_accounts table with 16,000,000 rows and no table called employee. Remember, this is from the original backup, taken before the employee table was created.
$ kubectl exec -it -c database $(kubectl get pod \
--selector="postgres-operator.crunchydata.com/cluster=accountclone1,postgres-operator.crunchydata.com/role=master" -o name) \
-- psql
psql (14.2)
Type "help" for help.
postgres=> select count(1) from pgbench_accounts;
  count
----------
 16000000
(1 row)
postgres=> select * from employee;
ERROR: relation "employee" does not exist
LINE 1: select * from employee;
^
In the same psql session above, let's create a new table and introduce some change.
create table customer (cid varchar(10), customer_name varchar(100), primary_contact varchar(100), location varchar(4));
insert into customer (
cid, customer_name, primary_contact, location
)
select
left(md5(i::text), 10),
md5(random()::text),
md5(random()::text),
left(md5(random()::text), 4)
from generate_series(1, 100000) s(i);
The SQL above created a new table called customer that consumes about 10 MB of space.
Part 10: Create Thin Provisioned Postgres Cluster (accountclone2)
With the development database running, let's verify that we can use the second snapshot to create a thin provisioned Postgres cluster that contains the employee table. Several of the steps repeat what was performed for accountclone1-snap1, so the description of each step is abbreviated.
Clone the snapshot and instruct TrueNAS to share out the new clone:
curl -X POST http://jaxtruenas/api/v2.0/zfs/snapshot/clone \
-H 'Authorization: Bearer 1-hICddUlGbjnqsvQajuhYeTRKhdEPuJJJW7o5LaevcHBV3zjJqgti2uQaE0Li3ulW' \
-H 'Content-Type: application/json' \
-d '{"snapshot": "k8s/nfs/vols/pvc-508a64e3-4144-4047-8412-e0a3ee9c776b@accountdb-snap-20220506-1615", "dataset_dst": "k8s/nfs/vols/accountclone2-snap1"}'
curl -X POST http://jaxtruenas/api/v2.0/sharing/nfs \
-H 'Authorization: Bearer 1-hICddUlGbjnqsvQajuhYeTRKhdEPuJJJW7o5LaevcHBV3zjJqgti2uQaE0Li3ulW' \
-H 'Content-Type: application/json' \
-d '{ "paths": ["/mnt/k8s/nfs/vols/accountclone2-snap1"],
"comment": "accountdb snapshot for accountclone2-snap1",
"hosts": [],
"alldirs": false,
"ro": false,
"quiet": false,
"maproot_user": "root",
"maproot_group": "wheel",
"mapall_user": "",
"mapall_group": "",
"security": [],
"enabled": true,
"networks": []
}'
With the storage side of the work complete, the next step is to create the persistent volume (PV) and have the Crunchy Operator create the Postgres cluster accountclone2. The following manifests are applied to accomplish this:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: accountclone2-snap1
spec:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 20Gi
  csi:
    driver: org.democratic-csi.nfs
    fsType: nfs
    volumeAttributes:
      node_attach_driver: nfs
      provisioner_driver: freenas-nfs
      server: 192.168.2.50
      share: /mnt/k8s/nfs/vols/accountclone2-snap1
    volumeHandle: accountclone2-snap1
  mountOptions:
    - noatime
    - nfsvers=4
  persistentVolumeReclaimPolicy: Retain
  storageClassName: freenas-nfs-csi
  volumeMode: Filesystem
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: accountclone2
spec:
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres:ubi8-14.2-1
  openshift: false
  port: 5432
  postgresVersion: 14
  instances:
    - name: 'pgc'
      replicas: 1
      dataVolumeClaimSpec:
        storageClassName: freenas-nfs-csi
        volumeName: accountclone2-snap1
        accessModes:
          - 'ReadWriteOnce'
        resources:
          requests:
            storage: 20Gi
  backups:
    pgbackrest:
      global:
        archive-copy: 'y'
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:ubi8-2.38-0
      repos:
        - name: repo1
          volume:
            volumeClaimSpec:
              accessModes:
                - 'ReadWriteOnce'
              resources:
                requests:
                  storage: 20Gi
With our cluster created, let's verify that our thin clone database indeed contains the employee table. Also, in the same session we will introduce some change (very cheap data redaction via an UPDATE statement) to look at the data storage in our conclusion.
$ kubectl exec -it -c database $(kubectl get pod \
--selector="postgres-operator.crunchydata.com/cluster=accountclone2,postgres-operator.crunchydata.com/role=master" -o name) \
-- psql
psql (14.2)
Type "help" for help.
postgres=> select count(1) from employee;
count
---------
1000000
(1 row)
postgres=> update employee set last_name=(SELECT string_agg(substring('0123456789bcdfghjkmnpqrstvwxyz', round(random() * 30)::integer, 1), ''));
UPDATE 1000000
Part 11: Review Space Consumption
With our thinly provisioned clones in place and some data updates performed, let's look at our space consumption.
truenas> zfs list
NAME USED AVAIL REFER MOUNTPOINT
k8s/nfs/vols/pvc-508a64e3-4144-4047-8412-e0a3ee9c776b 3.15G 16.9G 3.15G /mnt/k8s/nfs/vols/pvc-508a64e3-4144-4047-8412-e0a3ee9c776b
k8s/nfs/vols/accountclone1-snap1 21.8M 165G 3.08G /mnt/k8s/nfs/vols/accountclone1-snap1
k8s/nfs/vols/accountclone2-snap1 296M 165G 3.35G /mnt/k8s/nfs/vols/accountclone2-snap1
At the top of the list is our staging volume, and as expected it consumes the same amount of space as the original database; after all, it is a full copy. Our clones, on the other hand, consume only the space necessary to store their changed blocks: the changes performed on accountclone1 consume only about 22MB, while accountclone2 consumes 296MB.
Conclusion
In this small-scale example the absolute numbers may not seem like much, but avoiding a full copy of the database for each clone results in a cost reduction of over 80%: the two clones together consume roughly 0.3GB instead of the 6.3GB that two full copies would require. Think about the savings for databases that are terabytes in size.
Using pgBackRest delta restores and some advanced storage features can come in handy to refresh lower environments, perform recovery tests (after all, pgBackRest is doing a restore each time the staging environment is refreshed), and perhaps speed up a few recovery scenarios if the snapshots are adequate for the task.