Creation and clean up of WAL files in the primary's pg_wal folder (pg_xlog prior to PG10) is a normal part of PostgreSQL operation. The WAL files on the primary are used to ensure data consistency during crash recovery. Use of write-ahead logs (also called redo logs or transaction logs in other products) is common for data stores that must provide durability and consistency of data when writing to storage. The same technique is used in modern journaling and log-structured filesystems.
As the DB is operating, blocks of data are first written serially and synchronously as WAL files, then some time later, usually a very short time later, written to the DB data files. Once the data contained in these WAL files has been flushed out to their final destination in the data files, they are no longer needed by the primary. At some point, depending on your configuration, the primary will remove or recycle the WAL files whose data has been committed to the DB. This is necessary to keep the primary's disk from filling up. However, these WAL files are also what streaming replicas read when they are replicating data from the primary. If the replica is able to keep up with the primary, using these WAL files generally isn't an issue.
If the replica falls behind or is disconnected from the primary for an extended period of time, the primary may have already removed or recycled the WAL file(s) that a replica needs (but see Streaming Replication Slots below). A replica can fall behind on a primary with a high write rate. How far the replica falls behind will depend on network bandwidth from the primary, as well as storage performance on the replica.
To account for this possibility, we recommend keeping secondary copies of the WAL files in another location using a WAL archiving mechanism. This is known as WAL archiving and is done by ensuring archive_mode is turned on and a value has been set for the archive_command. These are variables contained in the postgresql.conf file.
Whenever the primary generates a WAL file, this command is run to make a secondary copy of it. Until that archive_command succeeds, the primary will keep that WAL file, so you must monitor for this command failing. Otherwise the primary's disk may fill. Once WAL archiving is in place, you can then configure your replicas to use that secondary location for WAL replay if they ever lose their connection to the primary.
This process is explained more in the PostgreSQL documentation.
Note that configuration details for creating a cluster have changed starting in PostgreSQL 12. In particular, the recovery.conf file on a replica instance no longer exists and those configuration lines are now part of postgresql.conf. If you're using PostgreSQL 12 or newer, read the documentation carefully and note the new files recovery.signal and standby.signal
Crunchy Data provides the pgBackRest tool which provides full WAL archiving functionality as well as maintaining binary backups. We do not recommend the simple copy mechanism given as an example in the documentation since that does not provide the resiliency typically required for production databases. pgBackRest provides full, differential and incremental backups as well as integrated WAL file management.
WAL archiving and backups are typically used together since this then provides point-in-time recovery (PITR) where you can restore a backup to any specific point in time as long as you have the full WAL stream available between all backups.
pgBackRest is an integral part of the Crunchy HA and Crunchy PostgreSQL Operator products, and is what we recommend for a binary backup and archive tool.
See https://pgbackrest.org/ for more information on pgBackRest.
The most common case where PostgreSQL won't start because it can't find a WAL file is in a replicated cluster where a replica has been disconnected from the cluster for some time. Most of the rest of this article will discuss ways to diagnose and recover this case.
For most of this article we will discuss the cluster case, where:
- Cluster of 2 or more PostgreSQL hosts
- Using WAL archiving via the PostgreSQL archive_command configuration, plus binary backups, likely and preferably pgbackrest
- On replicas, the recovery.conf (or postgresql.auto.conf in pg 12 and newer, see also standby.signal and recovery.signal) has a line with restore_command = that pulls WAL files from a binary backup location and applies them to the replica.
- If using pgbackrest, the line may look like:
restore_command = 'pgbackrest --stanza=demo archive-get %f "%p"'
- It's common to use both PostgreSQL binary streaming replication and WAL file archiving together.
- It's common to use a tool like pgbackrest for both binary backups and WAL archiving. Combining the two gives you the opportunity to restore a DB to a specific point in time (PITR).
- A replica will doesn't start completely and won't accept read-only connections.
- If using Crunchy HA (or Patroni), patronictl list may show no leader, or Lag in DB Unknown and cluster members stopped.
- This is a common symptom when the primary has already recycled/removed the requested WAL file:
2020-03-13 09:32:22.572 EDT  ERROR: requested WAL segment 00000002000000050000007C has already been removed
- You may see log entries in the pg_log logs like:
2020-04-17 14:29:49.479 P00 INFO: unable to find 0000001600000000000000C2 in the archive,
2020-04-17 14:29:49 EDT : [6-1], , FATAL: requested timeline 23 does not contain minimum recovery point 0/C2A56FC0 on timeline 22
and especially these:
2020-04-17 14:29:49 EDT : [5-1], , LOG: database system is shut down
2020-04-17 14:29:49 EDT : [1-1], , LOG: database system was interrupted while in recovery at log time 2020-04-17 14:19:28 EDT
2020-04-17 14:29:49 EDT : [2-1], ,
HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
The underlying cause for a replica getting out of sync and not able to replay WAL (either via streaming replication or from the WAL archive / backup) is almost always an infrastructure issue and almost always network connectivity interruptions. To have a HA cluster you must have reliable network connectivity between the cluster members and from each of the cluster members to the WAL archive (backup) location.
It's worth a reminder that time synchronization across cluster member hosts is critically important for correct cluster operation. Always check and confirm that all nodes in the cluster have a NTP service running (e.g. - ntpd or chronyd), and that the nodes are correctly synced to each other and to a master time source.
It is common to use the same tools for binary backups and WAL archiving. A common configuration in a HA cluster is to use pgbackrest as both the backup tool and the WAL archiving and playback tool. With pgbackrest and other binary backup tools, you will likely have a backup schedule that does a full backup of the cluster primary server periodically and differential or incremental backups periodically between the full backups. Along with the backup schedule are backup retention settings. For example, you may have configured your cluster to retain the last three full backups and the last three incremental or differential backups.
In addition to the backups, pgbackrest will retain WAL files that are needed to do a point-in-time recovery from your full, differential and incremental backups. So if a replica has been disconnected long enough (several backup cycles) for the archived WAL files to be past their retention period, when reconnected, PostgreSQL will be far behind the current state and will attempt to restore archived WAL files that no longer exist. In that case, you will need to reset or reinitialize the replica from current backups and the WAL files that are relative to them.
If the DB data disk fills completely on the replica, the replica will stop accepting and applying WAL updates from the primary. If this isn't caught and repaired for some time, the primary may have removed older WAL files. The length of time will depend on the configuration on the primary and the change rate on the primary. See the documentation for wal_keep_segments to retain more older WAL files.
Another failure mode is when you have intermittent network connectivity among hosts, and the cluster fails over and fails back several times. For each failover, a replica is promoted to primary (failover or switchover), and the WAL timeline is incremented. If one of the replicas can't communicate with the cluster for some time, its local DB will be based on an earlier timeline, and when it attempts to restore a WAL file from the earlier timeline, that WAL file won't be found in the archive, which will default to the current primary's timeline. Note that there is an option to specify the timeline in the recovery.conf file but you probably want to fix the replica to be on the current primary's timeline. See here for more information.
This is simplified when using pgbackrest, which is the recommended method and the method used by Crunchy HA, and the backup-standby option is enabled. With this option enabled, backups are done from a replica/standby host rather than the primary. There is more explanation here in the pgbackrest documentation.
Check and confirm that the former primary was properly re-synch'ed to the new primary. This should have been done automatically by your cluster software, and involves cloning the new primary's data directory to the new replica using a tool like pg_rewind or pgBackRest's delta restore (or full restore). If the new replica (former primary) is logging messages about being on an older timeline, the re-synch may not have happened or may not have been done correctly.
Repairing / Fixing a Replica
It's likely that the best and only option will be to restore the replica from the backup/archive server. We recommend pgbackrest. Crunchy Data products, including the Crunchy PostgreSQL Operator, Crunchy HA and others include and use pgbackrest.
- pgbackrest restore
- full restore vs delta restore. A pgbackrest delta restore can save time and resources. It checks the PostgreSQL destination and restore only files that are needed and don't exist in the destination.
If you're using Crunchy HA or the Crunchy PostgreSQL Operator, you want to use the HA layer tools rather than using pgbackrest directly.
to restore a cluster member node (PostgreSQL instance) from the backup/archive system.
Repairing / Fixing a Standalone or Failed Primary
- This one is potentially more serious and will be a reminder of why you want regular, reliable and carefully tested backups, especially the standalone server case, and why a properly managed replica and backups are a very good idea for data you like.
- This can be caused by broken or failing hardware, misconfigured storage, power failures, or mistakenly using OS level commands to modify the DB directory contents.
- Avoid the temptation to just run pg_resetwal (pg_resetxlog in versions prior to PG 10). More on this below, and please read the man page and/or pg docs on pg_resetwal risks.
- The first thing to do is properly and completely stop the failing standalone or primary.
- If your primary is part of a cluster, and your cluster uses software to manage auto-failover, like Crunchy HA, Patroni, or pacemaker/corosync, your cluster control software should detect a failure, and properly failover to one of the replicas. Occasionally it doesn't and you will have to manually stop the failed primary/standalone. If you're not using software to manage your cluster, failover is a manual process, where you first shut down the failed primary, then promote one of the replicas to be the new primary. If you are using software for auto-failover, and it hasn't automatically failed over, what you need to do is first, shutdown the failed primary instance, preferably using the same mechanism that was used to start it - patronictl if it's using Patroni or Crunchy HA, or pcs or crmsh if you're using pcs/corosync/pacemaker clustering, or an OS-level utility like systemd's systemctl or the Linux service, or pg_ctl. Even if those complete successfully, you want to check to be certain that all the PostgreSQL processes have stopped. Occasionally, none of those stop all the PostgreSQL processes. In that case, you need to use the Linux kill command. To stop the postmaster, the three signals that correspond to "smart", "fast" and "immediate" shutdown are SIGTERM, SIGINT and SIGQUIT. The very last option to use is SIGKILL (kill -9); avoid doing that unless it's a true emergency.
- Once all the PostgreSQL processes are no longer running, make a copy of the DB's data directory using your favorite OS-level snapshot utility - tar, cpio, rsync or your infrastructure's disk snapshotting tool. This is a safe copy in case attempts to repair the DB cause more damage.
- To get the DB back to healthy, the best option is to recreate the DB from a good backup. If you have a good replica (now likely the primary), make the failed primary a replica from the new primary, after fixing the underlying cause.
- In the worst case, where you have only the broken DB directory, you may be able to get it working again but you will likely lose data. First make an OS level copy of the DB directory. Avoid using OS level commands to move or delete DB files; you will likely make things worse.
- It may be tempting to just run pg_resetwal. While it may be possible to recover from some errors using pg_resetwal, you will likely lose data in the process. You can also do more damage to the DB. If you need to minimize data loss, don't have a replica to restore or a good backup, and don't have detailed knowledge of PostgreSQL, it's time to ask for help. It may be possible to recover most of your DB but read the documentation and note this advice from the documentation: "It should be used only as a last resort, when the server will not start due to such corruption."
Streaming Replication Slots
Streaming replication slots are a feature available since PostgreSQL 9.4. Using replication slots will cause the primary to retain WAL files until the primary has been notified that the replica has received the WAL file.
There is a tradeoff for keeping the WAL files on the primary until the primary is notified that the replica has received the WAL file. If the replica is unavailable for any reason, WAL files will accumulate on the primary until it can deliver them. On a busy database, with replicas unavailable, disk space can be consumed very quickly as unreplicated WAL files accumulate. This can also happen on a busy DB with relatively slow infrastructure (network and storage). Streaming replication slots are a very useful feature; if you do use them, carefully monitor disk space on the primary.
There is much more on replication slots and replication configuration generally.
If you have deployed HA across multiple datacenters, there's another layer of complexity. Crunchy HA provides example Ansible files for several MDC configurations. In the case where the primary data center fails and the DR data center has been promoted to primary, extra steps are required when you recover the original failed primary. In particular, when you start the recovered original primary, be certain that you start it as a DR data center. You will likely need to copy (rsync or other reliable mechanism) the pgbackrest repo from the new primary data center (old DR data center) when recovering the original primary data center.
What not to do / Don't do this
- Again, avoid just running pg_resetwal (pg_resetxlog). You may need to use it as a last resort, but don't start by running it.
- Do not try to cherry-pick or copy individual files from a backup / archive repo directly to the PostgreSQL the pg_wal (pg_xlog in earlier versions) directory. That is part of the database runtime that is managed by PostgreSQL. Deleting or copying things there can break your DB. Let the tools do their jobs.
- Never use OS tools to manually delete, modify or add files to the PostgreSQL pg_wal (pg_xlog) directory.
- Never use OS tools to manually delete, modify or add files to the pgBackRest repo. The pgBackRest repo contains state and metadata about current backups and which WAL files depend on which backups. (Note that listing or viewing the files in a pgbackrest repo can be helpful to diagnose replicas and restore issues. It's also OK to copy the repo in its entirety as long as you use a copy utility that verifies that the files have been copied correctly, for example rsync.)
June 19, 2020 •More by this author