As a database grows and scales up from a proof of concept to a full-fledged production instance, there are always a variety of growing pains that database administrators and systems administrators will run into.
Very often, the engineers on the Crunchy Data support team help support enterprise projects which start out as small, proof of concept systems, and are then promoted to large scale production uses. As these systems receive increased traffic load beyond their original proof-of-concept sizes, one issue may be observed in the Postgres logs as the following:
LOG: checkpoints are occurring too frequently (9 seconds apart) HINT: Consider increasing the configuration parameter "max_wal_size". LOG: checkpoints are occurring too frequently (2 seconds apart) HINT: Consider increasing the configuration parameter "max_wal_size".
This is a classic example of a database which has not been properly tuned for a high write load. In this post, we'll discuss what this means, some possible causes for this error, and some relatively simple ways to resolve the issue.
First, a look at the system settings and a brief discussion about what this error means.
The Postgres logs mentioned two specific things, checkpoints and max_wal_size. Investigating the Postgres instance to observe any settings related to these two items, we see the following:
select name, setting from pg_settings where name like '%wal_size%' or name like '%checkpoint%' order by name; name | setting ------------------------------+----------- checkpoint_completion_target | 0.9 checkpoint_flush_after | 32 checkpoint_timeout | 300 checkpoint_warning | 30 log_checkpoints | off max_wal_size | 1024 min_wal_size | 80 (7 rows)
max_wal_size sets the maximum amount of Write-Ahead-Logging (WAL) to grow between automatic checkpoints. This is a soft limit; WAL size can exceed max_wal_size under special circumstances, such as heavy load, a failing archive_command, or a high wal_keep_segments setting.
It should also be noted that increasing this parameter can increase the amount of time needed for crash recovery. The default value is 1GB (1024 MB).
As discussed in previous posts, the default configuration values for PostgreSQL are typically conservative, so as to work equally well on a large server as it would on a small, resource-constrained development machine. Because of this, it's likely that the default value observed here for max_wal_size is too low for the system generating the error messages we've seen.
Identifying the Issue
Next, let's look at why this low value for max_wal_size might be the related to the cause of the issue.
Obviously, the exact cause for this issue will vary from one situation to another, but generally speaking, when max_wal_size is low, and the database has a high number of updates or inserts happening quickly, it will tend to generate WAL faster than it can be archived, and faster than standard checkpoint processes can keep up.
As a result, if you have disk usage monitoring on your Postgres instance (you should!) you may also observe that the pg_wal directory increases in size dramatically as these WAL files are retained.
A brief aside:
There's a partner parameter for max_wal_size, which is it's opposite: min_wal_size. The parameter for min_wal_size defines the minimum size to shrink the WAL. As long as WAL disk usage stays below this setting while archiving, old WAL files are always recycled for future use at a checkpoint, rather than removed. This is useful to ensure that enough WAL space is reserved to handle spikes in WAL usage, for example when running large batch jobs. The default value for this is 80 MB.
How to Resolve
PostgreSQL helpfully informs us in the log file specifically what should be done: Increase the max_wal_size.
So, as suggested, edit the instance configuration files to increase the max_wal_size value to match the system's work load.
The ideal value, for most use cases, is to increase the value for max_wal_size such that it can hold at least one hour's worth of logs. The caveat here, however, is that you do not want to set this value extremely high, as it can increase the amount of time needed for crash recovery. If desired, the min_wal_size can also be increased, so that the system can handle spikes in WAL usage during batch jobs and other unusual circumstances. After making the appropriate configuration changes, and reloading Postgres, we can validate that the new settings are applied, as we expect:
name | setting ------------------------------+----------- checkpoint_completion_target | 0.9 checkpoint_flush_after | 32 checkpoint_timeout | 300 checkpoint_warning | 30 log_checkpoints | off max_wal_size | 16384 min_wal_size | 4096 (7 rows)
With these new settings in place, and with careful monitoring of the log files and system usage, the growing pains of scaling a system such as this up from a development device to a full-fledged production instance will be all but a distant memory.
For more information, and some interactive workshops on configuring PostgreSQL settings, please visit the Crunchy Postgres Developer Portal.
October 14, 2020 •More by this author