Patroni & etcd in High Availability Environments

Crunchy Data products often include High Availability. Patroni and etcd are two of our go-to tools for managing those environments. Today I wanted to explore how these work together. Patroni relies on proper operation of the etcd cluster to decide what to do with PostgreSQL. When communication between these two pieces breaks down, it creates instability in the environment resulting in failover, cluster restart, and even the the loss of a primary database. To fully understand the importance of this relationship, we need to understand a few core concepts of how these pieces work. First, we'll start with a brief overview of the components involved in HA systems and their role in the environment.

Overview of HA Infrastructure

HA systems can be setup in a single or multi-datacenter configuration. Crunchy supports HA on cloud, traditional, or containerized infrastructure. When used in a single datacenter, the environment is typically setup as a 3-node cluster on three separate database hosts. When used across multiple datacenters, the environment typically has an active datacenter, where the primary HA cluster and applications are running, and one or more standby datacenters, each containing a replica HA cluster that is always available. Although the setup may be different, the basic components and primary function of the environment remains the same.

Main HA components

For this article, we will focus on three basic components which are essential in both single datacenter and multi-datacenter environments:

PostgreSQL cluster: the database cluster, usually consisting of a primary and two or more replicas
Patroni: used as the failover management utility
etcd: used as a distributed configuration store (DCS), containing cluster information such as configuration, health, and current status.

How HA components work together

Each PostgreSQL instance within the cluster has one application database. These instances are kept in sync through streaming replication. Each database host has its own Patroni instance which monitors the health of its PostgreSQL database and stores this information in etcd. The Patroni instances use this data to:

keep track of which database instance is primary
maintain quorum among available replicas and keep track of which replica is the most "current"
determine what to do in order to keep the cluster healthy as a whole

Patroni manages the instances by periodically sending out a heartbeat request to etcd which communicates the health and status of the PostgreSQL instance. etcd records this information and sends a response back to Patroni. The process is similar to a heart monitoring device. Consistent, periodic pulses indicate a healthy database.

etcd Consensus Protocol

The etcd consensus protocol requires etcd cluster members to write every request down to disk, making it very sensitive to disk write latency. If Patroni receives an answer from etcd indicating the primary is healthy before the heartbeat times out, the replicas will continue to follow the current primary.

If the etcd system cannot verify writes before the heartbeats time out, or if the primary instance fails to renew its status as leader, Patroni will assume the cluster member is unhealthy and put the database into a fail-safe configuration. This will trigger an election to promote a new leader and the old primary is demoted and becomes a replica.

etcd consensus

Common Causes of Communication Failures

Communication failure between Patroni and etcd is one of the most common reasons for failover in HA environments. Some of the most common reasons for communication issues are:

an under-resourced file system
I/O contention in the environment
network transit timeouts

Under-resourced file system

Because HA solutions must be sufficiently resourced at all points at all times to work well, the proper resources must be available to the etcd server in order to mitigate failovers. As mentioned before, etcd consensus protocol requires etcd cluster members to write every request down to disk and every time a key is updated for a cluster, a new revision is created. When the system runs low on space (usage above 75%), etcd goes read/delete only until revisions and keys are removed or disk space is added. For optimal performance, we recommend keeping disk usage below 75%.

I/O contention

The etcd consensus protocol requires etcd cluster members to write every request down to disk, making it very sensitive to disk write latency. Systems under heavy loads, particularly during peak or maintenance hours, are susceptible to I/O bottlenecks as processes are forced to compete for resources. This contention can increase I/O wait time and prevent Patroni from receiving an answer from etcd before the heartbeat times out. This is especially true when running virtual machines as neighboring machines can impact I/O. Other sources of contention might be heavy I/O from PostgreSQL and excessive paging due to high connection rates and/or memory starvation.

Network Delay

The etcd system, which is critical for the stability of the HA solution, is experiencing issues that register as network transit timeouts. This could be due to either actual network timeouts or massive resource starvation at the etcd level. If you notice timeout errors typically coincide with periods of heavy network traffic, then network delay could be the root cause of these timeout errors.

Diagnosing the system

Confirm the issue

When troubleshooting your system, the best place to start is by checking the logs. If communication issues between Patroni and etcd are at the heart of the issue, you will most likely see errors in your log files like the examples below.

First, check PostgreSQL logs in order to rule out any issues with the PostgreSQL host itself. By default, these logs are stored under pg_log in the PostgreSQL data directory. If your logs are not in the default location, you can determine the exact location by running the command show log_directory ; in the database. Check for any indication of other PostgreSQL processes crashing or being killed prior to the error message. For example:

WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.

If no other PostgreSQL processes crashed, check the Patroni logs for any errors or events that may have occurred shortly before this error was logged by PostgreSQL. Messages like demoted self because failed to update leader lock in DCS and Loop time exceeded indicate communication and timeout issues with etcd.

Feb  1 13:45:05 patroni: 2021-02-01 13:45:05,510 INFO: Selected new etcd server http://10.84.32.146:2379
Feb  1 13:45:05 patroni: 2021-02-01 13:45:05,683 INFO: demoted self because failed to update leader lock in DCS
Feb  1 13:45:05 patroni: 2021-02-01 13:45:05,684 WARNING: Loop time exceeded, rescheduling immediately.
Feb  1 13:45:05 patroni: 2021-02-01 13:45:05,686 INFO: closed patroni connection to the postgresql cluster
Feb  1 13:45:05 patroni: 2021-02-01 13:45:05,705 INFO: Lock owner: None; I am pg1
Feb  1 13:45:05 patroni: 2021-02-01 13:45:05,706 INFO: not healthy enough for leader race
Feb  1 13:45:06 patroni: 2021-02-01 13:45:06,657 INFO: starting after demotion in progress
Feb  1 13:45:09 patroni: 2021-02-01 13:45:09,521 INFO: postmaster pid=1521
Feb  1 13:45:11 patroni: /var/run/postgresql:5434 - rejecting connections

If you see a message like the one above, the next step is to check etcd logs. Look for messages logged right before the message logged in the Patroni logs. If communication issues are to blame for your environment's behavior, you will likely see errors like the one below.

Feb  1 13:44:21 etcd: failed to send out heartbeat on time (exceeded the 100ms timeout for 39.177252ms)
Feb  1 13:44:21 etcd: server is likely overloaded

Narrow down the cause

Check disk space

To see if a lack of disk space is the root of the problem, check the disk space available to the etcd system by running the linux command df on the etcd directory, typically /var/lib/etcd. The disk space available to this directory should be checked on all servers, including the Patroni server. For optimal performance, we recommend keeping disk usage below 75%. If the amount of space used is approaching or exceeding 75%, then allocating more space to this directory may resolve the issue. (More information in the Recommendation section further below.)

Analyze performance metrics

If the file system still has plenty of space available, we will need to dig deeper to find the source of the problem by analyzing the overall performance of the system. A good place to start is with the sar command which is part of the Linux sysstat package and can be run on the system by any user. This command provides additional information about the system, such as system load, memory and CPU usage which can be used to pinpoint any bottlenecks or pain points in your system. By default, the command displays CPU activity and collects these statistics every 10 minutes.

The nice thing about sar is that it stores historical data by default with a one-month retention. On RHEL/CentOS/Fedora distributions, this data is stored under /var/log/sa/ for Debian/Ubuntu systems, it's stored under /var/log/sysstat/. The log files are named sa*dd*, where dd represents the day of the month. For example, the log file for the first of the month would be sa01, the file for the 15th would be sa15.

This means that if the sysstat package was installed and running on the server when the etcd timeout occurred, we can go back and analyze the performance data around the time of the incident. Note: Because the sar command only reports on local activities, each of the servers in the etcd quorum will need to be checked. If the sysstat package was not installed or was not running during the time of the incident, it will need to be installed and enabled so that this information will be available the next time the etcd timeout issue occurs. For our purposes, we will assume the package was running.

Going back to the etcd log example we looked at earlier, we can see that the timeout issue occurred at 13:44:21 on the first of the month. By specifying the relevant file name along with a start and end time in our sar command, we can extract the information relevant to the time of the incident. Note: Use a start time slightly before the timestamp of the error in order to see the state of the system before the timeout was triggered. For example:

sar -f /var/log/sa/sa01 -s 13:35:00 -e 13:50:00

Where:

-f: file name and path
-s: start time, in HH:MM:SS format
-e: end time, in HH:MM:SS format

Should give us an output that looks something like:

$ sar -f /var/log/sa/sa01 -s 13:30:00 -e 13:51:00

01:30:01 PM     CPU      %usr     %nice      %sys    %iowait    %steal    %idle
01:40:01 PM     all      2.71      0.00      2.02      0.92      0.00     94.32
01:50:01 PM     all      2.10      0.00      1.79      7.86      0.00     88.22
Average:        all      2.41      0.00      1.91      4.39      0.00     91.27

Here we can see a jump in %iowait between 1:40pm and 1:50pm, indicating a sudden burst of activity around the time of the etcd error, as suspected.

Adding the -d flag to the command will let us take a closer look at each device block and allow us to compare how long the I/O request took from start to finish (await column) with how long the requests actually took to complete (the svctm column):

$ sar -f /var/log/sa/sa01 -s 13:30:00 -e 13:51:00 -d

01:30:01 PM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
01:40:01 PM    dev8-0      4.36      2.19     49.63     11.88      0.03      7.59      5.67      2.47
01:40:01 PM    dev8-1      0.66      1.92    317.55    480.49      0.05     74.76      6.63      0.44
01:40:01 PM    dev8-2      0.09      0.85      1.05     21.13      0.00      4.31      4.31      0.04
01:40:01 PM    dev8-3      7.14      0.43    175.16     24.58      0.06      7.97      3.81      2.72
01:50:01 PM    dev8-0      4.40      1.91     45.88     10.86      1.00    226.16     28.65     12.61
01:50:01 PM    dev8-1      0.48      0.00     10.07     20.83      0.12    245.13     73.72      3.56
01:50:01 PM    dev8-2      0.24      0.00     57.48    241.24      0.27   1123.66    235.00      5.60
01:50:01 PM    dev8-3      6.91      0.01    165.57     23.98      0.78    112.61     19.24     13.28

Since requests can lose time waiting in a queue if the device is already busy and won't accept additional concurrent requests, it is not unusual for the service time to be slightly smaller than the waiting time. However, in this example we can see that I/O requests on dev8-2 took an average of 1123.66ms from start to finish even though they only took an average of 235ms to actually complete, which is a significant increase from where is was previously, when both await and svctm only took 4.31ms. Considering these times are averages, it isn't hard to imagine that any spikes that may have occurred were likely much higher than the time shown in this output. If you find similar jumps in your environment, then an under-resourced system is likely the cause of the timeout errors.

Solutions and Suggested Steps

Now that we have a better idea of what might be causing the issue, here are some things we can do to fix it:

Increase resources

If the etcd directory is running low on space (i.e. the amount of space used is approaching or exceeding 75%), allocate more disk space to this directory and see if the heartbeat timeout issue is resolved. Similarly, if you find spikes of %iowait and I/O contention that correlate with the time of the timeout incident, we recommend increasing the IOPS on all systems running the etcd quorum.

Find the cause of I/O spikes

While increasing resources may help in the short term, identifying the cause of the await jump is key to determining a long-term solution. Work with your systems administrator to diagnose and resolve the underlying cause of I/O contention in your environment.

Relocate your etcd

If etcd is sharing a storage device with another resource, consider relocating the etcd data to its own dedicated device to ensure that etcd has a dedicated I/O queue for any I/O that it needs. In a multi-node environment, this means one node should be dedicated entirely to etcd. For optimum performance choose a device with low-latency networking and low-latency storage I/O.

Please note: If the underlying issue is the disk itself, rather than just the disk performance, moving the etcd data to a new storage device may not fully correct the issue if other parts of the cluster are still reliant on the disk.

Resolve network delay

If communication errors persist after you have increased resources to the etcd system, then the only remaining cause is network delay. For a long-term solution, you will need to work with your network administrator to diagnose and resolve the underlying cause of network delay in your environment.

Increase Timeout Intervals

While increasing resources and resolving the underlying issue of I/O contention and/or network delay are the only way to fully resolve the issue, a short-term solution to the problem would be to increase the timeout interval. This will give etcd more time to verify and write requests to disk before timing out and Patroni triggers an election.

If you are using Crunchy Data's High-Availability solution, this can be accomplished by changing the heartbeat_interval parameter in your group_vars/etcd.yml file and rerunning your playbook. Below is an example as to how it should look like:

etcd_user_member_parameters:
  heartbeat_interval: <value>

If you are using another solution in your environment, you should be able to increase this setting by changing the parameter in your etcd configuration file, typically located under /etc/default/etcd.

IMPORTANT: Setting the heartbeat interval to a value that's too high will result in long election timeouts and the etcd cluster will take longer to detect leader failure. This should be treated as a last-ditch effort and only used as a way of mitigating the issue until the underlying cause can be diagnosed and resolved.

Conclusion

Hopefully you now have a better understanding of why Patroni's timely and consistent communication with etcd is essential to maintaining a healthy HA environment as well what you can do to diagnose and fix communication issues between the two.

Crunchy Data strongly recommends ensuring a good, reliable network to your DCS to prevent failover from occurring. We also strongly recommend monitoring your environment for disk space issues, archiving issues, failover occurrences, and replication slot failures.

Latest Articles