If you run Linux in production for any significant amount of time, you have likely run into the "Linux Assassin", that is, the OOM (out-of-memory) killer. When Linux detects that the system is running too low on memory, it identifies processes for termination and, well, assassinates them. The OOM killer has a noble role in ensuring a system does not run out of memory, but its actions can lead to unintended consequences.
For years the PostgreSQL community has made recommendations on how to set up Linux systems to keep the Linux Assassin away from PostgreSQL processes, which I will describe below. These recommendations carried forward from bare metal machines to virtual machines, but what about containers and Kubernetes?
Below is an explanation of experiments and observations I've made on how the Linux Assassin works in conjunction with containers and Kubernetes, and methods to keep it away from PostgreSQL clusters in your environment.
The first PostgreSQL community mailing list thread on the topic is
and the first
is right about the same time. The exact method suggested to skirt the Linux OOM Killer has changed slightly since that time, but it was, and currently still is, to avoid memory overcommit, i.e. in recent years by setting vm.overcommit_memory=2.
Avoidance of memory overcommit means that when a PostgreSQL backend process requests memory and the request cannot be met, the kernel returns an error which PostgreSQL handles appropriately. Therefore, although the offending client then receives an error from PostgreSQL, importantly the client connection is not killed, nor are any other PostgreSQL child processes (see below).
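As a quick way to see which overcommit policy a host is running, here is a minimal Python sketch. It assumes a Linux host exposing the standard /proc/sys/vm/overcommit_memory file; the helper names are hypothetical and exist only for illustration.

```python
from pathlib import Path

# Meanings of the three vm.overcommit_memory modes, per the kernel documentation.
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (kernel default)",
    1: "always overcommit",
    2: "strict accounting, no overcommit",
}


def describe_overcommit(raw: str) -> str:
    """Translate the raw contents of the sysctl file into a readable label."""
    mode = int(raw.strip())
    return OVERCOMMIT_MODES.get(mode, f"unknown mode {mode}")


def overcommit_is_strict(raw: str) -> bool:
    """True only for mode 2, the setting recommended for PostgreSQL hosts."""
    return int(raw.strip()) == 2


if __name__ == "__main__":
    sysctl_file = Path("/proc/sys/vm/overcommit_memory")  # Linux only
    if sysctl_file.exists():
        raw = sysctl_file.read_text()
        print(describe_overcommit(raw), "| strict:", overcommit_is_strict(raw))
```

Run on a database host, anything other than "strict accounting" suggests the OOM killer, rather than a clean error returned to PostgreSQL, is what will enforce memory limits.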
In addition, or when that is not possible, the recommendation is to set oom_score_adj=-1000 for the parent "postmaster" process via the privileged startup mechanism (e.g. service script or systemd unit file), and oom_score_adj=0 for all child processes via two environment variables that are read during child process startup. This ensures that should the OOM killer need to reap one or more processes, the postmaster will be protected, and the most likely candidate to get killed will be a client backend. That way the damage can be minimized.
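To make the two halves of that recipe concrete, here is a rough Python sketch of what a privileged launcher might do. The environment variable names PG_OOM_ADJUST_FILE and PG_OOM_ADJUST_VALUE are the ones given in the PostgreSQL documentation; the function names are hypothetical, and a real deployment would normally do this in a service script or systemd unit rather than in Python.

```python
import os

# Variable names as given in the PostgreSQL documentation on Linux memory
# overcommit; the postmaster's child processes read them at startup.
OOM_ADJUST_FILE_VAR = "PG_OOM_ADJUST_FILE"
OOM_ADJUST_VALUE_VAR = "PG_OOM_ADJUST_VALUE"


def postmaster_environment(base_env: dict) -> dict:
    """Return a copy of base_env asking child processes to reset their score to 0."""
    env = dict(base_env)
    env[OOM_ADJUST_FILE_VAR] = "/proc/self/oom_score_adj"
    env[OOM_ADJUST_VALUE_VAR] = "0"
    return env


def protect_current_process(adj_path: str = "/proc/self/oom_score_adj") -> None:
    """Write -1000 so the OOM killer skips this process.

    Lowering the value typically requires root or CAP_SYS_RESOURCE, which is
    why this belongs in the privileged startup mechanism.
    """
    with open(adj_path, "w") as handle:
        handle.write("-1000\n")


if __name__ == "__main__":
    env = postmaster_environment(dict(os.environ))
    print(env[OOM_ADJUST_FILE_VAR], env[OOM_ADJUST_VALUE_VAR])
```

The launcher would call protect_current_process() first and then start the postmaster with the returned environment, so the parent inherits -1000 while each backend resets itself to 0.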
It is worth a small detour to cover the OOM Killer in a bit more detail in order to understand what oom_score_adj does. However, the true details are complex, with a long, sordid history (certainly not all inclusive, but for a nice summary of articles on the OOM killer see LWN), so this description is still a simplification.
At the host OS level, when the system becomes too short of memory, the OOM killer kicks in. In a nutshell, it will determine which process has the highest oom_score, and kill it with a SIGKILL signal. The value of oom_score for a process is essentially "percentage of host memory consumed by this process" times 10 (let's call that "memory score"), plus oom_score_adj.
The value of oom_score_adj may be set to any value in the range -1000 to +1000, inclusive. As mentioned above, note that oom_score_adj=-1000 is a magic value in that the OOM killer will never reap a process with this setting.
Combining these two bits of kernel trivia results in the value of oom_score ranging from 0 to 2000. For example, a process with oom_score_adj=-998 that uses 100% of host memory (i.e. a "memory score" of 1000) has an oom_score equal to 2 (1000 + -998), and a process with oom_score_adj=500 that uses 50% of host memory (i.e. a "memory score" of 500) has an oom_score equal to 1000 (500 + 500). Obviously this means that a process consuming a large portion of system memory with a high oom_score_adj is at or near the top of the list for the OOM killer.
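The arithmetic can be sketched in a few lines of Python. This models only the simplified description given here, not the kernel's actual scoring code, and the function names are hypothetical.

```python
def memory_score(mem_percent: float) -> int:
    """Percentage of host memory consumed by the process, times 10 (0 to 1000)."""
    return round(mem_percent * 10)


def simplified_oom_score(mem_percent: float, oom_score_adj: int) -> int:
    """Simplified model: memory score plus oom_score_adj, never below zero."""
    if not -1000 <= oom_score_adj <= 1000:
        raise ValueError("oom_score_adj must be within -1000..1000")
    return max(0, memory_score(mem_percent) + oom_score_adj)


def is_exempt(oom_score_adj: int) -> bool:
    """-1000 is the magic value: the OOM killer never reaps such a process."""
    return oom_score_adj == -1000


if __name__ == "__main__":
    print(simplified_oom_score(100, -998))  # 2
    print(simplified_oom_score(50, 500))    # 1000
```

Under this model a process at -1000 always scores 0, which matches its exempt status, while the worst case of 100% memory use with an adjustment of +1000 gives the upper bound of 2000.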
The OOM killer works pretty much the same at the cgroup level, except for a couple of small but important differences.
First of all, the OOM killer is triggered when the sum of memory consumed by the
cgroup processes exceeds the assigned cgroup memory limit. While running a shell
in a container, the former can be read from
/sys/fs/cgroup/memory/memory.usage_in_bytes and the latter from
Secondly, only processes within the offending cgroup are targeted. But the
cgroup process with the highest
oom_score is still the first one to go.
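A small Python sketch of that usage-versus-limit comparison follows. It assumes a cgroup v1 layout; the usage file is the one named above, while memory.limit_in_bytes as the limit file is my assumption based on the standard cgroup v1 memory controller. The function names are hypothetical.

```python
from pathlib import Path

# cgroup v1 memory controller directory as seen from inside a container.
CGROUP_V1_MEMORY_DIR = Path("/sys/fs/cgroup/memory")


def read_bytes_value(path: Path) -> int:
    """Read a single integer byte count from a cgroup control file."""
    return int(path.read_text().strip())


def cgroup_memory_ratio(cgroup_dir: Path = CGROUP_V1_MEMORY_DIR) -> float:
    """Fraction of the cgroup memory limit currently in use."""
    usage = read_bytes_value(cgroup_dir / "memory.usage_in_bytes")
    # Assumed file name for the limit (cgroup v1); not shown in the text above.
    limit = read_bytes_value(cgroup_dir / "memory.limit_in_bytes")
    return usage / limit


if __name__ == "__main__":
    if CGROUP_V1_MEMORY_DIR.exists():
        print(f"cgroup memory in use: {cgroup_memory_ratio():.1%}")
```

When this ratio approaches 1.0 the cgroup is close to triggering the OOM killer, regardless of how much free memory the host itself has.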
Some of the reasons for this emphasis on OOM Killer avoidance are:
- Lost committed transactions: if the postmaster (or in HA setups the controlling Patroni process) is killed, and replication is asynchronous (which is usually the case), transactions that have been committed on the primary database may be lost entirely when the database cluster fails over to a replica.
- Lost active connections: if a client backend process is killed, the postmaster assumes shared memory may have been corrupted, and as a result it kills all active database connections and goes into crash recovery (rolls forward through transaction logs since the last checkpoint).
- Lost inflight transactions: when client backend processes are killed, transactions which have been started but not committed will be lost entirely. At that point the client application is the only source for the inflight data.
- Down time: A PostgreSQL cluster has only a single writable primary node. If it goes down, at least some application down time is incurred.
- Reset statistics: the crash recovery process causes collected statistics to be reset (i.e. zeroed out). This affects maintenance operations such as autovacuum and autoanalyze, which in turn will cause performance degradation, or in severe cases outages (e.g. due to out of disk space). It also affects the integrity of monitoring data collected on PostgreSQL, potentially causing lost alerts.
Undoubtedly there are others neglected here.
There are several problems related to the OOM killer when PostgreSQL is run under Kubernetes which are noteworthy:
Kubernetes actively sets vm.overcommit_memory=1. This leads to promiscuous overcommit behavior and is in direct contrast with PostgreSQL best practice. It greatly increases the probability that OOM Killer reaping will be necessary.
Even worse, an OOM kill can happen even when the host node does not have any memory pressure. When the memory usage of a cgroup (pod) exceeds its memory limit, the OOM killer will reap one or more processes in the cgroup.
oom_score_adj values are almost completely out of control of the PostgreSQL pods, preventing any attempt at following the long established best practices described above. I have created an issue on the Kubernetes github for this, but unfortunately it has not gotten much traction.
Kubernetes defaults to enforcing swap disabled. This is directly in opposition to the recommendation of Linux kernel developers. For example, see Chris Down's excellent blog on why swap should not be disabled. In particular, I have observed dysfunctional behaviors in memory constrained cgroups when switching from I/O dominant workloads to anonymous memory intensive ones. Evidence of other folks who have run into this issue can be seen in this article discussing the need for swap:
"There is also a known issue with memory cgroups, buffer cache and the OOM killer. If you don’t use cgroups and you’re short on memory, the kernel is able to start flushing dirty and clean cache, reclaim some of that memory and give it to whoever needs it. In the case of cgroups, for some reason, there is no such reclaim logic for the clean cache, and the kernel prefers to trigger the OOM killer, who then gets rid of some useful process."
There is also an issue on the Kubernetes GitHub for this problem, which is still being debated more than three years later.
Kubernetes defines 3 Quality of Service (QoS) levels. They impact more than just OOM killer behavior, but for the purposes of this paper only the OOM killer behavior will be addressed. The levels are:
- Guaranteed: the memory limit and request are both set and equal for all containers in the pod.
- Burstable: no memory limit, but with a memory request for all containers in the pod.
- Best Effort: everything else.
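The three definitions above can be expressed as a small Python classifier. This is a sketch of the memory-only summary given here, not of the full Kubernetes rules (which also take CPU settings into account); the type alias and function name are hypothetical.

```python
from typing import Optional, Sequence, Tuple

# (memory request, memory limit) in bytes for one container; None means unset.
ContainerMemory = Tuple[Optional[int], Optional[int]]


def qos_class(containers: Sequence[ContainerMemory]) -> str:
    """Classify a pod using the simplified, memory-only definitions above."""
    if containers and all(
        req is not None and lim is not None and req == lim
        for req, lim in containers
    ):
        return "Guaranteed"
    if containers and all(
        req is not None and lim is None for req, lim in containers
    ):
        return "Burstable"
    return "Best Effort"


if __name__ == "__main__":
    gib = 1024 ** 3
    print(qos_class([(4 * gib, 4 * gib)]))  # Guaranteed
    print(qos_class([(4 * gib, None)]))     # Burstable
    print(qos_class([(None, None)]))        # Best Effort
```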
With a Guaranteed QoS pod the values for oom_score_adj are almost as desired; PostgreSQL might not be targeted in a host memory pressure scenario. But the cgroup "kill if memory limit exceeded" behavior is undesirable. Relevant characteristics are as follows:
- oom_score_adj=-998: this is good, but not the recommended -1000 (OOM killer disabled).
- The documented environment variables are able to successfully reset oom_score_adj=0 for the postmaster children, which is also good.
With a Burstable QoS pod, oom_score_adj values are set very high, and with surprising semantics (smaller requested memory leads to higher oom_score_adj). This makes PostgreSQL a prime target if/when the host node is under memory pressure. If the host node had vm.overcommit_memory=2, this situation would be tolerable because OOM kills would be unlikely if not impossible. However, as noted above, Kubernetes recommends/sets vm.overcommit_memory=1. Relevant characteristics are as follows:
- The cgroup memory constraint OOM killer behavior does not apply -- this is good.
- oom_score_adj=(1000 - 10 * (percent avail mem requested)) (this is a slight simplification; there is also an enforced minimum value of 2 and maximum value of 999): this leads to a very small pod getting a higher score adjustment than a very large one. E.g. a pod requesting 1% of available memory will get oom_score_adj=990, while one requesting 50% of available memory will get oom_score_adj=500. This in turn means that if the smaller pod is idle, using essentially no resources, it might, for example, have oom_score=(0.1*10)+990=991, while the larger pod might be using 40% of system memory and get oom_score=(40*10)+500=900.
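The Burstable heuristic is easy to play with in Python. The sketch below implements the simplified formula with its minimum of 2 and maximum of 999, and combines it with the "memory score" model from earlier; function names are hypothetical, and it is not the actual kubelet code.

```python
def burstable_oom_score_adj(percent_requested: float) -> int:
    """1000 - 10 * (percent of available memory requested), clamped to 2..999."""
    raw = round(1000 - 10 * percent_requested)
    return min(999, max(2, raw))


def pod_oom_score(percent_used: float, oom_score_adj: int) -> int:
    """Simplified oom_score: memory score (percent used * 10) plus the adjustment."""
    return round(percent_used * 10) + oom_score_adj


if __name__ == "__main__":
    small_adj = burstable_oom_score_adj(1)   # 990
    large_adj = burstable_oom_score_adj(50)  # 500
    print(pod_oom_score(0.1, small_adj))     # idle small pod: 991
    print(pod_oom_score(40, large_adj))      # busy large pod: 900
```

The idle small pod outranks the busy large one, so under host memory pressure the OOM killer would pick the small pod first even though killing it frees almost nothing.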
- The ideal solution would be for the kernel to provide a mechanism allowing behavior equivalent to vm.overcommit_memory=2, except acting at the cgroup level. In other words, allow a process making an excess memory request within a cgroup to receive an "out of memory" error instead of using the OOM Killer to enforce the constraint. This would be the ideal solution because most users seem to want Guaranteed QoS pods, but currently the memory limit enforcement via OOM killer is a problem.
- Another desired change is for Kubernetes to provide a mechanism to allow certain pods (with suitable RBAC controls on which ones) to override the oom_score_adj values which are currently set based on QoS heuristics. This would allow PostgreSQL pods to actively set oom_score_adj to recommended values. Hence the PostgreSQL postmaster process could have the recommended oom_score_adj=-1000, the PostgreSQL child processes could be set to oom_score_adj=0, and Burstable QoS pods would be a more reasonable alternative.
- Finally, running Kubernetes with swap enabled should not be such a no-no. It took some digging, and I have not personally tested it, but a workaround is mentioned in the very long GitHub issue discussed earlier.
In typical production scenarios the OOM killer semantics described above may never be an issue. Essentially, if your pods are sized well, hopefully based on testing and experience, and you do not allow execution of arbitrary SQL, the OOM killer will probably never strike.
On development systems, OOM killer action might be more likely to occur, but probably not so often as to be a real problem.
However, if the OOM killer has caused distress or consternation in your environment, here are some suggested workarounds.
- Ensure your pod is Guaranteed QoS (memory limit and memory request sizes set the same).
- Monitor cgroup memory usage and alert on a fairly conservative threshold, e.g. 50% of the memory limit setting.
- Monitor and alert on OOM Killer events.
- Adjust memory limit/request for the actual maximum memory use based on
- Ensure your pod is Burstable QoS (with a memory request, but without a memory limit).
- Monitor Kubernetes host memory usage and alert on a fairly conservative threshold, e.g. 50% of physical memory.
- Monitor and alert on OOM Killer events.
- Adjust Kubernetes host settings to ensure OOM killer is never invoked.
- Accept the fact that some OOM Killer events will occur. Monitoring history will inform the statistical likelihood and expected frequency of occurrence.
- Ensure your application is prepared to retry transactions for lost connections.
- Run a High Availability cluster.
- Depending on actual workload and usage patterns, the OOM killer event frequency may be equal or nearly equal to zero.
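For the "monitor and alert" items, here is a minimal Python sketch of the two checks involved. It assumes the OOM kill counter is read from a cgroup file containing an "oom_kill N" line (as in cgroup v2's memory.events); how the file is located and how alerts are delivered is left out, and all function names are hypothetical.

```python
def parse_oom_kill_count(text: str) -> int:
    """Extract the oom_kill counter from 'key value' lines; 0 if absent."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "oom_kill":
            return int(fields[1])
    return 0


def new_oom_kills(previous_count: int, current_text: str) -> int:
    """Number of OOM kills since the last poll (never negative)."""
    return max(0, parse_oom_kill_count(current_text) - previous_count)


def usage_alert(usage_bytes: int, limit_bytes: int, threshold: float = 0.5) -> bool:
    """True when usage reaches the conservative threshold (default 50%)."""
    return usage_bytes >= threshold * limit_bytes


if __name__ == "__main__":
    sample = "low 0\nhigh 0\nmax 3\noom 1\noom_kill 1\n"
    print(new_oom_kills(0, sample))          # 1
    print(usage_alert(600, 1000))            # True
```

A monitoring agent would poll both values on an interval, remember the last oom_kill count, and raise an alert whenever new_oom_kills() is non-zero or usage_alert() returns True.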
Crunchy Data is actively working with the PostgreSQL, Kubernetes, and Linux Kernel communities to improve the OOM killer behavior. Some possible longer term solutions include:
- Linux kernel: cgroup level overcommit control
- Kubernetes: oom_score_adj override control, swap enablement normalized
- Crunchy: Explore possible benefits from using cgroup v2 under kube 1.19+
The dreaded Linux Assassin has been around for many years and shows no signs of retiring soon. But you can avoid being targeted through careful planning, configuration, monitoring, and alerting. The world of containers and Kubernetes brings new challenges, but the requirements for diligent system administration remain very much the same.
February 9, 2021