PostgreSQL Node Metrics: Don't Fly Blind

Avatar for Joe Conway

Joe Conway

13 min read

tl;dr -- If you are not actively monitoring your PostgreSQL clusters, and alerting on anomalous conditions, you are "flying blind". Don't do that!

At Crunchy Data, we distribute and support PostgreSQL as a product, in multiple forms, to many enterprise customers. We also make our products available via GitHub, our developer portal, and other places.

In particular, our team is well known for our Container Suite and Kubernetes Postgres Operator. Also pertinent to this blog, if less well known, is our pgMonitor project.

Recently we decided to bolster our monitoring capability as deployed by our Operator to include better node metrics. Specifically the goals were to:

  1. Gain access to certain Host/Operating System, and Kubernetes Pod (i.e. cgroup v1) specific, data

  2. Do so entirely via the SQL interface to PostgreSQL itself

    There are various existing methods to achieve these goals, but we decided to create a purpose-built PostgreSQL extension to facilitate the gathering of these metrics. We will not debate the wisdom of that decision here, although personally I think it has worked out extremely well.

    The extension is called pgnodemx. I also had additional goals for the extension to provide support for:

  3. Non-containerized PostgreSQL clusters

  4. Hosts running cgroup v2

The purpose of today's blog is to discuss this new extension in some detail.

Anatomy of the Solution

Any monitoring and alerting solution includes several components and this one is no different. In our case, pgnodemx enables the desired metrics to be collected via standard SQL. The pgmonitor package provides queries which are utilized by the embedded postgres_exporter to run the SQL functions installed by pgnodemx. It also handles setting up Prometheus and Grafana, which are used to store and present the metrics, respectively. Prometheus also provides an alerting capability. The Crunchy PostgreSQL Operator orchestrates the deployment of containers with these components and glues it all together.

pgnodemx

As mentioned above, pgnodemx is a PostgreSQL extension. It adds a variety of new functions to PostgreSQL which will be used to grab the desired data from our PostgreSQL cluster via standard SQL statements.

Installation and configuration

First things first: in order to install and configure pgnodemx in your own PostgreSQL instance, follow the instructions in the README. However, since the main focus of this blog is the containerized solution, the most likely way to deploy is using Crunchy Operator. The new version 4.5 of Operator and Containers will be released "Real Soon Now", and they will install and configure the extension and pgMonitor for you.

Provided Functionality

The functions provided by pgnodemx currently fall into the following categories:

  • cgroup
  • environment
  • /proc
  • Kubernetes Downward API
  • general

Now we will cover those categories one at a time.

These functions broadly fall into two subcategories. The first type are functions that read and parse the various cgroup file system virtual files and return the results in a naturally mapped form. The README calls these "General Access Functions". The second type of function returns information about the detected cgroup context.

Some examples of cgroup Related Functions:

userdb-# \x auto
Expanded display is used automatically.
userdb=# SELECT cgroup_mode();
 cgroup_mode
-------------
 legacy
(1 row)

userdb=# SELECT * FROM cgroup_path();
    controller    |              path
------------------+---------------------------------
 perf_event       | /sys/fs/cgroup/perf_event
 pids             | /sys/fs/cgroup/pids
 hugetlb          | /sys/fs/cgroup/hugetlb
 rdma             | /sys/fs/cgroup/rdma
 devices          | /sys/fs/cgroup/devices
 freezer          | /sys/fs/cgroup/freezer
 net_cls,net_prio | /sys/fs/cgroup/net_cls,net_prio
 cpu,cpuacct      | /sys/fs/cgroup/cpu,cpuacct
 blkio            | /sys/fs/cgroup/blkio
 memory           | /sys/fs/cgroup/memory
 cpuset           | /sys/fs/cgroup/cpuset
 systemd          | /sys/fs/cgroup/systemd
 cgroup           | /sys/fs/cgroup/memory
(13 rows)

userdb=# SELECT cgroup_process_count();
 cgroup_process_count
----------------------
                   18
(1 row)

userdb=# SELECT current_setting('pgnodemx.containerized');
 current_setting
-----------------
 on
(1 row)

userdb=# SELECT current_setting('pgnodemx.cgroup_enabled');
 current_setting
-----------------
 on
(1 row)

From the output you can see that the PostgreSQL cluster is running on a host which is in legacy mode. This implies that PostgreSQL is running under cgroup v1, which until very recently was always the case when running containerized under Kubernetes. The alternatives are unified and hybrid. The former is what you will get if running on a cgroup v2 host, e.g. with a modern Linux kernel under systemd and with appropriate configuration. The mode hybrid indicates that the host is configured for both cgroup v1 and v2 -- this mode is (at least currently) explicitly not supported by pgnodemx.

Although this blog is primarily focused on legacy Kubernetes installations, please allow a small sidebar discussion. If you did want to explore the use of pgnodemx on a host in unified mode under systemd, you probably need to add the following to your Linux kernel command line: systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all. Additionally you will want to override the installed PostgreSQL systemd service file, and add accounting (and possibly resource limits) for the controllers of interest, for example:

CPUAccounting=yes
MemoryAccounting=yes
TasksAccounting=yes
IOAccounting=yes
IPAccounting=yes

Anyway, returning to the code snippet above, you can see that cgroup_path() is a set-returning function which returns the resolved paths to all of the cgroup controllers. The process count for the cgroup, which in Kubernetes-land corresponds to the Pod in which PostgreSQL is running, is given by the scalar function unsurprisingly named cgroup_process_count().

Finally, the built-in PostgreSQL function current_setting() is used to inquire about the values of two key configuration parameters controlling behavior of the extension. These parameters have generally sensible defaults, and the extension attempts to override them as required by the host context at PostgreSQL startup (pgnodemx must be loaded via shared_preload_libraries), but if you need you can set these values in postgresql.conf yourself. Note that pgnodemx.cgroup_enabled = off can be used, and indeed is set by pgnodemx when required, in order to disable cgroup support entirely.

Now let's look at some of the "General Access Functions". This set of examples shows how it is possible to read the current memory usage, the given memory limit, and the relevant CPU limits for the cgroup (Kubernetes Pod):

userdb=# SELECT cgroup_scalar_bigint('memory.usage_in_bytes');
 cgroup_scalar_bigint
----------------------
            241020928
(1 row)

userdb=# SELECT cgroup_scalar_float8('memory.usage_in_bytes');
 cgroup_scalar_float8
----------------------
            240971776
(1 row)

userdb=# SELECT cgroup_scalar_text('memory.usage_in_bytes');
 cgroup_scalar_text
--------------------
 240971776
(1 row)

userdb=# SELECT cgroup_scalar_bigint('memory.limit_in_bytes');
 cgroup_scalar_bigint
----------------------
            536870912
(1 row)

userdb=# SELECT cgroup_scalar_bigint('cpu.cfs_period_us');
 cgroup_scalar_bigint
----------------------
               100000
(1 row)

userdb=# SELECT cgroup_scalar_bigint('cpu.cfs_quota_us');
 cgroup_scalar_bigint
----------------------
                10000
(1 row)

In the snippet above, general access functions are shown that are designed to read cgroup virtual files that return a single row of a single scalar value. There are BIGINT, TEXT, and FLOAT8 variants of the scalar access type function. These functions will do their best to coerce the characters read from the virtual file into the matching data type. If the characters cannot be coerced, and ERROR will be returned.

Next let's see some expected failure cases:

userdb=# -- should return NULL
userdb=# SELECT cgroup_scalar_bigint(null);
 cgroup_scalar_bigint
----------------------

(1 row)

userdb=# -- should fail
userdb=# SELECT cgroup_scalar_bigint('bar/../../etc/memory.usage_in_bytes');
ERROR:  reference to parent directory ("..") not allowed
userdb=# -- should fail
userdb=# SELECT cgroup_scalar_bigint('/memory.usage_in_bytes');
ERROR:  reference to absolute path not allowed
userdb=# CREATE USER pgnodemx_joe;
CREATE ROLE
userdb=# SET SESSION AUTHORIZATION pgnodemx_joe;
SET
userdb=> -- should fail
userdb=> SELECT cgroup_scalar_bigint('memory.usage_in_bytes');
ERROR:  must be member of pg_monitor role
userdb=> RESET SESSION AUTHORIZATION;
RESET
userdb=# DROP USER pgnodemx_joe;
DROP ROLE

Here we see that bad arguments will result in either NULL (for NULL input) or an ERROR. Generally speaking the cgroup general access functions all take a virtual file name as an argument. The filename must not be NULL, and must not be absolute or parent referencing relative paths. Also, the invoking user must be a member of the built-in pg_monitor role (or an administrator created pgmonitor for PostgreSQL version 9.6 or earlier).

There are several other types of general access functions. Rather than try to show them all here, please see the README and try out the extension for yourself. However at the risk of providing too much detail, one more example might be interesting:

userdb=# SELECT * FROM cgroup_setof_kv('cpu.stat');
      key       |      val
----------------+---------------
 nr_periods     |        384767
 nr_throttled   |         36711
 throttled_time | 2599819133882
(3 rows)

userdb=# SELECT * FROM cgroup_setof_kv('memory.stat');
            key            |    val
---------------------------+-----------
 cache                     | 168390656
 rss                       |  61177856
 rss_huge                  |         0
 shmem                     |  15138816
 mapped_file               |  15138816
 dirty                     |    405504
 writeback                 |         0
 swap                      |         0
 pgpgin                    |  19621536
 pgpgout                   |  19565471
 pgfault                   |  40182615
 pgmajfault                |        66
 inactive_anon             |  15142912
 active_anon               |  61280256
 inactive_file             |  18243584
 active_file               | 135045120
 unevictable               |         0
 hierarchical_memory_limit | 536870912
 hierarchical_memsw_limit  | 536870912
 total_cache               | 168390656
 total_rss                 |  61177856
 total_rss_huge            |         0
 total_shmem               |  15138816
 total_mapped_file         |  15138816
 total_dirty               |    405504
 total_writeback           |         0
 total_swap                |         0
 total_pgpgin              |  19621536
 total_pgpgout             |  19565471
 total_pgfault             |  40182615
 total_pgmajfault          |        66
 total_inactive_anon       |  15142912
 total_active_anon         |  61280256
 total_inactive_file       |  18243584
 total_active_file         | 135045120
 total_unevictable         |         0
(36 rows)

userdb=#
userdb=# SELECT * FROM cgroup_setof_ksv('blkio.throttle.io_serviced');
 key  | subkey |  val
------+--------+-------
 8:16 | Read   |   171
 8:16 | Write  | 33077
 8:16 | Sync   | 19608
 8:16 | Async  | 13640
 8:16 | Total  | 33248
 8:0  | Read   |   124
 8:0  | Write  |    34
 8:0  | Sync   |   152
 8:0  | Async  |     6
 8:0  | Total  |   158
 all  | Total  | 33406
(11 rows)

In this example there are two types of access functions: cgroup_setof_kv() and cgroup_setof_ksv(). These are appropriate when the cgroup virtual file contains multiple rows in key-value pairs or key-subkey-value triplets respectively. The snippet shows how it is possible to obtain fairly detailed information regarding CPU, memory, and I/O usage of the cgroup/Pod. Neat stuff if I do say so myself!

Let's skip over the other function types for the moment and touch on the Kubernetes Downward API related functions, because they are similar in some ways to the cgroup related functions. This API exposes Pod and Container information through both environment variables as well as "Volume Files". We'll discuss a set of generic functions for reading environment variables in the next section. In pgnodemx the Kubernetes Downward API support functions are specifically targeted at the "Volume Files". See the documentation linked above for details with respect to how to create those files and what information may be exposed. At least one interesting bit of info that is difficult to obtain otherwise is the Kubernetes "request" values for things like CPU and memory. Since these are not mapped directly to cgroup virtual files anyway (at least as far as I have been able to determine), this seems to be the only way to get that information via introspection from PostgreSQL.

Enough talk, here are several code examples:

userdb=# SELECT * FROM kdapi_setof_kv('labels');
         key          |                 val
----------------------+--------------------------------------
 archive-timeout      | 60
 crunchy-pgha-scope   | test1
 crunchy_collect      | false
 deployment-name      | test1
 name                 | test1
 pg-cluster           | test1
 pg-cluster-id        | 9ecac2f7-7fbc-4469-acbc-ee3deaea4d39
 pg-pod-anti-affinity |
 pgo-pg-database      | true
 pgo-version          | 4.2.2
 pgouser              | admin
 pod-template-hash    | 577d8fcdb8
 role                 | master
 service-name         | test1
 vendor               | crunchydata
 workflowid           | 7dd34f9e-c8c3-49e8-9f33-05848147d275
(16 rows)

-- edited for width --
userdb=# SELECT * FROM kdapi_setof_kv('annotations');
-[ RECORD 1 ]---------
key | kubernetes.io/config.seen
val | 2020-07-25T18:07:13.14360097Z
-[ RECORD 2 ]---------
key | kubernetes.io/config.source
val | api
-[ RECORD 3 ]---------
key | status
val | {\"conn_url\":\"postgres://10.28.1.79:5432/postgres\",
       \"api_url\":\"http://10.28.1.79:8009/patroni\",
       \"state\":\"running\",
       \"role\":\"master\",
       \"version\":\"1.6.4\",
       \"xlog_location\":1124074208,
       \"timeline\":15}

-- edited for width --
userdb=# SELECT replace(val,'\"','"')::jsonb
         FROM kdapi_setof_kv('annotations')
         WHERE key = 'status';
     replace
----------------------------------------------------
 {"role": "master",
  "state": "running",
  "api_url": "http://10.28.1.79:8009/patroni",
  "version": "1.6.4",
  "conn_url": "postgres://10.28.1.79:5432/postgres",
  "timeline": 15,
  "xlog_location": 1124074208}
(1 row)

userdb=# SELECT * FROM kdapi_scalar_bigint('cpu_limit');
 kdapi_scalar_bigint
---------------------
                   1
(1 row)

userdb=# SELECT * FROM kdapi_scalar_bigint('cpu_request');
 kdapi_scalar_bigint
---------------------
                   1
(1 row)

userdb=# SELECT * FROM kdapi_scalar_bigint('mem_limit');
 kdapi_scalar_bigint
---------------------
           536870912
(1 row)

userdb=# SELECT * FROM kdapi_scalar_bigint('mem_request');
 kdapi_scalar_bigint
---------------------
           536870912
(1 row)

Similar to the cgroup facility, pgnodemx.kdapi_enabled = off can be used, and indeed is set by pgnodemx when required, in order to disable Kubernetes Downward API support entirely.

In the above code the use of set-returning function kdapi_setof_kv() and scalar function kdapi_scalar_bigint() are illustrated. The memory and CPU request values are easily obtained via this facility. There is also an example of reconstituting the status annotation from Kubernetes as a PostgreSQL jsonb value.

Environment, /proc, and "Other" Functions

As alluded to above, there are two provided functions for reading scalar environment variables and coercing them to an appropriate PostgreSQL data type: pgnodemx_envvar_text() and pgnodemx_envvar_bigint(). Their use is pretty straightforward and can be seen in the README.

There are also several functions that read and parse specific /proc virtual files and return the information contained in a suitably mapped form. Their use is also shown in the documentation, so to keep this blog from becoming too obnoxiously long we'll ask you to "read the fine manual". However I would like to leave you with one more interesting query that combines two of the /proc functions with a general system information function, fsinfo(). This last function does not read from any virtual file but instead gets information directly from kernel syscalls. On second thought we will also look at one of the /proc functions. Without further ado:

userdb=# SELECT interface,
                rx_bytes, rx_packets,
                tx_bytes, tx_packets
         FROM proc_network_stats();
 interface | rx_bytes | rx_packets | tx_bytes | tx_packets
-----------+----------+------------+----------+------------
 lo        | 27307979 |     235130 | 27307979 |     235130
 eth0      | 13733198 |      37423 | 14687572 |      37565
(2 rows)

userdb=# SELECT *
         FROM proc_mountinfo() m
         JOIN proc_diskstats() d
           USING (major_number, minor_number)
         JOIN fsinfo(current_setting('data_directory')) f
           USING (major_number, minor_number);
-[ RECORD 1 ]--------------------+----------------
major_number                     | 8
minor_number                     | 16
mount_id                         | 2399
parent_id                        | 2327
root                             | /
mount_point                      | /pgdata
mount_options                    | rw,relatime
fs_type                          | ext4
mount_source                     | /dev/sdb
super_options                    | rw,data=ordered
device_name                      | sdb
reads_completed_successfully     | 2382
reads_merged                     | 258
sectors_read                     | 491518
time_spent_reading_ms            | 19976
writes_completed                 | 1268281
writes_merged                    | 953013
sectors_written                  | 29847376
time_spent_writing_ms            | 17189872
ios_currently_in_progress        | 0
time_spent_doing_ios_ms          | 4577330
weighted_time_spent_doing_ios_ms | 12428733
type                             | ext2
block_size                       | 4096
blocks                           | 249830
total_bytes                      | 1023303680
free_blocks                      | 175372
free_bytes                       | 718323712
available_blocks                 | 171276
available_bytes                  | 701546496
total_file_nodes                 | 65536
free_file_nodes                  | 64156
mount_flags                      | relatime

The first query here gives us network I/O on the available interfaces, while the second gives fairly comprehensive information about the file system on which our PostgreSQL data directory is mounted.

Other Monitoring Solution Components

As already mentioned, the complete solution involves a postgres_exporter to run the queries of interest, Prometheus to store the data and generate alerts, Grafana to do some final tweaking of the data (in particular, generate deltas) and produce the beautiful displays, and pgmonitor to glue it all together. And of course in Kubernetes-land the Crunchy Container Suite which includes all these components and the Postgres Operator to deploy it all.

The actual queries for the inaugural version of this can be seen in the repository for the pgmonitor project. And since no blog would be complete without at least one pretty picture, here is an example captured from the new Grafana dashboards by my colleague Jonathan Katz:

Pod metrics from pgnodemx

Conclusion

pgnodemx is a brand new PostgreSQL extension which enables you to grab all kinds of interesting host-node-level metrics which you can then trend, observe, and alert on when necessary. There is undoubtedly much as yet unmapped host data, particularly in the /proc virtual files system, but likely elsewhere as well.

I'd love to get feedback for an expanded set of real world requirements for that-which-is-yet-unmapped, so please try out pgnodemx and let us know how it works for you!