Disk IOPS (Input/Output Operations Per Second) is a key metric for measuring the performance of any disk system. It represents the number of read and write operations that can be performed per second. For PostgreSQL, which relies heavily on disk access, understanding and optimizing disk IOPS is crucial for achieving optimal performance. Today I want to go through the key topics involved with IOPS: what it is, how it impacts Postgres, how to measure it, and what to change for better performance.
At a high level, an I/O operation is a request to either read data ("Input") from or write data ("Output") to a disk, typically measured in operations per second.
You might see WOPS, write operations per second, or ROPS, read operations per second. In general when talking about IOPS we mean the sum of both read and write operations on a particular disk volume. This is a low level operation handled by the operating system; an application (including PostgreSQL) doesn’t have to worry about how much data can be read or written in a single operation, or even what kind of disk is involved. In fact, the operating system itself is usually dealing with an abstraction as far as the disk is concerned - it sees an attached block device which processes requests to read or write data, and doesn’t have to worry about how that’s achieved.
If you remember our post on Data Flow, you know that a lot of data is stored in memory and some requests are going to the disk. This is the “Hardware” layer in the diagram from that article, and any data which crosses that layer means that disk operations (using IOPS) are taking place.
When you query the database for some data, the server has two options:
- Return the data from PostgreSQL’s internal cache, a section of memory set aside for shared_buffers.
- If the data isn’t in the cache, then ask the operating system to read the data from disk.
When reading data in from disk, the operating system is responsible for handling the read request and returning data to the requesting process. All modern operating systems - including all those supported by PostgreSQL - will attempt to use system memory to cache disk data in order to speed up these requests from the application’s perspective. This means that disk I/O is a much bigger factor if your working set is larger than RAM.
Input and output happens when we need to read or write from disk. You might think you can get away from needing to worry about IOPS if your entire database runs in memory, but that’s not quite the case. There are several Postgres operations that tend to use I/O which are key to point out here. These include:
- Checkpoints: outstanding changes to relation files are written to disk
- Writing WAL and related transaction control files
- Reading data into the buffer cache
- Creating or refreshing materialized views
- Manual vacuum or autovacuum: reads and potentially modifies data
- Index creation
- Temporary files generated by queries
- Prior to PG 15, database statistics
Your disk has an IOPS capacity determined by the underlying hardware. There is a finite number of IOPS your system can handle, a limit set by the hardware and the operating system configuration.
Many cloud-based systems allow bursting of IOPS so that you can go beyond your base I/O allowance during certain times of day or heavy workloads. Typically, burstable systems let you accrue credits over a day or week; when your system needs to go beyond the base I/O, you can use more until you run through the burst capacity you had built up.
Burstable I/O can allow you to provision IOPS capacity based on your typical usage rather than your peak usage, and yet still have the capacity to burst when spikes in activity happen. This can give you better value - in some cases allowing customers to provision smaller instances on a monthly basis and achieve cost savings - but comes with a significant downside. If you don’t carefully monitor your IOPS and burst quota usage, then you may exhaust your burst capacity at which point performance will be limited to some baseline. This only ever happens while you are already bursting, so the performance impact tends to be large and can cause outages.
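To make the credit model concrete, here is a small simulation of how burst accounting typically behaves. The function name, rates, and accounting rules are illustrative assumptions, not any particular provider's actual algorithm:

```python
# Hypothetical sketch of burst-credit accounting on a cloud volume.
# Credits accrue when demand is below baseline and are spent to serve
# demand above it, up to a burst ceiling.

def simulate_burst(demand_per_sec, baseline_iops, burst_iops, starting_credits):
    """Return the IOPS actually delivered for each second of demand."""
    credits = starting_credits
    delivered = []
    for demand in demand_per_sec:
        if demand <= baseline_iops:
            # Under baseline: serve everything, bank the unused capacity.
            credits += baseline_iops - demand
            delivered.append(demand)
        else:
            # Over baseline: spend credits to burst, capped at the ceiling.
            extra_wanted = min(demand, burst_iops) - baseline_iops
            extra = min(extra_wanted, credits)
            credits -= extra
            delivered.append(baseline_iops + extra)
    return delivered

# Quiet periods bank credits; the spike at the end bursts until the
# credits run out, then throughput collapses to the baseline.
workload = [100] * 5 + [3000] * 5
print(simulate_burst(workload, baseline_iops=500, burst_iops=3000,
                     starting_credits=1000))
```

Note the shape of the failure mode: the drop from full burst to baseline happens exactly while the system is under its heaviest load, which is why exhausting burst capacity tends to cause outages rather than gradual slowdowns.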
Even when you use disks that don’t have burst IOPS and instead provide consistent, guaranteed performance, some instance types on various cloud providers have other I/O burst capabilities or caching which can affect performance for all disk I/O. Used well, these features can provide great value, but the same caveats apply - know what limits apply to your IOPS usage, and monitor to see whether you’re approaching those limits.
IOPS can be a measure of how busy your system is, but as you approach your system's limits, requests take longer to complete or even begin to queue up - this is called I/O wait. Queries become slower and end users experience latency.
Being I/O bound means that the performance of a system is limited by I/O capacity. Different application workloads have different query patterns and performance constraints, so it’s possible for your database to be CPU bound or memory bound instead. It’s important to understand which of your system resources is limiting performance, so that you don’t spend time and money upgrading to a server with more CPU or RAM when the issue was disk I/O performance all along, for example.
One of the best indicators of whether or not a system is I/O bound is when you see I/O Wait showing up in your system’s CPU metrics. I/O Wait time (often written as iowait) is the amount of time that a CPU was idle while there were pending I/O requests - that is, there was CPU capacity available for currently running processes, but those processes were waiting on a response from a disk I/O request. If this is happening frequently, it’s an indication that the disk subsystem can’t keep up with requests and so the CPU is sitting idle when it could have been doing work.
Usually you will monitor I/O wait as part of your CPU metrics using some monitoring system external to the database. However, on Linux-based Postgres systems you can also use the pg_proctab extension to access, from within the database, various statistics exposed by the kernel under the /proc virtual filesystem. Note that this extension isn’t included in the default PostgreSQL distribution, so you may have to install it separately. Using the pg_cputime() function provided by this extension, you can find I/O wait in hundredths of a second (usually - run getconf CLK_TCK from a shell on your server to check the exact resolution). To get a point-in-time value for the percentage of time the system is spending in I/O wait, you can run:
SELECT to_char(
         iowait / (idle + "user" + system + iowait)::float * 100,
         '90.99%'
       ) AS iowait_pct
FROM pg_cputime();
This will give you a percentage figure something like this:
 iowait_pct
------------
  0.07%
(1 row)
Very small numbers are normal here, unless the system is under heavy load doing some sort of I/O intensive task such as running a backup or importing new data. If you’re regularly seeing I/O wait taking up even a single digit percentage of overall system time, it’s potentially an indicator that you’re exceeding the I/O capacity of your system.
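If pg_proctab isn’t available, the same calculation can be done from the aggregate "cpu" line of /proc/stat, which is where these counters come from on Linux. Here is a minimal sketch; the sample line is made up, and the parsing assumes the usual Linux field order (user, nice, system, idle, iowait, ...):

```python
# Compute the iowait percentage from a /proc/stat "cpu" line.
# Fields are cumulative clock ticks since boot.

def iowait_pct(proc_stat_cpu_line):
    fields = proc_stat_cpu_line.split()
    # fields[0] is the literal "cpu"; the next five are
    # user, nice, system, idle, iowait.
    user, nice, system, idle, iowait = (int(v) for v in fields[1:6])
    return iowait / (user + nice + system + idle + iowait) * 100

# Illustrative sample; on a real system read the first line of /proc/stat.
sample = "cpu 10132153 290696 3084719 46828483 16683 0 25195 0 175628 0"
print(f"{iowait_pct(sample):.2f}%")
```

Since these counters are cumulative, tools like iostat and vmstat sample them twice and report the difference, which gives a rate over an interval rather than an average since boot.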
PostgreSQL itself doesn’t currently have a ton of access to I/O statistics and in general this is something that needs to be tracked at the OS level. Many popular server level monitoring tools like DataDog, Prometheus/Grafana, etc, will allow you to keep an eye on these. Our own Crunchy Bridge has several monitoring tools available for this as well.
The PostgreSQL configuration setting track_io_timing controls whether the server collects metrics on I/O performance. These measure requests from Postgres to the OS, which is slightly different from actual disk I/O. track_io_timing is especially useful in combination with the BUFFERS option of the EXPLAIN command, so that you can see how much time was spent on disk I/O when a query is executed. This data is valuable for any performance tuning; however, collection is disabled by default because some system configurations have a high overhead for timing calls, meaning that collecting this data can negatively impact performance.
You can see whether track_io_timing is enabled on your database with the query:
SHOW track_io_timing;

 track_io_timing
-----------------
 on
(1 row)
If this isn’t enabled, then before enabling it you should use the pg_test_timing utility to check the performance impact on your system, and follow the advice in the documentation if you need to make changes to your system configuration.
Once you enable track_io_timing, I/O data gets written to:
- Explain plan buffers
If you have enabled track_io_timing as referenced above, you can see which parts of the query were read from buffers (the database’s own block of memory) and which required I/O to read data from disk. If you see a reference to ‘I/O Timings’ in one of these plans, something is using I/O to execute the query.
Here’s an example of a query using tons of I/O:
EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(id) FROM pages;
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=369672.42..369672.43 rows=1 width=8) (actual time=6041.280..6044.729 rows=1 loops=1)
   Buffers: shared hit=12855 read=326149 dirtied=580
   I/O Timings: shared/local read=15953.695
   ->  Gather  (cost=369672.21..369672.42 rows=2 width=8) (actual time=6040.119..6044.696 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=12855 read=326149 dirtied=580
         I/O Timings: shared/local read=15953.695
         ->  Partial Aggregate  (cost=368672.21..368672.22 rows=1 width=8) (actual time=6019.362..6019.364 rows=1 loops=3)
               Buffers: shared hit=12855 read=326149 dirtied=580
               I/O Timings: shared/local read=15953.695
               ->  Parallel Seq Scan on pages  (cost=0.00..362738.57 rows=2373457 width=71) (actual time=2.644..5770.110 rows=1878348 loops=3)
                     Buffers: shared hit=12855 read=326149 dirtied=580
                     I/O Timings: shared/local read=15953.695
 Planning:
   Buffers: shared hit=30 dirtied=1
 Planning Time: 0.216 ms
 JIT:
   Functions: 11
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 1.166 ms, Inlining 0.000 ms, Optimization 0.669 ms, Emission 19.474 ms, Total 21.309 ms
 Execution Time: 6067.862 ms
Here’s an EXPLAIN output on a different query, one where all the data is coming from shared buffers:
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=746.64..746.65 rows=1 width=8) (actual time=5.224..5.225 rows=1 loops=1)
   Buffers: shared hit=508
   ->  Seq Scan on nyc_streets  (cost=0.00..698.91 rows=19091 width=11) (actual time=0.003..1.428 rows=19091 loops=1)
         Buffers: shared hit=508
 Planning:
   Buffers: shared hit=72
 Planning Time: 0.238 ms
 Execution Time: 5.308 ms
(8 rows)
track_io_timing will also start collecting statistics across several views, including pg_stat_database and the pg_statio family of views such as pg_statio_user_tables. This data shows blocks read (I/O was used) and blocks hit (the data was already in shared buffers), and it is continually updated. In general you’ll be looking for user tables where the blocks read are high compared to the blocks hit.
SELECT * FROM pg_statio_user_tables;
 relid  |     schemaname     |       relname        | heap_blks_read | heap_blks_hit | idx_blks_read | idx_blks_hit | toast_blks_read | toast_blks_hit | tidx_blks_read | tidx_blks_hit
--------+--------------------+----------------------+----------------+---------------+---------------+--------------+-----------------+----------------+----------------+---------------
  16716 | segment_production | tracks               |          50209 |       5295312 |          1380 |        67935 |               4 |            313 |              5 |           319
  16836 | segment_production | access_token_created |          25354 |        489153 |            66 |        31543 |               0 |              0 |              0 |             0
  16590 | production         | access_token_created |           2765 |         63595 |             2 |          318 |               0 |              0 |              0 |             0
  16626 | production         | api_key_created      |              4 |           136 |             2 |          318 |               0 |              0 |              0 |             0
It can be helpful to convert these statistics into bytes instead of block units, especially when the statistics are going into full stack analysis tools. While there is a variable block size setting that applies to some statistics, most of PostgreSQL’s buffer cache numbers (including EXPLAIN BUFFERS) will be based on the database’s fixed page size of 8192 bytes.
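Putting these pieces together, a cache hit-rate query over pg_statio_user_tables might look like the following sketch (the 8192 multiplier assumes the default page size, and the exact columns and ordering are up to you):

```sql
-- Tables doing the most heap reads from disk, with a hit percentage.
SELECT relname,
       heap_blks_read,
       heap_blks_hit,
       heap_blks_read * 8192 AS heap_bytes_read,   -- blocks are 8kB pages
       round(heap_blks_hit::numeric
             / nullif(heap_blks_hit + heap_blks_read, 0) * 100, 2) AS hit_pct
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 10;
```

The nullif guard avoids a division-by-zero error for tables that have never been touched since the statistics were last reset.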
Even with track_io_timing enabled, PostgreSQL hasn’t historically been able to provide a cohesive system-wide view of database server I/O activity. An exciting update included in Postgres 16, released just a few weeks ago as of this post, helps with just this need: a new system view named pg_stat_io provides a per-cluster view of disk I/O. There have been some really great contributions in this area to provide more complete I/O statistics from inside Postgres. See Lukas Fittl’s blog for a deeper dive.
Like most system views, these statistics are cumulative and record all I/O activity since the last time statistics were reset on this server. This will look like:
SELECT * FROM pg_stat_io WHERE reads > 0 OR writes > 0;
    backend_type    |  object  | context  | reads | read_time | writes | write_time | writebacks | writeback_time | extends | extend_time | op_bytes | hits  | evictions | reuses | fsyncs | fsync_time |          stats_reset
--------------------+----------+----------+-------+-----------+--------+------------+------------+----------------+---------+-------------+----------+-------+-----------+--------+--------+------------+-------------------------------
 autovacuum worker  | relation | normal   |    29 |         0 |      0 |          0 |          0 |              0 |      14 |           0 |     8192 | 10468 |         0 |        |      0 |          0 | 2023-09-06 14:32:36.930008-05
 autovacuum worker  | relation | vacuum   |    13 |         0 |      0 |          0 |          0 |              0 |       0 |           0 |     8192 |   379 |         0 |      0 |        |            | 2023-09-06 14:32:36.930008-05
 client backend     | relation | bulkread |   926 |         0 |      0 |          0 |          0 |              0 |         |             |     8192 |    14 |         0 |    137 |        |            | 2023-09-06 14:32:36.930008-05
 client backend     | relation | normal   |   105 |         0 |      0 |          0 |          0 |              0 |       3 |           0 |     8192 |  7110 |         0 |        |      0 |          0 | 2023-09-06 14:32:36.930008-05
 checkpointer       | relation | normal   |       |           |   1031 |          0 |          0 |              0 |         |             |     8192 |       |           |        |    320 |          0 | 2023-09-06 14:32:36.930008-05
 standalone backend | relation | normal   |   535 |         0 |   1019 |          0 |          0 |              0 |     673 |           0 |     8192 | 88526 |         0 |        |      0 |          0 | 2023-09-06 14:32:36.930008-05
 standalone backend | relation | vacuum   |    10 |         0 |      0 |          0 |          0 |              0 |       0 |           0 |     8192 |   918 |         0 |      0 |        |            | 2023-09-06 14:32:36.930008-05
Note that while the numbers in the reads and writes columns in this view do correspond to individual I/O operations issued by PostgreSQL, these figures may not match the values recorded by the storage system if you have separate metrics for those. The operating system and even the storage layer may combine or split I/O requests, so the actual number recorded can differ depending on where you look. For this reason it’s important to compare numbers from the same source when tuning or looking at changes in activity over time.
Another really cool thing about the pg_stat_io view is that it shows the ‘context’ of the activity: pg_stat_io breaks down I/O usage into categories such as bulk reads, bulk writes, vacuum, and normal backend activity. This is especially useful if you’re trying to figure out where your I/O spikes are coming from - for example, large reads or maybe even a vacuum process.
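For example, a quick way to see which backend types and contexts are generating I/O is to aggregate over the view - a sketch, assuming PostgreSQL 16 or later:

```sql
-- Summarize pg_stat_io by backend type and context to see where
-- read and write operations are coming from.
SELECT backend_type,
       context,
       sum(reads)   AS reads,
       sum(writes)  AS writes,
       sum(extends) AS extends
FROM pg_stat_io
GROUP BY backend_type, context
HAVING sum(reads) > 0 OR sum(writes) > 0
ORDER BY reads DESC NULLS LAST;
```

If ‘vacuum’ or ‘bulkread’ contexts dominate, that points you toward autovacuum tuning or large sequential scans rather than normal query traffic.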
pg_stat_io also leaves the door open for a self-starter to build out internal I/O tracking and store that over time in your own database.
The pg_stat_statements module is an optional extra that’s included with PostgreSQL, loaded via shared_preload_libraries and then added with the CREATE EXTENSION command. When this is active, statistics are tracked for all queries on the server and made available via the view named pg_stat_statements. We’re big fans of this tool here at Crunchy Bridge - see our blog post “Query optimization in Postgres with pg_stat_statements” for a deeper dive. To find the queries spending the most time on I/O, you can run:
SELECT interval '1 millisecond' * total_exec_time AS "Total Exec. Time",
       to_char(
         (total_exec_time / sum(total_exec_time) OVER ()) * 100,
         'FM90D0'
       ) || '%' AS "Proportional Exec. Time",
       to_char(calls, 'FM999G999G999G990') AS "Calls",
       interval '1 millisecond' * (blk_read_time + blk_write_time) AS "Time Spent on IO",
       CASE WHEN length(query) <= 40
            THEN query
            ELSE substr(query, 0, 39)
       END AS "Query",
       queryid AS "Query ID"
FROM pg_stat_statements
ORDER BY 4 DESC
LIMIT 10;
Values shown in PostgreSQL statistics views are the totals since the last time statistics were reset. If the server has been running for a long time, it can be hard to make sense of these values. You may need to reset statistics when beginning an investigation into server performance, or after significant changes. To reset the statistics for the current database, run:

SELECT pg_stat_reset();
If you’re using the pg_stat_statements module described above, its statistics can be reset by running:

SELECT pg_stat_statements_reset();
When looking at options for fixing a problem with high disk I/O, there are a few approaches to consider:
High disk I/O can often be an indicator of poorly tuned queries. Poorly tuned queries can cause unnecessary disk I/O in several ways:
- Sequential table scans can cause a lot of disk reads if the table isn’t cached in memory. This is especially bad for tables which are larger than memory! Often the scans can be avoided by adding an appropriate index.
- Frequent creation of temporary files can be a cause of unnecessary disk I/O. Temp files are used when queries require more data than will fit in the backend’s allowed work_mem. Use the configuration parameter log_temp_files to log temporary files larger than a specified size; for example, setting this to 10240 will log all temp files larger than 10MB. These aren’t logged by default, but with this information in your logs you can investigate and tune the queries which are creating them, or increase work_mem so these operations use memory instead of disk.
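As a sketch, the relevant postgresql.conf settings might look like this (the values are illustrative starting points, not recommendations):

```
log_temp_files = '10MB'   # log every temp file larger than 10MB (default -1 disables logging)
work_mem = '64MB'         # per-operation sort/hash memory before spilling to disk
```

Remember that work_mem is a per-operation limit, so a single complex query - or many concurrent ones - can use several multiples of it; raise it cautiously.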
Another important consideration for IOPS and performance improvement is data design. Proper data design can dramatically reduce the amount of data which has to be written out to the table and to the WAL. A great example of this is a "user" table which includes a "last login" timestamp alongside several other data fields that don’t change. With a setup like that, every UPDATE to the "last login" timestamp copies all of the other data (first name, last name, username, user id, first login, etc.) on the same row, and that data all gets written into the WAL too, even though none of it is changing. It can be much better to have another table tracking logins or sessions or other per-login information that naturally goes along with "last login", so that you minimize the amount of data being written as part of any transaction.
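As an illustration of that split (all table and column names here are hypothetical), the frequently updated login data can live in its own narrow table:

```sql
-- Wide, rarely changing user data stays in one table...
CREATE TABLE users (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    username   text NOT NULL,
    first_name text,
    last_name  text
);

-- ...while each login is a small insert into a narrow table,
-- so no wide row (or its WAL image) is rewritten per login.
CREATE TABLE user_logins (
    user_id   bigint NOT NULL REFERENCES users (id),
    logged_in timestamptz NOT NULL DEFAULT now()
);

-- Recording a login and finding the last one:
-- INSERT INTO user_logins (user_id) VALUES (42);
-- SELECT max(logged_in) FROM user_logins WHERE user_id = 42;
```

An index on user_logins (user_id, logged_in) would keep the "last login" lookup cheap as the table grows.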
Adding more RAM can help because it increases the amount of memory available for the disk cache and shared buffers. This won’t help if your application is write-heavy, but if most queries that the application issues are read queries (as is very often the case), then a larger disk cache can reduce disk I/O. If you do add more RAM, don’t forget to re-tune shared_buffers and the other memory settings.
In a similar vein to increasing RAM, spinning up replicas and offloading read queries to them makes more RAM, CPU, and IOPS available to the overall system.
Upgrading your storage to increase IOPS capacity (rather than storage size) will improve performance and reduce I/O wait. For disk volumes on cloud providers there are often options to increase the number of IOPS allocated to a disk, or to choose a higher performance volume type. Some types of storage allocate IOPS based on the size of a volume, so sometimes increasing the size of a disk will increase the available IOPS as well. Check your storage provider’s documentation for further details.
If you’re not already using the latest PostgreSQL version, consider upgrading. Every version comes with performance improvements.
After reading the article you should have a better understanding of why disk performance matters to PostgreSQL, how to see if your disk I/O is causing problems, and where to start if you’re having issues.
For urgent performance issues, I tend to recommend upgrading first and tuning once you have time after everything is running smoothly again. However with experience reading EXPLAIN plans you’ll find that a quick look at the most recent slowest queries might point to a quick fix - for example, perhaps an application introduced a new query which is missing an index, or a bad query plan which ignores a suitable index might suggest a need to run an ANALYZE to update table statistics.
It goes without saying here that IOPS isn't the only metric to look at: throughput, the type of workload, and lots of other factors are at play. That said, storage performance is a topic that many people only start learning when they encounter a problem, and knowing that it’s something to look for is half the battle!
Crunchy Bridge offers competitive IOPS, both with the base machine and with burst capability. Our provision page as well as our docs explain the amount of burstable I/O available. Burstable I/O allows customers with certain workloads to provision smaller instances while still keeping up with heavier burst workloads, in some cases maximizing price to performance.
September 27, 2023