At Crunchy we talk a lot about memory, shared buffers, and cache hit ratios. Even our new playground tutorials can help users learn about memory usage. The gist of many of those conversations is that you want to have most of your frequently accessed data in the memory pool closest to the database, the shared buffer cache.
There's a lot more to the data flow of an application using Postgres than that. There could be application-level poolers and Redis caches in front of the database. Even on the database server, data exists at multiple layers, including the kernel and various on-disk caches. So for those of you that like to know the whole story, this post pulls together the full data flow for Postgres reads and writes, stem-to-stern.
The application server sends queries to the individual PostgreSQL backend and gets the result set back. However there may in fact be multiple data layers at play here.
Application caching can have many layers/places:
- Browser-level caching: a previously-requested resource can be re-used by the client without needing to request a new copy from the application server.
- Reverse proxy caches, i.e. Cloudflare, Nginx: a resource is requested by a user, but does not even need to hit an application server to return the result.
- Individual per-worker process caches: Within specific application code, each backend could store some state to reduce querying against the database.
- Framework-specific results or fragment caching: Whole parts of resources could be stored and returned piecemeal, or entire database result sets could be stored locally or in a shared resource outside of the database itself to obviate the need for accessing the database at all. This could be something like Redis or memcached to name a few examples.
When the application requests data that is not cached with one of the above methods, the application initiates an upstream connection to the database. Rather than always connecting directly, many application frameworks support application-level pooling. Application pooling allows multiple workers to share some smaller number of database connections among them. This reduces the resources like memory needed. At the same time, reusing open connections decreases the average time spent creating new database connections.
Once we reach the level of the database connection, we can see some of the ways that data flows there. Connections to the database may be direct or through a database pooler.
Similar to the application-level pooling, a database pooler can be placed between the incoming database connections and the PostgreSQL server backends. pgBouncer is the de facto connection pooling tool. A connection pooler allows requests to share database resources among others with similar connection requirements. This also ensures that you are using fewer connections more efficiently, rather than having many idle connections.
The database pooler acts as a proxy of sorts, intermixing client requests with a smaller number of upstream PostgreSQL connections.
When a connection is made to the PostgreSQL postmaster, a client backends is launched to communicate with it. This individual backend services all queries for a specific connection and returns the result sets. The client backend does this by coordinating access to table or index data through use of the shared_buffers memory segment. This is the point at which data requested and returned stops being "logical" requests and drills down to the filesystem.
When a query requires data from a specific table, it will first check shared_buffers to see if the target block already exists there. If not, it will read the block into shared_buffers from the disk IO system. Buffers are a shared resource that all PostgreSQL backends use. When a disk block is loaded for one backend, later queries requesting it will find it’s already loaded in memory.
If a query changes data in a table, it must first load the data page into shared_buffers (if it is not already loaded). The change is then made on the shared memory page, modified disk blocks written to the Write Ahead Log (assuming we are a LOGGED relation), and the page is marked dirty. Once the WAL page has been successfully written to disk at COMMIT time the transaction is safe on disk.
The block changes of dirty pages are written out asynchronously, with the eventual writer then marking it clean in shared_buffers. Possible writers include other (or the same) client(s), the database's Background Writer, and the system CHECKPOINT process. When multiple changes are made to the same disk pages in a short period, with enough memory this design enables an accelerated write path. Only a delta of additional WAL needs to be written each time the dirty page changes. Ideally the full content of the block is written to disk just once: during the next checkpoint.
If Postgres needs to load additional pages to answer a query and shared_buffers is full, it will pick a page that is currently unused and evict it. Even though this page is not now in shared_buffers it may still be in the filesystem cache from the original disk read.
In Linux, memory not in active use by programs caches recently used disk pages. This transparently accelerates workloads where data is re-read. Keeping the page in memory means we do not need to read it from the disk, a relatively slow process, if another client asks for it. Indexes are the classic example of a frequently re-read database hot spot.
Cached memory is available when needed for other parts of the system, so it doesn’t prevent programs from requesting additional memory. If this happens, the kernel will just drop some number of buffers from the OS cache for the kernel to fulfill the memory request.
For read buffers, there is no issue with dumping the contents of memory here;
worst case it will just reload the original data from the disk. When the WAL or
disk block changes are written, PostgreSQL waits for the write to complete via
the appropriate system kernel call, i.e.
fsync(). That ensures that the
contents of the changed disk buffers have made it to the hardware I/O controller
and potentially further.
Once you’ve made it to the I/O layer you might assume you’d be done with caching, but caching is everywhere. Most disks have an internal I/O cache for reads/writes which will buffer and reorder I/O access so the device can manage optimal access/throughput.
If you read a disk block from the operating system, the internal disk cache layer will likely read-ahead surrounding blocks to have them in the internal disk cache, available for subsequent reads. When you write to disk, even if you fsync, the drive itself may have a caching layer (be it battery-backed controller, SSD, or NVMe cache) that will buffer these writes, then flush out to physical storage at some point in the near-immediate future.
Congratulations, if you got this far then your disk writes have actually been saved on the underlying medium. These days that’s some form of SSD or NVMe storage. At this layer, the hardware disk cache durably writes data changes to disk and reads data from block addresses. This is generally considered the lowest level of our data layers.
Internally SSD and NVMe hardware can have their own caches (yes, even more!) below where the database operates. Examples include a DRAM metadata cache for flash mapping tables and/or a write cache using faster SLC flash cells.
.....and the diagram you've been scrolling for
Feel like you just took a trip to the center of the earth? Data flow from Postgres involves all of these parts to get you the most used data the fastest:
- Possible Application Pooler
- Individual Client Backend (Postgres connection)
- Shared Buffers
- File System Cache
- Disk Cache
- Physical Disk Storage
September 19, 2022 •More by this author