I've always found distributed systems to be the most fascinating branch of computer science. I think the reason is that distributed systems are subject to the rules of the physical world just like we are. Things are never perfect, you cannot get everything you want, you’re always limited by physics, and often by economics, or by who you can communicate with. Many problems in distributed systems simply do not have a clean solution, instead there are different trade-offs you can make.
While at Citus Data, Microsoft, and now Crunchy Data, the focus of my work has been on distributed PostgreSQL architectures. At the last PGConf.EU in December, I gave a talk titled “PostgreSQL Distributed: Architectures & Best Practices” where I went over various kinds of distributed PostgreSQL architectures that I’ve encountered over the years.
Many distributed database discussions focus on algorithms for distributed query planning, transactions, etc. These are very interesting topics, but the truth is that only a small part of my time as a distributed database engineer goes into algorithms, and an excessive amount of time goes into making very careful trade-offs at every level (and of course, failure handling, testing, fixing bugs). Similarly, what many users notice within the first few minutes of using a distributed database is how unexpectedly slow they can be, because you quickly start hitting performance trade-offs.
There are many types of distributed PostgreSQL architectures, and they each make a different set of trade-offs. Let’s go over some of these architectures.
To set the stage for discussing distributed PostgreSQL architectures, we first need to understand a bit about the simplest possible architecture: running PostgreSQL on a single machine, or "node".
PostgreSQL on a single machine can be incredibly fast. There’s virtually no network latency on the database layer and you can even co-locate your application server. Millions of IOPS are available depending on the machine configuration. Disk latency is measured in microseconds. In general, running PostgreSQL on a single machine is a performant and cost-efficient choice.
So why doesn’t everyone just use a single machine?
Many companies do. However, PostgreSQL on a single machine comes with operational hazards. If the machine fails, there’s inevitably some kind of downtime. If the disk fails, you’re likely facing some data loss. An overloaded system can be difficult to scale. And you’re limited to the storage size of a single disk, which, once full, can no longer accept new data. That very low latency and efficiency clearly comes at a price.
Distributed PostgreSQL architectures are ultimately trying to address the operational hazards of a single machine in different ways. In doing so, they do lose some of its efficiency, and especially the low latency.
The goal of a distributed database architecture is to meet the availability, durability, performance, regulatory, and scale requirements of large organizations, subject to the laws of physics. The ultimate goal is to do so with the same rich functionality and precise transactional semantics as a single-node RDBMS.
There are several mechanisms that distributed database systems employ to achieve this, namely:
- Replication - Place copies of data on different machines
- Distribution - Place partitions of data on different machines
- Decentralization - Place different DBMS activities on different machines
In practice, each of these mechanisms inherently comes with concessions in terms of performance, transactional semantics, functionality, and/or operational complexity.
To get a nice thing, you’ll have to give up a nice thing, but there are many different combinations of what you can get and what you need to give up.
Of course, distributed systems have already taken over the world, and most of the time we don’t really need to worry a lot about trade-offs when using them. Why would distributed database systems be any different?
The difference lies in a combination of storing the authoritative state for the application, the rich functionality that an RDBMS like PostgreSQL offers, and the relatively high impact of latency on client-perceived performance in OLTP systems.
PostgreSQL, like most other RDBMSs, uses a synchronous, interactive protocol where transactions are performed step-by-step. The client waits for the database to answer before sending the next command, and the next command might depend on the answer to the previous.
Any network latency between client and database server will already be a noticeable factor in the overall duration of a transaction. When PostgreSQL itself is a distributed system that makes internal network round trips (e.g. while waiting for WAL commit), the duration can get many times higher.
Why is it bad for transactions to take longer? Surely humans won’t notice if they need to wait 10-20ms? Well, if transactions take on average 20ms, then a single (interactive) session can only do 50 transactions per second. You then need a lot of concurrent sessions to actually achieve high throughput.
Having many sessions is not always practical from the application point of view, and each session uses significant resources like memory on the database server. Most PostgreSQL setups limit the maximum number of sessions to the hundreds or low thousands, which puts a hard limit on achievable transaction throughput when network latency is involved. In addition, any operation that holds locks while waiting for network round trips will further reduce the achievable concurrency.
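The arithmetic above can be sketched in a few lines (the numbers are illustrative, not benchmarks):

```python
import math

def max_throughput(latency_ms: float, sessions: int) -> float:
    """Transactions/sec when each session waits latency_ms per transaction."""
    per_session_tps = 1000.0 / latency_ms
    return per_session_tps * sessions

def sessions_needed(target_tps: float, latency_ms: float) -> int:
    """Concurrent sessions required to reach target_tps at a given latency."""
    return math.ceil(target_tps / (1000.0 / latency_ms))

# A single session at 20ms per transaction manages only 50 tps.
assert max_throughput(20, 1) == 50.0

# Reaching 10,000 tps at that latency takes 200 concurrent sessions,
# which already eats into typical max_connections limits.
assert sessions_needed(10_000, 20) == 200
```

The same target at 1ms latency would need only 10 sessions, which is why single-machine latency translates so directly into throughput headroom.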
While in theory, latency does not have to affect performance so much, in practice it almost always does. The CIDR ‘23 paper “Is Scalable OLTP in the Cloud a solved problem?” gives a nice discussion of the issue of latency in section 2.5.
PostgreSQL can be distributed at many different layers that hook into different parts of its own architecture and make different trade-offs. In the following sections, we will discuss these well-known architectures:
- Network-attached block storage (e.g. EBS)
- Read replicas
- DBMS-optimized cloud storage (e.g. Aurora)
- Active-active (e.g. BDR)
- Transparent Sharding (e.g. Citus)
- Distributed key-value stores with SQL (e.g. Yugabyte)
We will describe the pros and cons of each architecture, relative to running PostgreSQL on a single machine.
Note that many of these architectures are orthogonal. For instance, you could have a sharded system with read replicas using network-attached storage, or an active-active system that uses DBMS-optimized cloud storage.
Network-attached block storage is a common technique in cloud-based architectures where the database files are stored on a different device. The database server typically runs in a virtual machine in a Hypervisor, which exposes a block device to the VM. Any reads and writes to the block device will result in network calls to a block storage API. The block storage service internally replicates the writes to 2-3 storage nodes.
Practically all managed PostgreSQL services use network-attached block devices because the benefits are critical to most organizations. The internal replication results in high durability and also allows the block storage service to remain available when a storage node fails. The data is stored separately from the database server, which means the database server can easily be respawned on a different machine in case of failure, or when scaling up/down. Finally, the disk itself is easily resizable and supports snapshots for fast backups and creating replicas.
Getting so many nice things does come at a significant performance cost. Where modern NVMe drives generally achieve over 1M IOPS with disk latency in the tens of microseconds, network-attached storage is often below 10K IOPS with disk latency above 1ms, especially for writes. That is roughly a two-order-of-magnitude difference.
Pros:
- Higher durability (replication)
- Higher uptime (replace VM, reattach)
- Fast backups and replica creation (snapshots)
- Disk is resizable
Cons:
- Higher disk latency (~20μs -> ~1000μs)
- Lower IOPS (~1M -> ~10k IOPS)
- Crash recovery on restart takes time
- Cost can be high
💡 Guideline: the durability and availability benefits of network-attached storage usually outweigh the performance downsides, but it’s worth keeping in mind that PostgreSQL can be much faster.
PostgreSQL has built-in support for physical replication to read-only replicas. The most common way of using a replica is to set it up as a hot standby that takes over when the primary fails in a high-availability setup. There are many blogs, books, and talks describing the trade-offs of high-availability setups, so in this post I will focus on other architectures.
Another common use of read replicas is to scale read throughput when reads are CPU- or I/O-bottlenecked. Load balancing queries across replicas achieves linear scalability of reads, and offloading reads from the primary also speeds up writes.
A challenge with read replicas is that there is no prescribed way of using them. You have to decide on the topology and how you query them, and in doing so you will be making distributed systems trade-offs yourself.
The primary usually does not wait for replication when committing a write, which means read replicas are always slightly behind. That can become an issue when your application does a read that, from the user’s perspective, depends on a write that happened earlier. For example, a user clicks “Add to cart”, which adds the item to the shopping cart and immediately sends the user to the shopping cart page. If reading the shopping cart contents happens on the read replica, the shopping cart might then appear empty. Hence, you need to be very careful about which reads use a read replica.
Even if reads do not directly depend on a preceding write, at least from the client perspective, there may still be strange time-travel anomalies. When load balancing between different nodes, clients might repeatedly get connected to different replicas and see a different state of the database. As distributed systems engineers, we say that there is no “monotonic read consistency”.
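One common client-side mitigation is to pin a session’s reads to the primary for a short window after that session writes. The sketch below is hypothetical: the router class and the 100ms lag bound are assumptions for illustration, not PostgreSQL features.

```python
import time

# Assumed upper bound on replication lag; a guess, not a guarantee.
REPLICA_CATCHUP_WINDOW = 0.1  # seconds

class SessionRouter:
    """Per-session read routing: primary right after a write, replicas otherwise."""

    def __init__(self):
        self.last_write_at = float("-inf")  # no writes yet

    def record_write(self):
        self.last_write_at = time.monotonic()

    def target_for_read(self) -> str:
        # Pin dependent reads to the primary; load-balance the rest.
        if time.monotonic() - self.last_write_at < REPLICA_CATCHUP_WINDOW:
            return "primary"
        return "replica"

router = SessionRouter()
router.record_write()
assert router.target_for_read() == "primary"   # read-your-writes preserved
time.sleep(REPLICA_CATCHUP_WINDOW * 1.5)
assert router.target_for_read() == "replica"   # safe to offload again
```

A more robust variant tracks WAL positions instead of wall-clock time, for instance by comparing the primary’s `pg_current_wal_lsn()` with a replica’s `pg_last_wal_replay_lsn()` before sending a dependent read to that replica.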
Another issue with read replicas is that, when queries are load balanced randomly, the replicas will each have similar cache contents. While that is great when certain queries are extremely hot, it becomes painful when the frequently read data (the working set) no longer fits in memory and each read replica performs a lot of redundant I/O. In contrast, a sharded architecture would divide the data over the combined memory of the nodes and avoid that I/O.
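A back-of-the-envelope model makes the cache argument concrete; the sizes here are made-up assumptions, not measurements:

```python
# Illustrative cluster: 3 nodes, 128GB RAM each, 300GB working set.
WORKING_SET_GB = 300
NODES = 3
RAM_PER_NODE_GB = 128

# Random load balancing: every replica sees reads for the whole working
# set, so each can cache at most RAM / working-set of it.
replica_hit_rate = min(1.0, RAM_PER_NODE_GB / WORKING_SET_GB)

# Sharding: each node only serves its own 1/NODES slice of the data.
shard_size = WORKING_SET_GB / NODES
sharded_hit_rate = min(1.0, RAM_PER_NODE_GB / shard_size)

assert round(replica_hit_rate, 2) == 0.43  # ~43% cache hits per replica
assert sharded_hit_rate == 1.0             # each shard fits in memory
```

The replicas hold three near-identical copies of the hottest 128GB, while the sharded cluster caches the entire 300GB between them.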
Read replicas are a powerful tool for scaling reads, but you should consider whether your workload is really suited to them.
Pros:
- Read throughput scales linearly
- Low latency stale reads if read replica is closer than primary
- Lower load on primary
Cons:
- Eventual read-your-writes consistency
- No monotonic read consistency
- Poor cache usage
💡 Guideline: Consider using read replicas when you need >100k reads/sec or observe a CPU bottleneck due to reads, best avoided for dependent transactions and large working sets.
There are a number of cloud services now like Aurora and AlloyDB that provide a network-attached storage layer that is optimized specifically for a DBMS.
In particular, a DBMS normally performs every write in two different ways: immediately to the write-ahead log (WAL), and in the background to a data page (or several pages, when indexes are involved). Normally, PostgreSQL performs both of these writes, but in the DBMS-optimized storage architecture the background page writes are performed by the storage layer instead, based on the incoming WAL. This reduces the amount of write I/O on the primary node.
The WAL is typically replicated directly from the primary node to several availability zones, with the network round trips done in parallel, which adds replication I/O back. Always writing to multiple availability zones also increases the write latency, which can result in lower per-session performance. In addition, read latency can be higher because the storage layer does not always materialize pages in memory. Architecturally, PostgreSQL is also not optimized for these storage characteristics.
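A rough model of primary-node write I/O shows why shipping only WAL can help. Every per-row size below is an assumption chosen for illustration, not a measured value; only the 8kB page size is PostgreSQL’s actual default.

```python
# Illustrative workload and assumed sizes.
ROWS_PER_SEC = 10_000
WAL_BYTES_PER_ROW = 100        # assumed average WAL record size
PAGE_SIZE = 8192               # PostgreSQL's default page size
DIRTY_PAGES_PER_ROW = 0.2      # assumed: heap + index pages, amortized

# Classic PostgreSQL: the primary writes WAL *and* flushes dirty pages.
classic = ROWS_PER_SEC * (WAL_BYTES_PER_ROW + DIRTY_PAGES_PER_ROW * PAGE_SIZE)

# DBMS-optimized storage: page writes move to the storage layer.
optimized = ROWS_PER_SEC * WAL_BYTES_PER_ROW

assert classic / optimized > 10  # an order of magnitude less write I/O
```

Under these assumptions the primary emits over 10x less write traffic, which is the headroom these services are selling; whether it materializes depends heavily on the workload.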
While the theory behind DBMS-optimized storage is sound, in practice the performance benefits are often not very pronounced (and can even be negative), and the cost can be much higher than regular network-attached block storage. It does offer a greater degree of flexibility to the cloud service provider, for instance in terms of attach/detach times, because storage is controlled in the data plane rather than the hypervisor.
Pros:
- Potential performance benefits by avoiding page writes from primary
- Replicas can reuse storage, incl. hot standby
- Can do faster reattach, branching than network-attached storage
Cons:
- Write latency is high by default
- High cost / pricing
- PostgreSQL is not designed for it, not OSS
💡 Guideline: Can be beneficial for complex workloads, but important to measure whether price-performance under load is actually better than using a bigger machine.
In the active-active architecture any node can locally accept writes without coordination with other nodes. It is typically used with replicas in multiple sites, each of which will then see low read and write latency, and can survive failure of other sites. These benefits are phenomenal, but of course come with a significant downside.
First, you have the typical eventual-consistency downsides of read replicas. However, the main challenge with an active-active setup is that update conflicts are not resolved upfront. Normally, if two concurrent transactions try to update the same row in PostgreSQL, the first one takes a “row-level lock” and the second one waits. In an active-active setup, both updates might be accepted concurrently.
For instance, when you perform two simultaneous updates of a counter on different nodes, the nodes might both see 4 as the current value and set the new value to 5. When replication happens, they’ll happily agree that the new value is 5 even though there were two increment operations.
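The lost-update scenario can be simulated in a few lines:

```python
# Simulation of the lost-update anomaly described above: two nodes accept
# concurrent increments of the same counter, then merge the results.

counter = 4                  # replicated starting state on both nodes

node_a = counter + 1         # node A: reads 4, writes 5
node_b = counter + 1         # node B: concurrently reads 4, writes 5

# Asynchronous replication then reconciles the versions; with value-based
# (last-writer-wins style) resolution both nodes happily converge on 5...
merged = max(node_a, node_b)
assert merged == 5           # ...even though two increments happened

# A single node serializes the updates via row-level locks instead:
serialized = (counter + 1) + 1
assert serialized == 6
```

One increment has silently vanished; no error was raised at any point, which is exactly what makes this anomaly hard to detect in production.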
Active-active systems do not have a linear history, even at the row level, which makes them very hard to program against. However, if you are prepared to live with that, the benefits can be attractive, especially the very high availability.
Pros:
- Very high read and write availability
- Low read and write latency
- Read throughput scales linearly
Cons:
- Eventual read-your-writes consistency
- No monotonic read consistency
- No linear history (updates might conflict after commit)
💡 General guideline: Consider only for very simple workloads (e.g. queues) and only if you really need the benefits.
Transparent sharding systems like Citus distribute tables by a shard key and/or replicate tables across multiple primary nodes. Each node shows the distributed tables as if they were regular PostgreSQL tables and queries & transactions are transparently routed or parallelized across nodes.
Data is stored in shards, which are regular PostgreSQL tables, which can take advantage of indexes, constraints, etc. In addition, the shards can be co-located by the shard key (in “shard groups”), such that joins and foreign keys that include the shard key can be performed locally.
The advantage of distributing the data this way is that you can take advantage of the memory, IO bandwidth, storage, and CPU of all the nodes in an efficient manner. You could even ensure that your data or at least your working set always fits in memory by scaling out.
Scaling out transactional workloads is most effective when queries have a filter on the shard key, such that they can be routed to a single shard group (e.g. single tenant in a multi-tenant app). That way, there is only a marginal amount of overhead compared to running a query on a single server, but you have a lot more capacity. Another effective way of scaling out is when you have compute-heavy analytical queries that can be parallelized across the shards (e.g. time series / IoT).
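The routing logic can be sketched as follows. Everything here is hypothetical: the shard count, node names, and helper functions are illustrative, and a real system like Citus uses stable hash functions and shard-placement metadata rather than plain modulo arithmetic.

```python
NUM_SHARDS = 32
WORKER_NODES = ["worker-1", "worker-2", "worker-3", "worker-4"]

def shard_for(tenant_id: int) -> int:
    # Stand-in for a stable hash of the shard key.
    return tenant_id % NUM_SHARDS

def node_for(shard: int) -> str:
    # Stand-in for a shard-placement lookup in metadata tables.
    return WORKER_NODES[shard % len(WORKER_NODES)]

def route(tenant_id=None):
    """Return the nodes a query must touch."""
    if tenant_id is not None:
        # Shard-key filter: route to exactly one shard group.
        return [node_for(shard_for(tenant_id))]
    # No shard-key filter: fan out to every node.
    return list(WORKER_NODES)

assert len(route(tenant_id=1042)) == 1        # single-node routing
assert len(route()) == len(WORKER_NODES)      # full fan-out
```

The difference between the two return paths is the whole performance story: the first adds one hop of overhead, the second pays the full cost of distributed execution.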
However, there is also higher latency, which reduces per-session throughput compared to a single machine. And if you have a simple lookup without a shard key filter, you will still experience all the overhead of parallelizing the query across nodes. Finally, there may be restrictions in terms of data model (e.g. unique and foreign key constraints must include the shard key), SQL (non-co-located correlated subqueries), and transactional guarantees (snapshot isolation only at the shard level).
Using a sharded system often means that you will need to adjust your application to deal with higher latency and a more rigid data model. For instance, if you are building a multi-tenant application you will need to add tenant ID columns to all your tables to use as a shard key, and if you are currently loading data using INSERT statements then you might want to switch to COPY to avoid waiting for every row.
If you are willing to adjust your application, sharding can be one of the most powerful tools in your arsenal for dealing with data-intensive applications.
Pros:
- Scale throughput for reads & writes (CPU & IOPS)
- Scale memory for large working sets
- Parallelize analytical queries, batch operations
Cons:
- High read and write latency
- Data model decisions have high impact on performance
- Snapshot isolation concessions
💡 General guideline: Use for multi-tenant apps, otherwise use for large working set (>100GB) or compute heavy queries.
About a decade ago, Google Spanner introduced the notion of a distributed key-value store that supports transactions across nodes (key ranges) with snapshot isolation in a scalable manner by using globally synchronized clocks. Subsequent evolutions of Spanner then added a SQL layer on top, and ultimately even a PostgreSQL interface. Open source alternatives like CockroachDB and Yugabyte followed a similar approach without the requirement of synchronized clocks, at the cost of significantly higher latency.
These systems have built on top of existing key-value storage techniques for availability and scalability, such as shard-level replication and failover using Paxos or Raft. Tables are then stored in the key-value store, with the key being a combination of the table ID and the primary key. The SQL engine is adjusted accordingly, distributing queries where possible.
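The table-to-key-value mapping can be sketched as follows. The encoding is illustrative only: real systems use tuned binary formats and a range-partitioned, replicated store, not a Python dict.

```python
import json

def kv_key(table_id: int, primary_key: int) -> bytes:
    # Fixed-width big-endian encoding keeps keys sorted by (table, pk),
    # so a table scan becomes a contiguous range scan in the KV store.
    return table_id.to_bytes(4, "big") + primary_key.to_bytes(8, "big")

store = {}  # stand-in for the distributed key-value store

def insert(table_id: int, pk: int, row: dict):
    store[kv_key(table_id, pk)] = json.dumps(row).encode()

insert(1, 42, {"name": "alice"})     # table 1 = e.g. users
insert(1, 7,  {"name": "bob"})
insert(2, 42, {"total": 100})        # table 2 = e.g. orders

# All rows of table 1 form one contiguous, sorted key range:
table1 = [k for k in sorted(store) if k[:4] == (1).to_bytes(4, "big")]
assert len(table1) == 2
```

A secondary index is yet another key range, and its entries may live on a different node than the rows they point to, which is where the extra internal network hops for joins and index lookups come from.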
In my view, the relational data model (or, your typical PostgreSQL app) is not well-served by using a distributed key-value store underneath. Related tables and indexes are not necessarily stored together, meaning typical operations such as joins and evaluating foreign keys or even simple index lookups might incur an excessive number of internal network hops. The relatively strong transactional guarantees that involve additional locks and coordination can also become a drag on performance.
In comparison to PostgreSQL or Citus, performance and efficiency are often disappointing. However, these systems offer much richer (PostgreSQL-like) functionality than existing key-value stores, and better scalability than consensus stores like etcd, so they can be a great alternative for those.
Pros:
- Good read and write availability (shard-level failover)
- Single table, single key operations scale well
- No additional data modeling steps or snapshot isolation concessions
Cons:
- Many internal operations incur high latency
- No local joins in current implementations
- Not actually PostgreSQL, and less mature and optimized
💡 General guideline: Just use PostgreSQL 😉 For simple applications, the availability and scalability benefits can be useful.
PostgreSQL can be distributed at different layers. Each architecture can introduce severe trade-offs. Almost nothing comes for free.
When deciding on the database architecture, keep asking yourself:
- What do I really want?
- Which architecture achieves that?
- What are the downsides?
- What can my application tolerate? (can I change my application?)
Even with state-of-the-art tools, deploying a distributed database system is never a solved problem, and perhaps never will be. You will need to spend some time understanding the trade-offs. I hope this blog post will help.
If you’re still feeling a bit lost, our PostgreSQL experts at Crunchy Data will be happy to help you pick the right architecture for your application.
January 8, 2024