Designing Your Postgres Database for Multi-tenancy
If you're building a B2B app chances are it's multi-tenant, meaning one customer data is separated and doesn't intermingle with other customer data. When building the app itself you'll typically have some URL route defining the specific tenant. When it comes to your database there are a number of design patterns that you can use that offer different advantages, trade-offs, and scaling opportunities.
When you begin building your app, time to market is essential, but you don't want to make decisions that are hard to reverse. We're first going to take a look at the most common design patterns for multi-tenant database design in Postgres, followed by some guidance on how to design your data model so that you're prepared from day 1 for massive scale (should you need it).
Three approaches to multi-tenant database design
One database per customer
The first option is simply to have a separate database per customer. In this case you could have complete copies of your web application running for them as well. This gives you the maximum amount of isolation, you can encrypt your databases with different keys, scale them independently of each other, upgrade them at different times, leave old versions of your app and migrations on one while still updating other customers.
If this all sounds great - slow down and beware. Maintaining different versions of your database and different versions of your schema, figuring out how to run migrations across multiple different databases (a seemingly solved problem for most frameworks when dealing with a single database) can become a significant pain at larger scale.
What is a larger scale? Managing 5 databases is fine, managing 10 you're probably okay, but if you anticipate 50 customers or more steer clear from this approach.
Schema per customer
The second approach is to use schemas. In many ways this approach is not dissimilar from the one database per customer. But there are both some pros and cons of this approach...
Pros:
- Can easily join across customers for cross tenant analytics
- Customers are fully isolated in schemas
Cons:
- Have to manage migrations across all schemas
Now there is an interesting caveat with this approach, which is that Citus 12 recently introduced not just support for schema based sharding, but special integration with PgBouncer. Citus first level of support to use schemas was great, but without connection scaling via PgBouncer less practical. Now schema based sharding with Citus and PgBouncer this has become a very viable approach with less tradeoffs.
Hold on! What is Citus?
Citus is an extension to Postgres that allows the sharding of data to be transparent to your application. For some word soup, Citus turns Postgres into a sharded, distributed, horizontally scalable database. It's very much designed for sharding with two real key use cases: HTAP/real-time analytics and multi-tenant sharding. It is not a geo-distributed database, but more built for large scale and high performance in a single data center or region. We'll come back to Citus and see how it can be great for some of these paths, but first let's complete our look at approaches.
Tenant discriminator in shared tables
With this approach you're using a tenant or customer id on your tables. At some level this is a common setup even when you're not thinking about it because if you have a multi-tenant app then you have a customer table that relates to other tables and you join those models together. A nuance of this approach is it's actually a good step to denormalize and have your customer or tenant id on every table. Database purists may tell you this is a bad idea, but it makes future scaling and safety of data a bit easier. Once you have your customer id on every table you want to ensure you're joining on that key.
There are a number of libraries that help with the enforcement of joining on your customer id (Rails acts_as_tenant or activerecord-mulit-tenant and Django multi-tenant). While this approach will intermingle data it can make scaling our much easier and reduce your maintenance overhead on operational activities of your Postgres database.
To Recap the approaches
Approach | Scale | Cons |
---|---|---|
Database per customer | 10s of tenants | High overhead of management, harder to scale |
Database per schema | 100s of tenants | Overhead on schema migrations |
Tenant discriminator | millions of tenants | Lack of tenant isolation |
Enter Citus
We talked a little bit above about Citus, what's it got to do with multi-tenant database design? At its core Citus is an extension that helps you scale out Postgres from a single node to multiple nodes. If you have a database that will need massive scale this is where Citus comes in. Many folks when they think distributed databases think geo-distributed, this is not where Citus shines. Citus gives linear scalability beyond a single node if your workload adapts well to Citus.
Within Citus when you shard your data on a tenant or customer id all the data gets co-located on the same instance. This means when you join you’re not doing cross shard joins, the join is pushed down to the node where all the data is located. Citus has long worked well for multi-tenant workloads because of this, it even has some special purpose features such as citus_stat_statements.
Citus_stat_statements build on one of the most valuable Postgres extensions pg_stat_statements, by preserving the shard key on the query. This means for a multi-tenant app you can see which tenants are executing which queries. Want to know your tenants that are putting the most load against your database, now you have it.
Citus has the ability to identify which tenants live on which nodes:
SELECT
shardid,
shardstate,
shardlength,
nodename,
nodeport,
placementid
FROM
pg_dist_placement AS placement,
pg_dist_node AS node
WHERE
placement.groupid = node.groupid
AND node.noderole = 'primary'
AND shardid = (
SELECT
get_shard_id_for_distribution_column ('opportunities', 62)
);
And from there you could actually move tenants around to different nodes or even fully isolate them: SELECT isolate_tenant_to_new_shard('table_name', tenant_id);
Citus and schema based sharding
While Citus has long embraced the tenant discriminator approach with Citus 12 there is new support for schema based sharding. Coupled with the latest release of PgBouncer, you also have better connection management and scaling without having to give up schema based sharding. While schema based sharding is available in Citus 12 it’s still generally recommended for a more medium level of tenants, generally not to exceed several thousand of tenants and thus corresponding schemas. If you’re anticipating millions of tenants the tenant discriminator approach is still recommended.
If we revisit our summary of approaches now with Citus:
Approach | Scale | Cons |
---|---|---|
Database per customer | 10s of tenants | High overhead of management, harder to scale |
Database per schema | 100s of tenants | Overhead on schema migrations |
Tenant discriminator | millions of tenants | Lack of tenant isolation |
Citus | millions of tenants | Must design application for Citus, but once in place easier management and scaling |