In my career I've worked on teams that ran and managed over a million databases, and I've helped shard and scale database for customers at 100s of terabytes.
Postgres is an incredibly flexible database, and because of it's flexibility you have a lot of options for architecting your data design for your application needs. In talking with current customers and future customers how you design and architect your database is a common topic. We're going to dig into some of the flexibility on architecting your database and terminology within Postgres and the trade-offs for different approaches.
Starting with an instance
For simplicity sake we're going to assume an instance, whether EC2 or in your own data center is how you're running Postgres. Yes, you have plenty of options to carve an instance up into smaller resources using LXC or cgroups, or you can fully containerize and run Kubernetes to further divide up your resources. If you're in that world and using our Kubernetes Operator you can for the sake of the rest of this post assume a container is the equivalent of an instance.
A running Postgres is a cluster
Under the covers this is essentially the running Postgres process. Sometimes you may refer to your running "database", but as you'll see later on calling it a database may lead to potential confusion. The running Postgres typically has a base backup and then associate write-ahead-log (WAL) that will be archived for it with a tool like pgBackRest. The backup and WAL contain everything for your Postgres cluster, it's all grouped together and not separated out by default. We covered streaming replication in our other post on logical replication and the difference between the two.
Within your Postgres cluster is a database
When you start to work with your database you never connect to a "cluster" but rather to a specific database. Let's break it down a little from a Postgres connection string:
If we break apart the connection string.
craig- is the username
PyLpvXPJUgzUBteEAKY- is the password
p.vcrwzi5u5beilpsgyy.db.postgresbridge.com- is the hostname
mydatabase- is the database within the cluster you're connecting to
5432- is the port it is connecting on
The database portion can be
postgres which is the default Postgres
database. If you connect to that database with the superuser you can also create
new databases. Databases cannot be easily joined between each other (I say
easily because you can use the Postgres foreign data wrapper as means of
connecting between them). When you do a
pg_dump against a Postgres database it
is only that database, where-as the base backup and WAL is for all of the
databases contained within a cluster.
Once you're connected to your specific database then you start designing your schema.
Inside a Postgres database
The term schema, similar to database, can be equally confusing. Most of the time
when you talk about schema you mean the tables and columns you create inside
your database. Postgres also has a notion of schemas, schemas are logically
separated by a namespace within the database. By default you're mostly working
public schema which you get by default, but be careful if you’re on
PG15 as there are some changes with
Inside Postgres you have what is known as a search_path:
show search_path; search_path ----------------- "$user", public (1 row)
You can update your search path to search across multiple schemas, it will
search them in order for matching table names. You can also execute fully
qualified queries that include the schema name. The below queries with a table
orders in your
public schema and the default search path are effectively the
exact same query:
SELECT * FROM orders; SELECT * FROM public.orders;
Clusters, databases, schemas, so what?
So we’ve told you a bunch of information about clusters, databases, and schemas, but you’re wondering how it fits in practically.
Clusters are the full instance that is running Postgres itself. You’re not getting away from this one, but know that base backups and WAL from a cluster is all grouped together. Thus if you’re needing to do something with smaller sets of backups and querying you’ll want to dig into how you separate things in to separate databases or schemas. As a broad rule 90% of applications have a cluster with a single database and single schema within them.
Databases are useful when entirely isolated. If you have multiple different applications, but a small number of them so you’re not worried about scaling connections to each of them, then they’re a great method of separating your data access. For example, you could have a database for your blog, one for your events calendar, and one for your CRM application. All these are quite distinct and don’t interact with each other.
Schemas can be useful for separating data that may otherwise follow the same table structure. In most cases you may not want to query across the full set of schemas, but for out of band reporting it could be useful. One of the most common practical applications of schemas is multi-tenant applications where you want to create a schema for user1, user2, user3. All the tables within those could look the same, but if you want analytical queries across your users you can easily do that.
Hopefully this gives you a good primer on something that you’ll often hear/see talked about (databases and schemas) in Postgres, both of which can be very overloaded to talk about a generic term of something very technical Postgres specific which can have very different implications.
October 27, 2022 •More by this author