Is your Postgres ready for production?

Is your database ready for production?

You've been building your application for months, you've tested with beta users, you've gotten feedback and iterated. You've gone through your launch checklist, email beta users, publish the blog post, post to hacker news and hope the comments are friendly. But is your database ready for whatever may come on launch day or even 2 months in? Here's a handy checklist to make sure you're not caught flat footed.

Backups❓
High availability❓
Logs properly configured❓
- Archived and persisted❓
- Logging slow queries❓
- Auto explain❓
Statement timeout❓
Connection pooling❓

Let’s drill in further to some of these key items.

Backups

Postgres has a few different mechanisms for backups.

Logical backups
Physical backups.

A logical backup is the raw SQL data that can be loaded in a form of INSERTS or similar. Logical backups are great if you want to move data around between environments or need a local copy.

Physical backups are a combination of a base backup (how the bytes look on disk) and the write-ahead-log or WAL. In over simplified terms Postgres is really just a large append only log under the covers known as WAL. By combining the base backup with WAL you’re able to have database backup that can be for a range in time vs. a single snapshot like with pg_dump.

Both logical and physical have their time in place, but a key piece of both is that they should be tested. A backup system that isn’t tested has a good chance of not being a working backup.

High-availability

One might think this one needs no explanation, or that if you have backups you’re fine. However we talk to customers all the time that struggle making the decision if they need HA. With no HA if there is a failure to your database that means you have to recover from a backup. If you’re using logical backups you could lose data between the time of your backup and the time your database went down.

If you have a good setup with physical backups and you’re using something like pgBackRest for continuous protection your data is safe. But you still may have significant downtime as you work to recover your database. A rough rule of thumb is you should plan for 1 hr of downtime per 200GB of database size.

Enter HA, with HA you have a streaming replica that in real-time (either synchronous or asynchronous) receives the transactions from a primary that is ready to be failed over to. Using something like disk replication could result in unexpected failover times due to Postgres having to go through crash recovery. If uptime is critical, a good HA setup is your friend.

Logging

When it comes time to investigate performance issues, logs can be a key source of information.

For the basics, the first step is making sure you have good log retention. You want some goldilocks zone, you don’t want to log every query that runs against your database. But you don’t want only queries that run for hours to be logged. You don’t need to retain logs for months and months, usually a few weeks of retention is sufficient, but this really does come down to your business needs.

We generally recommend leveraging a third party logging service or tool vs. self managing all of this, such as Mezmo. Then, send all your logs from application, background processes, and other services in a single place.

You’ve got retention, but what actually goes into the logs? The couple of big ones are:

Auto-log slow queries
Include the explain plan of slow queries in logs

Care about compliance and auditing of who accesses your database and how? Take advantage of pg_audit (already built in to Crunchy Bridge) to audit queries from any of your non-application users.

You can dig into each of those more here.

Don't let one bad query grind production to a halt

Are you scared of the following running?

SELECT *
FROM events;

Most of the time, these particular issues happen when analysis is being run against a production database. At small companies, it is common for the application database to also be the warehouse and also feed the CRM. If you choose to run this way, use statement_timeout.

If you are a larger, more sophisticated company (who has already experienced the pain of letting people connect to the production database), run analysis on read-only replicas or ETL your data into a data warehouse.

If your application, or your customer base is generating the bad queries and not sure where to begin, take a deeper read on how to control runaway queries.

Connection pooling

In a serverless world, if your application is slow, what do you do? Run more application runners!

Not so fast. Postgres has connection limits because each connection has a dedicated set of memory for cursor-level database operations, like sorting and writing transactions. If Postgres allowed too many connections, all of the memory would be dedicated to serving connections instead of indexes (indexes love RAM). Of note, Postgres has greatly improved on connection management in recent releases. The old rules of not using more than 500 connections don’t apply the same as 3 years ago, but bounds and limits still exist here.

Thus, when you start scaling out your application, it’s best to have connection pooling between your application and your database. Crunchy Bridge uses PgBouncer for this connection pooling — and it’s already set up and ready for you to use. Check out the documentation here. It’s easy enough to start using it at launch, then you won’t have to switch to using it later.

Final thoughts

Your Postgres can be ready for production in just a few steps.

Backups ✅
High availability ✅
Logs ✅
- Archived ✅
- Logging slow queries ✅
- Auto explain ✅
Statement timeout set ✅
Connection pooling enabled ✅

My 2 cents - find a vendor that makes your checklist as easy as possible.

Latest Articles