Today, we want to address some basic principles for better managing data architecture. Postgres is well regarded as a database for a traditional system of record. More recently we've been fielding questions about what else it can do, such as: can it be good for analytics and metrics? The short answer is "yes". When applications expand outside their standard system of record, they add new types of data and data stores, which introduces the complexity of managing multiple types of systems.
Some common workloads for Postgres are:
- Primary system of record
- Metrics / analytics data
- Log / event data
If you know the toolset, Postgres handles a variety of workloads gracefully. We've talked a lot about how flexible it is in terms of features. By combining those features, the database becomes equally flexible across multiple use cases.
We use Postgres to power the daily performance metrics system for Crunchy Bridge. Postgres works quite well for this, as I laid out in the Twitter thread above. The details of Crunchy Bridge's metrics system are:
- Ingesting 50M events daily, without breaking much of a sweat
- Currently a basic metrics schema with Postgres partitioning, and a few key indexes
- In the future we'll do aggregations that leverage pg_cron and (possibly) hyperloglog
- By running the metrics data on a different cluster, differing workloads are not competing for the same resources
- Because it's all "just Postgres", we connect metrics into our system of record with a foreign data wrapper
Using simple, built-in tools avoids premature optimization, and knowing the next steps allows us to focus on continued investment in features that make a positive impact for our customers.
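As a rough sketch of the foreign data wrapper connection mentioned in the list above, the Python helper below renders the `postgres_fdw` statements one might run on the system-of-record cluster. The server name, host, schema, and table names are hypothetical placeholders rather than Crunchy Bridge's actual configuration, and credentials are deliberately left out.

```python
def fdw_setup_sql(server, host, dbname, remote_user, local_schema, tables):
    """Render statements that expose tables from a remote metrics cluster.

    All arguments are illustrative. Values are interpolated directly, so
    they must come from trusted configuration, never from user input.
    """
    table_list = ", ".join(tables)
    return [
        "CREATE EXTENSION IF NOT EXISTS postgres_fdw;",
        f"CREATE SERVER {server} FOREIGN DATA WRAPPER postgres_fdw "
        f"OPTIONS (host '{host}', dbname '{dbname}');",
        # Credential options are omitted here; supply them however your
        # environment requires.
        f"CREATE USER MAPPING FOR CURRENT_USER SERVER {server} "
        f"OPTIONS (user '{remote_user}');",
        f"CREATE SCHEMA IF NOT EXISTS {local_schema};",
        f"IMPORT FOREIGN SCHEMA public LIMIT TO ({table_list}) "
        f"FROM SERVER {server} INTO {local_schema};",
    ]


# Hypothetical names for illustration only.
statements = fdw_setup_sql(
    server="metrics_cluster",
    host="metrics.internal.example",
    dbname="metrics",
    remote_user="metrics_reader",
    local_schema="metrics",
    tables=["events"],
)
for stmt in statements:
    print(stmt)
```

Once the foreign schema is imported, the remote table can be queried and joined from the system of record like a local relation, with the metrics workload still running on its own cluster.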
Let's look at the three workloads we called out above to see what's different about each of them.
A "system of record" is the typical design for the specifics of your application. If you are building a CRM, the data consists of accounts, contacts, and opportunities. If, like Crunchy Data, you are building a DBaaS, the data consists of tables of accounts, teams, clusters, and networks. The characteristics of a system of record are:
- Smaller working set of data that often can fit in memory
- Support for small, fast queries that return in single-digit milliseconds or less
- Data consistency enforced with primary keys, foreign keys, and data validity constraints
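To illustrate the consistency point in the list above, here is a minimal, runnable sketch. It uses Python's built-in `sqlite3` module as a stand-in so it runs without a server; this is SQLite, not Postgres, though the constraint clauses shown (`PRIMARY KEY`, `REFERENCES`, `CHECK`) are also valid Postgres syntax. The `accounts` and `clusters` columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when asked; Postgres enforces them by default.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE accounts (
    id    INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE CHECK (email LIKE '%@%')
);
CREATE TABLE clusters (
    id         INTEGER PRIMARY KEY,
    account_id INTEGER NOT NULL REFERENCES accounts (id),
    name       TEXT NOT NULL,
    storage_gb INTEGER NOT NULL CHECK (storage_gb > 0)
);
""")


def try_insert(sql, params):
    """Run one insert in its own transaction and report whether it was accepted."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"


INSERT_CLUSTER = (
    "INSERT INTO clusters (id, account_id, name, storage_gb) VALUES (?, ?, ?, ?)"
)
results = [
    try_insert("INSERT INTO accounts (id, email) VALUES (?, ?)", (1, "ops@example.com")),
    try_insert(INSERT_CLUSTER, (1, 1, "prod", 100)),   # valid row
    try_insert(INSERT_CLUSTER, (2, 99, "orphan", 50)), # no such account
    try_insert(INSERT_CLUSTER, (3, 1, "empty", 0)),    # fails the CHECK
]
print(results)
```

The database, rather than application code, refuses the orphaned cluster and the zero-sized one, which is the kind of guarantee a system of record relies on.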
With metrics or analytics systems you're going to ingest a larger amount of data. Typical characteristics of these systems are:
- Consistently growing write volume
- Larger amount of data that is unlikely to fit entirely in memory
- Customer interactions require responsive queries
- Older data is not read as frequently as new data
Large ingest systems that have time to mature tend to move through a few stages:
- Start with raw inserts
- Move to multi-row inserts
- Micro-batch with COPY
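The later two stages above can be sketched with a few pure Python helpers: one groups incoming events into micro-batches, one builds a multi-row `INSERT`, and one encodes rows in the text format that `COPY ... FROM STDIN` accepts. The `events` table and its columns are hypothetical, and the driver usage in the trailing comment assumes psycopg 3; adapt it to whatever client you use.

```python
from itertools import islice


def batches(rows, size):
    """Yield lists of at most `size` rows from any iterable (micro-batching)."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def multi_row_insert(table, columns, n_rows):
    """Build one INSERT carrying n_rows groups of %s placeholders."""
    group = "(" + ", ".join(["%s"] * len(columns)) + ")"
    values = ", ".join([group] * n_rows)
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES {values}"


def copy_text(rows):
    """Encode rows in COPY text format: tab-separated fields, \\N for NULL."""
    def field(value):
        if value is None:
            return r"\N"
        text = str(value)
        # Escape the backslash first, then the delimiter and line breaks.
        return (text.replace("\\", "\\\\")
                    .replace("\t", "\\t")
                    .replace("\n", "\\n")
                    .replace("\r", "\\r"))
    return "".join("\t".join(field(v) for v in row) + "\n" for row in rows)


events = [(1, "login", None), (1, "query", "select 1"), (2, "login", None)]
for chunk in batches(events, 2):
    print(multi_row_insert("events", ["customer_id", "event_type", "detail"], len(chunk)))
    print(copy_text(chunk), end="")

# With psycopg 3 (an assumption), a chunk could be streamed roughly like this:
#   with cur.copy("COPY events (customer_id, event_type, detail) FROM STDIN") as copy:
#       copy.write(copy_text(chunk))
```

Each stage cuts the number of round trips and statements per event, which is why systems tend to adopt them in this order as write volume grows.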
Beyond ingestion, queries against the raw data will become slower and slower as the data grows. To keep read queries performant, use:
- Partitioning so you don't have to always query the entire dataset
- Limited number of targeted indexes, like customer_id and event_type
- Rollups (saving the output of queries) to precompute the views that most matter to your customers
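A minimal sketch of these three techniques, written as Python that renders Postgres SQL: a range-partitioned parent table with two targeted indexes, a helper that produces the DDL for one monthly partition, and a daily rollup statement. The table names, columns, and the monthly partition size are assumptions for illustration, not the schema described in this post.

```python
from datetime import date

# Hypothetical parent table, partitioned by event time, with targeted indexes.
PARENT_SQL = """
CREATE TABLE events (
    customer_id bigint      NOT NULL,
    event_type  text        NOT NULL,
    occurred_at timestamptz NOT NULL,
    payload     jsonb
) PARTITION BY RANGE (occurred_at);
CREATE INDEX ON events (customer_id, occurred_at);
CREATE INDEX ON events (event_type, occurred_at);
"""


def month_partition_sql(parent: str, month: date) -> str:
    """Render DDL for the partition covering the calendar month of `month`."""
    start = month.replace(day=1)
    if start.month == 12:
        end = date(start.year + 1, 1, 1)
    else:
        end = date(start.year, start.month + 1, 1)
    name = f"{parent}_{start:%Y_%m}"
    # Range partitions include the lower bound and exclude the upper bound.
    return (
        f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{start.isoformat()}') TO ('{end.isoformat()}');"
    )


# Hypothetical rollup of yesterday's events. Assumes events_daily has a unique
# constraint on (day, customer_id, event_type) so re-running it is safe.
ROLLUP_SQL = """
INSERT INTO events_daily (day, customer_id, event_type, event_count)
SELECT date_trunc('day', occurred_at)::date, customer_id, event_type, count(*)
FROM events
WHERE occurred_at >= date_trunc('day', now()) - interval '1 day'
  AND occurred_at <  date_trunc('day', now())
GROUP BY 1, 2, 3
ON CONFLICT (day, customer_id, event_type)
DO UPDATE SET event_count = EXCLUDED.event_count;
"""

print(month_partition_sql("events", date(2023, 5, 1)))
```

Because the rollup's `WHERE` clause filters on the partition key, the planner can skip partitions outside the requested day, and customer-facing queries can read the small `events_daily` table instead of the raw events.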
Logging and event tracking are typically less user-facing, and serve the purpose of an audit log. This data represents a full record of all the logs and events that occur within the system. The characteristics of this data are:
- Ingest requirements similar to metrics / analytics, however
- Query patterns can be more complex, and
- Query responsiveness can be slower
- Budget requirements push this data toward cheaper durable storage
- A new use case for this data requires the flexibility to add a new index
Can Postgres work as a tool for X? Yes. Postgres is a great tool that is powerful and flexible. Separate workloads onto different clusters, then use the Postgres foreign data wrapper to reconnect them. This approach avoids premature optimization in many cases, and can be trusted to scale as you need.
May 1, 2023