Over the past few weeks we've had several customers ask how they should architect their analytics pipeline. Common questions are:
- Should we use some kind of data warehouse or time series database?
- Is Postgres suitable for that type of workload?
- What are the pitfalls that I should worry about before I get started?
Well, turns out that Postgres can work great for a metrics/analytics pipeline. In fact, we're using good ole plain Postgres to power our own metrics on our customer Postgres instances. If you’re looking to build a metrics or analytics function for your own customers, yes Postgres can work for you. It's pretty easy to get started with this, and then optimize as needed for scaling.
Let’s dig in at building our metrics system start to finish. In future posts we'll follow up with more advanced design using rollups and other special case tooling such as hyperloglog.
Use a separate Postgres database
First things first, you likely want a separate database from your primary application database. Postgres in general is very good at maintaining a cache of frequently accessed data. However, your basic system of record data is very different from huge log flow of events. Having a separate database reduces risk from your primary production system, so start with a fresh Postgres database. Now onto designing the database itself.
Designing your schema
The most basic format may look something like this:
CREATE TABLE metric ( id uuid, customer_id uuid, system_id uuid, metric_id uuid, occurred_at timestamptz, value numeric(22,2) );
Let's dissect each of these values:
- id - Unique primary key for the metric recording
- customer_id - This is essentially a tenant key. Could be customer/organization/tenant, always useful to have for separation
- system_id - This could be user/system, but is often the level you report your metrics against
- metric_id - A reference to the metrics table that has your list of metrics, could be hardcoded text values as well.
- occurred_at - The time of the metric recording
- value - Value at that time
Here is another variation that is more focused on web analytics:
CREATE TABLE event ( id bigint, user_id bigint, event_id integer, event_type varchar(100), event_timpestamp timestamptz, session_id bigint, event_data jsonb );
Indexing for analytics dashboards in Postgres
If you are building this to power an analytics dashboard within your own product, most of the queries will likely be targeting per user or per customer/tenant. The general indexing scheme you'll want to follow is a set of indexes on:
- Customer/tenant id
- System/user id
- Metric id
With these three indexes you won't have to do too much else on day one to start building actual dashboards for your customers. Depending on the amount of scale you're expecting, you may want to invest quickly in the next piece of low hanging fruit, which is Postgres partitioning.
Partitioning for analytics dashboards in Postgres
from day one only requires a few extra steps. First, you're going to declare
your table from above
event to be ready for partitioning by one
additional clause. Then we'll use
pg_partman to help manage our partitions for
CREATE TABLE metric ( id uuid, customer_id uuid, system_id uuid, metric_id uuid, occurred_at timestamptz, value numeric(22,2) ) PARTITION BY RANGE (occurred_at) ; SELECT partman.create_parent( 'public.metric', 'thetime', 'native', 'daily', p_start_partition := (now() - interval '10 days')::date::text );
Next, determine our retention. In our case we're retaining just over 30 days of metrics so we set it at 31 days or retention:
UPDATE partman.part_config SET retention = '30 days' WHERE parent_table = 'public.metric';
Our result in our database now looks similar to the below. Note we've used
metric_point as we have a small lookup table that defines the
public | metric_point | partitioned table | application public | metric_point_20230509 | table | application public | metric_point_20230510 | table | application public | metric_point_20230511 | table | application public | metric_point_20230512 | table | application public | metric_point_20230513 | table | application
Looking at our own metrics at Crunchy Data
Let's take a quick moment and summarize our own setup as it stands today and what is coming next.
Today, our metrics run in a Crunchy Bridge Standard-8 plan running in us-east on AWS with 500GB of storage, which comes in at $190 a month. Our schema is similar to the above:
- A lookup table for the types of metrics
- Daily partitions for the metrics
- Indexes on customer id, cluster id, metric id
- Current ingest of 50m events per day
- Connection to system of record via foreign data wrapper
Going forward our next steps are:
- Incremental roll-ups leveraging pg_cron
- Hyperloglog for composable unique aggregation
What about performance?
As we mention briefly above, we are ingesting 50 million events per day. We admit this isn't earth shattering, but for the sizing of the system and current optimizations (as mostly lack of heavy investment thus far) we're quite pleased. CPU load on the system is hanging around a low single digit percentage most of the time.
What about querying and the user experience?
- A query for a single day of metrics for a specific cluster returns in right around 1.5ms.
- Querying 10 days worth of metrics for a cluster returns in roughly 30-50ms.
- A full 30 days view of metrics for a specific customer is around 100ms.
For dash-boarding for our customers, all of the above is well within an acceptable range. With future optimizations at the ready to ensure both higher ingest as well as improvements to responsiveness of dashboards, we are resting easy with Postgres for many years to come.
Can Postgres handle analytics? Well it definitely does for us.
May 11, 2023 •More by this author