Query performance analytics

When applications store and retrieve data from databases, it is important to monitor which queries are performing well, and which are not. Because data is being added and changed, and applications evolve as needs change, this process should be frequent and constant.

pg_stat_statements is a PostgreSQL extension that tracks query statistics. It is extremely useful to find tuning opportunities in your database schema or queries.

Enabling

To use it, you first need to enable it (no need to do so on the PostgreSQL window here)

Your Postgres server needs to have shared_preload_libraries = 'pg_stat_statements' in your postgresql.conf configuration file
Connect to your database and run

CREATE EXTENSION pg_stat_statements;

Extended display

pg_stat_statements carries a lot of information. If you're running these in psql, PostgreSQL's standard command-line client, first turn on extended display of results.

\x on

The whole thing

Simplest query you can run, with a lot of information, from every database where the extension has been enabled:


SELECT * FROM pg_stat_statements;

Per database

When you're trying to identify bad-performing queries, you're often looking for queries that are run against a single database, so let's join against pg_database so we can filter. The following query gets only a few columns, shows the database name, the top 5 queries with highest execution time, for every database:


SELECT
  d.datname, s.total_exec_time, s.calls, s.rows,
  s.query
FROM pg_stat_statements s JOIN pg_database d ON (s.dbid = d.oid)
ORDER BY total_exec_time DESC
LIMIT 5;

Most time consuming queries

What if we need to find queries that are using a lot of time of the database?

The total_exec_time column tells us the total time spent executing the statement, in milliseconds, which is very precise. Let's round that to 2 decimal places.

In this query we ask for the total execution time, the number of calls and rows, calculate the average time and the "percentage cpu" usage for each query.


SELECT
  d.datname, round(s.total_exec_time::numeric, 2) AS total_exec_time, s.calls, s.rows,
  round(s.total_exec_time::numeric / calls, 2) AS avg_time,
  round((100 * s.total_exec_time / sum(s.total_exec_time::numeric) OVER ())::numeric, 2) AS percentage_cpu,
  substring(s.query, 1, 50) AS short_query
FROM pg_stat_statements s JOIN pg_database d ON (s.dbid = d.oid)
ORDER BY percentage_cpu DESC
LIMIT 5;

With a list of the worst-performing queries in hand, we can dig deeper into each of them by running them individually through EXPLAIN ANALYZE. That will show us the execution plan that Postgres executed for the query, so we analyze where the query could use improvements.

Average Query Execution Time

What's the average execution time of ALL queries in ALL databases?


SELECT (sum(total_exec_time) / sum(calls))::numeric(6,3) AS avg_execution_time
FROM pg_stat_statements;

Queries that write the most to shared_buffers

Let's find the top 5 queries that write the most to shared_buffers, the amount of memory PostgreSQL uses for shared memory. That means these queries invalidate a lot of shared memory blocks, and could perhaps use some improvements.


SELECT query, shared_blks_dirtied
FROM pg_stat_statements
WHERE shared_blks_dirtied > 0
ORDER BY 2 desc
LIMIT 5;

Tables that might be needing an index

pg_stat_statements can tell us about queries that are getting executed many times, and taking a long time, among other things.

If we combine that information with statistics from other tables, such as pg_stat_user_tables and pg_stat_user_indexes we can also find more information to help us improve our performance.

Often bad-performing queries happen because they're missing an index to help the database find the right rows to return. This query will tell us which tables might be missing an index


SELECT relname, seq_scan - idx_scan AS too_much_seq, 
  CASE WHEN seq_scan - idx_scan > 0 THEN 'Missing Index?' ELSE 'OK' END,
  pg_relation_size(relid) AS rel_size, seq_scan, idx_scan 
FROM pg_stat_user_tables
WHERE schemaname <> 'information_schema' AND schemaname NOT LIKE 'pg%'
ORDER BY too_much_seq DESC;

   relname   | too_much_seq |      case      | rel_size | seq_scan | idx_scan
-------------+--------------+----------------+----------+----------+----------
 dependents  |           41 | Missing Index? |   139264 |       41 |        0
 countries   |           -3 | OK             |     8192 |        1 |        4
 locations   |           -3 | OK             |     8192 |        1 |        4
 regions     |          -24 | OK             |     8192 |        1 |       25
 departments |         -927 | OK             |     8192 |       81 |     1008
 employees   |        -3469 | OK             |   106496 |      161 |     3630

The dependents table has had 41 more sequential scans than index scans, and may need an index, based on the queries that have been recently executed.

It is important to understand that indexes do have a cost, and slow down writes to a table because the database has to update the index whenever data is added or changed. So one should NOT just add indexes to all columns of a table.

Conclusion

pg_stat_statements is a powerful way to find way to find useful information about queries and their effect on database performance. Combined with EXPLAIN and data from another view,pg_stat_activity, which we'll cover in another tutorial, you can find important information to help you better manage your applications.

Read some more on this blog post. Full documentation is at https://www.postgresql.org/docs/current/pgstatstatements.html.

Loading terminal...