One PID to Lock Them All: Finding the Source of the Lock in Postgres
On the Customer Success Engineering team at Crunchy Bridge, we run across customers with lock issues on their Postgres database from time to time. Locks can have a cascading effect on queries. If one process is locking a table, then a query can be waiting on the process before it, and the process before that one. Major lock issues can quickly take down an entire production Postgres instance or application.
In this post let’s look at why locks happen, and more importantly how to get to the bottom of a lock issue and the one process blocking everything else. That one process that blocks them all! Once you find the source of the lock, I’ll give you the options for terminating the process that created all your troubles in the first place.
Finding the source of the lock
Often you won’t immediately know that you have a lock issue. If something is off, queries aren’t returning, or your application is slow, finding statements blocked by locks is a great place to start.
1. Find processes that are waiting
Take a look at the pg_stat_activity view for processes that are active but have a wait_event or wait_event_type that is non-NULL:
SELECT
pid,
datname,
usename,
application_name,
client_addr,
client_port,
to_char (now (), 'YYYY-MM-DD HH24:MI:SS') as now,
to_char (now () - xact_start, 'DD HH24:MI:SS MS') as xact_time,
to_char (now () - query_start, 'DD HH24:MI:SS MS') as query_time,
state,
to_char (now () - state_change, 'DD HH24:MI:SS MS') as state_time,
wait_event,
wait_event_type,
left (query, 40)
FROM
pg_stat_activity
WHERE
state != 'idle'
and pid != pg_backend_pid ()
ORDER BY
query_time desc;
If a connection is active and waiting on a lock, then the wait_event and wait_event_type columns will be non-NULL. If that's the case (and it stays that way after a couple of runs of the query, to make sure you didn't just catch a short lock wait), record the affected PID. Here is a very simple example where I ran an update in a transaction, then in a different session added a column to the same table. The ALTER TABLE in this case will not proceed until the transaction from the first session has been committed or rolled back. Here are the results. Note PID 295998, which is "active" but has wait_event = relation and wait_event_type = Lock:
pid | datname | usename | application_name | client_addr | client_port | now | xact_time | query_time | state | state_time | wait_event | wait_event_type | left
--------+----------+----------+------------------+-----------------+-------------+---------------------+-----------------+-----------------+---------------------+-----------------+------------+-----------------+------------------------------------------
295995 | postgres | postgres | psql | 149.42.105.253 | 49327 | 2023-11-09 20:41:10 | 00 00:02:11 535 | 00 00:02:01 755 | idle in transaction | 00 00:02:01 755 | ClientRead | Client | RELEASE pg_psql_temporary_savepoint
295998 | postgres | postgres | psql | 149.42.105.253 | 49344 | 2023-11-09 20:41:10 | 00 00:01:55 550 | 00 00:01:01 138 | active | 00 00:01:01 138 | relation | Lock | alter table sampledata add column data02
(2 rows)
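If you'd like to reproduce this scenario yourself, here's a rough sketch assuming the sampledata table used throughout this post (the column names are just placeholders):
-- Session 1: open a transaction, update a row, and leave the transaction open
BEGIN;
UPDATE sampledata SET data = 'abc' WHERE id = 101;
-- (no COMMIT or ROLLBACK yet)

-- Session 2: try to add a column; this will wait on session 1's lock
ALTER TABLE sampledata ADD COLUMN data02 integer;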
2. Find which PID is locking the table
Now we know that PID 295998 is waiting on a lock on a relation (a table), but we don't know which process currently holds the lock it is waiting for. To find out, we start by querying pg_locks using the PID of the waiting process:
SELECT
*
FROM
pg_locks
WHERE
pid = 295998
AND granted IS NOT true;
Here’s the result of that query:
locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted | fastpath | waitstart
----------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+--------+---------------------+---------+----------+------------------------------
relation | 5 | 16501 | | | | | | | | 6/6743 | 295998 | AccessExclusiveLock | f | f | 2023-11-09 20:40:08.98843+00
(1 row)
The locktype column shows which of the other columns describe what Postgres is waiting on. In this example, locktype is relation, so we look to the relation column to see the OID of the relation (16501) where the blocking process has an active lock.
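As a side note, if you'd rather see the table name than the OID, you can cast the relation column to regclass right in the query, for example:
SELECT pid, locktype, relation::regclass AS relname, mode, granted
FROM pg_locks
WHERE pid = 295998
  AND granted IS NOT true;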
3. Find the process with the existing lock
Now that we know which object is locked, we can once again query pg_locks using the relation OID to see what is holding the current lock(s):
SELECT
*
FROM
pg_locks
WHERE
relation = 16501
AND granted IS true;
Here is the result:
locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted | fastpath | waitstart
----------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+--------+------------------+---------+----------+-----------
relation | 5 | 16501 | | | | | | | | 3/243227 | 295995 | RowExclusiveLock | t | f |
(1 row)
This shows that PID 295995 is the process holding the lock.
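On Postgres 9.6 and newer there's also a shortcut: the pg_blocking_pids() function (which the rollup query later in this post relies on) returns the blocking PIDs directly, without the manual pg_locks lookup:
-- returns the array of PIDs blocking the waiting process, {295995} in this example
SELECT pg_blocking_pids(295998);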
4. Find what that blocking process is doing
Now that we know which process has been granted the lock, we can go back to pg_stat_activity to see what that PID is doing:
SELECT
pid,
state,
wait_event,
wait_event_type,
left (query, 40)
FROM
pg_stat_activity
WHERE
pid = 295995;
Here is the result:
pid | state | wait_event | wait_event_type | left
--------+---------------------+------------+-----------------+-------------------------------------
295995 | idle in transaction | ClientRead | Client | RELEASE pg_psql_temporary_savepoint
The last column shows the last statement executed by that session, which in this case was the savepoint release after an update; in most cases it will show the statement from the still-open transaction.
One lock to rule them all
The above statements are pretty straightforward once you know what you are looking for, but they can also be combined into a single statement for a general blocking / blocked query. The Postgres wiki has some good combined versions.
Often you might find that the blocked statement is itself blocked by another (and another, and another still…). In those cases, it is still possible to trace all the way up to the One PID that blocks all the rest, but that can be an arduous, unexpected journey. For those cases, a colleague here at Crunchy Data, Brian Pace, wrote a query that shows locks waiting on other locks, rolling up to the PID holding the initial lock:
WITH sos AS (
SELECT array_cat(array_agg(pid),
array_agg((pg_blocking_pids(pid))[array_length(pg_blocking_pids(pid),1)])) pids
FROM pg_locks
WHERE NOT granted
)
SELECT a.pid, a.usename, a.datname, a.state,
a.wait_event_type || ': ' || a.wait_event AS wait_event,
current_timestamp-a.state_change time_in_state,
current_timestamp-a.xact_start time_in_xact,
l.relation::regclass relname,
l.locktype, l.mode, l.page, l.tuple,
pg_blocking_pids(l.pid) blocking_pids,
(pg_blocking_pids(l.pid))[array_length(pg_blocking_pids(l.pid),1)] last_session,
coalesce((pg_blocking_pids(l.pid))[1]||'.'||coalesce(case when locktype='transactionid' then 1 else array_length(pg_blocking_pids(l.pid),1)+1 end,0),a.pid||'.0') lock_depth,
a.query
FROM pg_stat_activity a
JOIN sos s on (a.pid = any(s.pids))
LEFT OUTER JOIN pg_locks l on (a.pid = l.pid and not l.granted)
ORDER BY lock_depth;
Example output from that statement:
pid | usename | datname | state | wait_event | time_in_state | time_in_xact | relname | locktype | mode | page | tuple | blocking_pids | last_session | lock_depth | query
--------+-------------+----------+---------------------+---------------------+-----------------+-----------------+------------+---------------+---------------------+------+-------+------------------------+--------------+------------+----------------------------------------------------
879401 | application | postgres | idle in transaction | Client: ClientRead | 00:29:53.512147 | 00:30:01.31748 | | | | | | | | 879401.0 | select * from sampledata where id=101 for update;
880275 | application | postgres | active | Lock: transactionid | 00:01:00.342763 | 00:01:00.459375 | | transactionid | ShareLock | | | {879401} | 879401 | 879401.1 | update sampledata set data = 'abc' where id = 101;
880204 | application | postgres | active | Lock: relation | 00:00:29.722705 | 00:00:29.722707 | sampledata | relation | AccessExclusiveLock | | | {879401,880275,879488} | 879488 | 879401.4 | alter table sampledata add column data03 integer;
880187 | application | postgres | active | Lock: relation | 00:00:03.580716 | 00:00:03.580718 | sampledata | relation | RowExclusiveLock | | | {880204} | 880204 | 880204.2 | update sampledata set data = 'abc' where id = 103;
879527 | application | postgres | active | Lock: relation | 00:00:14.974433 | 00:28:32.80346 | sampledata | relation | RowExclusiveLock | | | {880204} | 880204 | 880204.2 | update sampledata set data = 'abc' where id = 102;
879488 | application | postgres | active | Lock: tuple | 00:00:41.35361 | 00:00:41.47118 | sampledata | tuple | ExclusiveLock | 2 | 21 | {880275} | 880275 | 880275.2 | update sampledata set data = 'def' where id = 101;
(6 rows)
In this manufactured example we have:
- 879401 - the "idle in transaction" PID. This is a SELECT ... FOR UPDATE within a transaction. Its blocking_pids field is blank because it's not blocked by any other process. This is the process in this example that is blocking everything else.
- 880275 - Attempting to update the same id=101. It's blocked until the FOR UPDATE is completed.
- 879488 - Again attempting to update the same id=101. It can't execute until the process blocking it completes. It's waiting on 880275 since it came in afterwards. If 880275 is canceled, it will just roll up to the next blocker, 879401.
- 880204 - Here we added an ALTER TABLE. Since it takes an ACCESS EXCLUSIVE lock, its blocking_pids shows all three of the prior statements; it won't be able to execute until each of those is out of the way.
- 879527 - Blocked by the ALTER TABLE: even though the ALTER TABLE hasn't been granted its ACCESS EXCLUSIVE lock yet, later lock requests queue up behind it. Note that it's still blocked, even though it's updating a different row (id=102).
- 880187 - Also blocked by the ALTER TABLE. It's at the same lock_depth as 879527 since they are both blocked by the same thing, but not by each other.
Ending the process holding the lock
Ok, now we’ve found the PID at the top of the tree, that one lock holding the key to the rest of our locks. Fortunately, as Postgres wizards, we do possess the craft to unmake the lock.
Commit
If the statement is showing as idle in transaction, it is possible that you have an uncommitted transaction open that started with a BEGIN statement. In that case you can commit it with:
COMMIT;
Rollback
You may have performed some unintended updates, or run into an error. In that case you can abort the transaction and rollback any changes already made with:
ROLLBACK;
Cancel the PID
If this wasn't a transaction you initiated, in most cases you can cancel the running query with:
SELECT pg_cancel_backend(PID);
Terminate the backend connection and process
If the cancel statement above doesn’t work (for example, because the blocking session is idle in transaction, so there is no running query to cancel), you can cast the lock back into the fiery chasm from whence it came by terminating the backend. This will end the process and its associated database connection.
SELECT pg_terminate_backend(PID);
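If a whole chain of sessions is involved and you simply want to clear everything blocking a given PID, here is a sketch that combines pg_blocking_pids() with pg_terminate_backend(). Use it with care: it terminates every session in the blocking set, so check what those sessions are doing first.
-- terminate every backend currently blocking PID 295998
SELECT pg_terminate_backend(blocker)
FROM unnest(pg_blocking_pids(295998)) AS t(blocker);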
Why did Postgres lock?
Postgres’ multi-version concurrency control (MVCC) system is incredibly advanced and by and large lets you query, update, and insert rows without locking tables. There are two main kinds of locks:
- Shared locks - the resource can be accessed by more than one backend/session at the same time
- Exclusive locks - the resource can only be accessed by a single backend/session at a time
The lock type that generally gets us into trouble and blocks other queries and processes is the exclusive lock. If you want an overview, see David’s post, Postgres Locking: When Is It Concerning? There are probably hundreds of ways to put an exclusive lock on a table, but these are the most common ones we see with our customers.
Alter Table
By far the most common event I see take an exclusive and detrimental lock is an ALTER TABLE command, which can be issued to the database directly or, in some cases, via the application’s ORM while running migrations. The ALTER TABLE itself takes an ACCESS EXCLUSIVE lock (see the ALTER TABLE docs), which blocks pretty much every other process on that table.
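One way to make migrations safer, assuming you can retry them, is to set a lock_timeout in the migration session so the ALTER TABLE fails fast instead of queuing up behind a long-running transaction (the column name here is just a placeholder):
SET lock_timeout = '5s';
-- errors out instead of waiting if the lock can't be acquired within 5 seconds
ALTER TABLE sampledata ADD COLUMN data04 integer;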
ORM framework
ORM frameworks can hide circular dependencies that produce deadlocks. An error on the application side, where an operation fails partway through but the surrounding transaction is left open, can leave locks held and result in later transactions taking a long time to complete.
Create index
Creating an index takes a lock that blocks writes to the table for the duration of the build unless you use CREATE INDEX CONCURRENTLY.
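For example, on a busy table you would generally reach for something like this instead of a plain CREATE INDEX (the index name here is a placeholder):
-- builds the index without blocking writes; note it cannot run inside a transaction block
CREATE INDEX CONCURRENTLY idx_sampledata_data ON sampledata (data);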
Vacuum
VACUUM FULL will take out an ACCESS EXCLUSIVE lock against a table, so it should be used only in rare cases.
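In most cases a plain VACUUM (optionally with ANALYZE) is what you want instead; it does not take an ACCESS EXCLUSIVE lock and can run alongside normal traffic:
-- reclaims space for reuse and updates planner statistics without an exclusive lock
VACUUM (ANALYZE) sampledata;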
Other
The Postgres documentation has a table showing the different lock modes, how they might block each other, and some examples of statement types that result in those locks.
Getting proactive about locks
Let’s look at a few tips for managing locking in the future.
Logging lock_waits
You can log any time your query is waiting on a lock by turning on log_lock_waits
. Lock_waits in your logs can be a good indicator that processes are being contentious. There is virtually no overhead on enabling this and it’s very safe for production databases. This is set to “on” by default on Crunchy Bridge clusters:
log_lock_waits = on
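If you manage Postgres yourself, one way to turn it on without editing postgresql.conf by hand is:
ALTER SYSTEM SET log_lock_waits = on;
SELECT pg_reload_conf();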
Set a lock timeout
We generally recommend clients set a lock_timeout so that a statement that cannot acquire a lock within that time is canceled instead of sitting in the lock queue indefinitely. This helps prevent other processes from getting caught up behind it in a chain.
ALTER SYSTEM SET lock_timeout = '10s';
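The ALTER SYSTEM form above sets the default for the whole cluster; you can also scope the setting to a single session or to a specific role (app_user here is just a placeholder), which is often a better fit for migration sessions:
-- current session only
SET lock_timeout = '10s';
-- every new session for a given role
ALTER ROLE app_user SET lock_timeout = '10s';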
Summary
- Find processes waiting on locks in pg_stat_activity by looking for processes that are active but have a wait_event or wait_event_type that is non-NULL.
- Use this query to find the source of the lock (seriously, save this query somewhere, you might need it someday).
WITH sos AS (
SELECT array_cat(array_agg(pid),
array_agg((pg_blocking_pids(pid))[array_length(pg_blocking_pids(pid),1)])) pids
FROM pg_locks
WHERE NOT granted
)
SELECT a.pid, a.usename, a.datname, a.state,
a.wait_event_type || ': ' || a.wait_event AS wait_event,
current_timestamp-a.state_change time_in_state,
current_timestamp-a.xact_start time_in_xact,
l.relation::regclass relname,
l.locktype, l.mode, l.page, l.tuple,
pg_blocking_pids(l.pid) blocking_pids,
(pg_blocking_pids(l.pid))[array_length(pg_blocking_pids(l.pid),1)] last_session,
coalesce((pg_blocking_pids(l.pid))[1]||'.'||coalesce(case when locktype='transactionid' then 1 else array_length(pg_blocking_pids(l.pid),1)+1 end,0),a.pid||'.0') lock_depth,
a.query
FROM pg_stat_activity a
JOIN sos s on (a.pid = any(s.pids))
LEFT OUTER JOIN pg_locks l on (a.pid = l.pid and not l.granted)
ORDER BY lock_depth;
- End the lock by canceling the PID or issuing a COMMIT or ROLLBACK of the process that’s holding the lock and blocking the other processes.
- Be careful with ALTER TABLE commands, CREATE INDEX without the CONCURRENTLY part, or runaway processes from your ORM that may be holding exclusive locks and blocking general database processing.
- It can be a good idea to set a lock_timeout, and it is generally a good idea to log lock waits if you’re doing proactive logging to keep track of ongoing problems.
Thanks to my colleague Brian Pace for the great cascading locks query.