Simulating UPDATE or DELETE with LIMIT in Postgres: CTEs to the Rescue!
There have certainly been times when using PostgreSQL that I've yearned for an UPDATE or DELETE statement with a LIMIT clause. While the SQL standard itself has nothing to say on the matter as of SQL:2016, there are existing SQL dialects that do support this.
Sadly, if you try to do something like this in PostgreSQL, this is the result:
ERROR: syntax error at or near "LIMIT"
LINE 1: DELETE FROM big_table LIMIT 10000;
^
Use Cases
Before we dig into the nitty-gritty, let's look at some use cases for such a feature. The primary desire for this behavior is to break large transactions up into smaller ones, for multiple reasons:
Batch UPDATE and DELETE queries
These are intended to run to completion, but for performance or other reasons we want to issue the action repeatedly in chunks rather than in one large transaction. This can minimize lock holding and contention, prevent unneeded table bloat, or otherwise limit the impact on overall system resources.
Better disk space usage, both table data and WAL
If you UPDATE a single large table in-place in one transaction, the database has to keep all of the old row versions around in case of transaction rollback or failure. As such, you're effectively doubling the on-disk footprint of the changes. Chunking the work into multiple UPDATE statements with LIMIT, interspersed with VACUUM, allows the database to reuse existing free space instead.
For some database setups it can also make sense to do an operation like this in chunks because of the WAL overhead of making all of these changes; spacing out the operations can be useful to distribute the load on backup systems, replicas, etc.
Affect some arbitrary subset of rows
You may want to handle some operations (particularly DELETE) on a subset of the total rows, without much regard for which specific rows are affected. For instance, you may have a table you want to pull some arbitrary number of rows out of and use in other parts of a query, something like the fictional syntax DELETE FROM foo ORDER BY random() LIMIT 10 RETURNING *.
Migrating rows to new partitions without affecting the entirety of the table
In particular, when moving existing table data into partitions you will want to break the operation into chunks. As you move data to the new partitions, queries continue to return the same results; however, if you naïvely issued a single INSERT and DELETE over all of the rows in the table, you would block access to that data for the duration of the transaction moving it, and you would face the same space constraints from keeping the old row versions around in the original table until the operation and a subsequent VACUUM complete. pg_partman uses this chunked approach for populating partitioned table data.
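As a rough sketch of one such chunk (using the CTE pattern developed later in this post; old_table and new_partition are hypothetical names), the deleted rows can be fed straight into the new table via RETURNING, and the statement repeated until it affects zero rows:
WITH batch AS (
    -- pick one chunk of rows to move
    SELECT id FROM old_table LIMIT 1000
), moved AS (
    DELETE FROM old_table
    WHERE id IN (SELECT id FROM batch)
    RETURNING *
)
INSERT INTO new_partition SELECT * FROM moved;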
A word of warning
Operating on rows chosen by poorly-defined criteria, particularly for a destructive operation, is something to approach with deliberate care. In general, an ORDER BY clause is strongly encouraged for any data-modifying query using LIMIT, so that the affected rows are well-defined.
Let's try it
So let's try a basic example here:
CREATE TABLE big_table (id INT PRIMARY KEY, data TEXT, process_time TIMESTAMPTZ);
INSERT INTO big_table (id, data) SELECT v, 'Foo'||v FROM generate_series(1,100000) v;
Table "public.big_table"
Column | Type | Collation | Nullable | Default
--------------+--------------------------+-----------+----------+---------
id | integer | | not null |
data | text | | |
process_time | timestamp with time zone | | |
Indexes:
"big_table_pkey" PRIMARY KEY, btree (id)
Verify some data exists:
SELECT * FROM big_table ORDER BY id LIMIT 10;
 id | data  | process_time
----+-------+--------------
  1 | Foo1  |
  2 | Foo2  |
  3 | Foo3  |
  4 | Foo4  |
  5 | Foo5  |
  6 | Foo6  |
  7 | Foo7  |
  8 | Foo8  |
  9 | Foo9  |
 10 | Foo10 |
CTEs to the Rescue!
Since SELECT certainly is able to provide the sort of limiting we want, the answer here is clearly to turn the DELETE or UPDATE statement in question into a SELECT, using one of my personal favorite tools: the Common Table Expression (CTE).
Using this approach, we can structure our DELETE or UPDATE query to first use a SELECT to define the affected rows, then perform the underlying operation on that set. The general recipe looks something like this:
WITH rows AS (
SELECT
something
FROM
big_table
LIMIT 10
)
DELETE FROM big_table
WHERE something IN (SELECT something FROM rows)
;
So what is the something being referred to here? If the table has a primary key, that is the most straightforward column to use, and doing so produces a plan like the following:
EXPLAIN WITH rows AS (
SELECT
id
FROM
big_table
LIMIT 10
)
DELETE FROM big_table
WHERE id IN (SELECT id FROM rows)
;
Here is the resulting query plan:
Delete on big_table (cost=0.60..83.49 rows=0 width=0)
-> Nested Loop (cost=0.60..83.49 rows=10 width=34)
-> HashAggregate (cost=0.31..0.41 rows=10 width=32)
Group Key: rows.id
-> Subquery Scan on rows (cost=0.00..0.29 rows=10 width=32)
-> Limit (cost=0.00..0.19 rows=10 width=4)
-> Seq Scan on big_table big_table_1 (cost=0.00..1152.33 rows=61133 width=4)
-> Index Scan using big_table_pkey on big_table (cost=0.29..8.31 rows=1 width=10)
Index Cond: (id = rows.id)
The same thing, but using an EXISTS condition instead of the IN (...) construct:
EXPLAIN WITH rows AS (
SELECT
id
FROM
big_table
LIMIT 10
)
DELETE FROM big_table
WHERE EXISTS (SELECT * FROM rows WHERE rows.id = big_table.id)
;
Delete on big_table (cost=0.60..83.49 rows=0 width=0)
-> Nested Loop (cost=0.60..83.49 rows=10 width=34)
-> HashAggregate (cost=0.31..0.41 rows=10 width=32)
Group Key: rows.id
-> Subquery Scan on rows (cost=0.00..0.29 rows=10 width=32)
-> Limit (cost=0.00..0.19 rows=10 width=4)
-> Seq Scan on big_table big_table_1 (cost=0.00..1152.33 rows=61133 width=4)
-> Index Scan using big_table_pkey on big_table (cost=0.29..8.31 rows=1 width=10)
Index Cond: (id = rows.id)
And with an ORDER BY clause included in the CTE:
EXPLAIN ANALYZE WITH rows AS (
SELECT
id
FROM
big_table
ORDER BY id
LIMIT 10
)
DELETE FROM big_table
USING rows WHERE big_table.id = rows.id
;
SELECT COUNT(*) FROM big_table;
Delete on big_table (cost=0.58..84.15 rows=0 width=0) (actual time=0.209..0.212 rows=0 loops=1)
-> Nested Loop (cost=0.58..84.15 rows=10 width=34) (actual time=0.091..0.152 rows=10 loops=1)
-> Subquery Scan on rows (cost=0.29..1.07 rows=10 width=32) (actual time=0.072..0.087 rows=10 loops=1)
-> Limit (cost=0.29..0.97 rows=10 width=4) (actual time=0.049..0.057 rows=10 loops=1)
-> Index Only Scan using big_table_pkey on big_table big_table_1 (cost=0.29..4185.28 rows=61133 width=4) (actual time=0.047..0.053 rows=10 loops=1)
Heap Fetches: 10
-> Index Scan using big_table_pkey on big_table (cost=0.29..8.31 rows=1 width=10) (actual time=0.005..0.005 rows=1 loops=10)
Index Cond: (id = rows.id)
Planning Time: 0.835 ms
Execution Time: 0.338 ms
 count
-------
 99990
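Tying this back to the batch use case from earlier: to process an entire table in chunks, simply repeat the statement until it affects no rows. Here is a minimal sketch assuming PostgreSQL 11 or later (for COMMIT inside a DO block); on older versions, or if you want to intersperse VACUUM between chunks, drive the loop from the client instead:
DO $$
DECLARE
    batch_count bigint;
BEGIN
    LOOP
        -- delete one chunk using the CTE recipe from above
        WITH rows AS (
            SELECT id FROM big_table ORDER BY id LIMIT 10000
        )
        DELETE FROM big_table
        USING rows WHERE big_table.id = rows.id;

        GET DIAGNOSTICS batch_count = ROW_COUNT;
        EXIT WHEN batch_count = 0;

        -- keep each transaction small (PostgreSQL 11+, not inside an outer transaction block)
        COMMIT;
    END LOOP;
END
$$;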
UPDATE statements
Let's take a look at how this approach works with UPDATE statements:
EXPLAIN ANALYZE WITH rows AS (
SELECT
id
FROM
big_table
ORDER BY id
LIMIT 10
)
UPDATE big_table
SET process_time = now()
WHERE EXISTS (SELECT * FROM rows WHERE big_table.id = rows.id)
;
Update on big_table (cost=1.39..84.30 rows=0 width=0) (actual time=0.314..0.317 rows=0 loops=1)
-> Nested Loop (cost=1.39..84.30 rows=10 width=42) (actual time=0.098..0.148 rows=10 loops=1)
-> HashAggregate (cost=1.10..1.20 rows=10 width=32) (actual time=0.079..0.085 rows=10 loops=1)
Group Key: rows.id
Batches: 1 Memory Usage: 24kB
-> Subquery Scan on rows (cost=0.29..1.07 rows=10 width=32) (actual time=0.055..0.066 rows=10 loops=1)
-> Limit (cost=0.29..0.97 rows=10 width=4) (actual time=0.036..0.042 rows=10 loops=1)
-> Index Only Scan using big_table_pkey on big_table big_table_1 (cost=0.29..4185.28 rows=61133 width=4) (actual time=0.034..0.039 rows=10 loops=1)
Heap Fetches: 20
-> Index Scan using big_table_pkey on big_table (cost=0.29..8.31 rows=1 width=10) (actual time=0.005..0.005 rows=1 loops=10)
Index Cond: (id = rows.id)
Planning Time: 0.773 ms
Execution Time: 0.461 ms
SELECT COUNT(*) FROM big_table WHERE process_time IS NOT NULL;
 count
-------
    10
Including randomness in the selection
Okay, so that's great if you want to use a table that has an index with a specified order, but what if we want to choose arbitrary records instead?
The simple solution in this case (at some performance penalty) is adding an ORDER BY random() to the SELECT inside the CTE. We also minimize the number of rows being considered by using the TABLESAMPLE clause to randomly narrow things down to roughly 1% of the table before sorting. (See TABLESAMPLE on the SELECT documentation page for more explanation and detail.)
EXPLAIN ANALYZE WITH rows AS (
SELECT
id
FROM
big_table TABLESAMPLE bernoulli(1)
ORDER BY RANDOM()
LIMIT 10
)
UPDATE big_table
SET process_time = now()
WHERE EXISTS (SELECT * FROM rows WHERE big_table.id = rows.id)
;
Update on big_table (cost=562.38..645.29 rows=0 width=0) (actual time=6.002..6.004 rows=0 loops=1)
CTE rows
-> Limit (cost=561.84..561.87 rows=10 width=12) (actual time=5.637..5.639 rows=10 loops=1)
-> Sort (cost=561.84..563.37 rows=611 width=12) (actual time=5.635..5.636 rows=10 loops=1)
Sort Key: (random())
Sort Method: top-N heapsort Memory: 26kB
-> Sample Scan on big_table big_table_1 (cost=0.00..548.64 rows=611 width=12) (actual time=0.064..5.346 rows=973 loops=1)
Sampling: bernoulli ('1'::real)
-> Nested Loop (cost=0.52..83.43 rows=10 width=42) (actual time=5.700..5.781 rows=10 loops=1)
-> HashAggregate (cost=0.23..0.33 rows=10 width=32) (actual time=5.677..5.684 rows=10 loops=1)
Group Key: rows.id
Batches: 1 Memory Usage: 24kB
-> CTE Scan on rows (cost=0.00..0.20 rows=10 width=32) (actual time=5.656..5.664 rows=10 loops=1)
-> Index Scan using big_table_pkey on big_table (cost=0.29..8.31 rows=1 width=10) (actual time=0.008..0.008 rows=1 loops=10)
Index Cond: (id = rows.id)
Planning Time: 1.074 ms
Execution Time: 6.176 ms
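This also gives us a working stand-in for the fictional DELETE FROM foo ORDER BY random() LIMIT 10 RETURNING * mentioned in the use cases above. A sketch along the same lines (the sample percentage and RETURNING list are arbitrary choices here):
WITH rows AS (
    SELECT id
    FROM big_table TABLESAMPLE bernoulli(1)
    ORDER BY random()
    LIMIT 10
)
DELETE FROM big_table
USING rows WHERE big_table.id = rows.id
RETURNING big_table.*;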
What to do with no PK?
So far this approach depends on the table having a primary key; however, you might need to handle a table that doesn't have one. What to do?
Well, the purpose of the primary key here is simply to uniquely identify the rows/tuples we want to target. And as you may or may not know, PostgreSQL has a hidden system column on every table called ctid, which identifies the physical location of the tuple on disk. While there is no actual index on it, the value itself pinpoints the exact table block and tuple offset of the row, so PostgreSQL can fetch it directly with a TID scan; for our purposes it functions much the same as a primary key would.
Let's test:
CREATE TABLE another_table (id INT, data TEXT, process_time TIMESTAMPTZ);
INSERT INTO another_table (id, data) SELECT v, 'Foo'||v FROM generate_series(1,100000) v;
Note that this is the exact same table definition as before, except that we do not create a primary key, and hence there is no index on the table:
Table "public.another_table"
Column | Type | Collation | Nullable | Default
--------------+--------------------------+-----------+----------+---------
id | integer | | |
data | text | | |
process_time | timestamp with time zone | | |
Let's first try the query we used previously, before switching over to ctid:
EXPLAIN WITH rows AS (
SELECT
id
FROM
another_table
ORDER BY id
LIMIT 10
)
UPDATE another_table
SET process_time = now()
WHERE EXISTS (SELECT * FROM rows WHERE another_table.id = rows.id)
;
Update on another_table (cost=2473.64..3828.10 rows=0 width=0)
-> Hash Semi Join (cost=2473.64..3828.10 rows=3057 width=42)
Hash Cond: (another_table.id = rows.id)
-> Seq Scan on another_table (cost=0.00..1152.33 rows=61133 width=10)
-> Hash (cost=2473.52..2473.52 rows=10 width=32)
-> Subquery Scan on rows (cost=2473.39..2473.52 rows=10 width=32)
-> Limit (cost=2473.39..2473.42 rows=10 width=4)
-> Sort (cost=2473.39..2626.22 rows=61133 width=4)
Sort Key: another_table_1.id
-> Seq Scan on another_table another_table_1 (cost=0.00..1152.33 rows=61133 width=4)
That query plan is, well, yuck (as would be expected for a table without an index or primary key). Let's see it with the ctid utilized:
EXPLAIN WITH rows AS (
SELECT
ctid
FROM
another_table
ORDER BY ctid
LIMIT 10
)
UPDATE another_table
SET process_time = now()
FROM rows
WHERE another_table.ctid = rows.ctid
;
Update on another_table (cost=2473.39..2513.77 rows=0 width=0)
-> Nested Loop (cost=2473.39..2513.77 rows=10 width=44)
-> Subquery Scan on rows (cost=2473.39..2473.52 rows=10 width=36)
-> Limit (cost=2473.39..2473.42 rows=10 width=6)
-> Sort (cost=2473.39..2626.22 rows=61133 width=6)
Sort Key: another_table_1.ctid
-> Seq Scan on another_table another_table_1 (cost=0.00..1152.33 rows=61133 width=6)
-> Tid Scan on another_table (cost=0.00..4.01 rows=1 width=6)
TID Cond: (ctid = rows.ctid)
Note that this particular example really only works when we don't care about the order of the rows being selected, since ordering by ctid still means sorting on an unindexed field (note the Sort node in the plan above). Moral of the story: if you need a particular order, sort on an indexed field; otherwise, skip the ORDER BY entirely.
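For instance, when any arbitrary chunk will do, a sketch of the leaner variant against the same another_table simply drops the ORDER BY so that no sort is needed at all (which rows land in each chunk is then unspecified):
WITH rows AS (
    SELECT ctid
    FROM another_table
    LIMIT 10
)
DELETE FROM another_table
USING rows WHERE another_table.ctid = rows.ctid;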
Summary
Unless and until PostgreSQL supports UPDATE and DELETE with LIMIT, the most appropriate workaround in pure SQL is the CTE approach.