Postgres Troubleshooting - DiskFull ERROR could not resize shared memory segment
There’s a couple super common Postgres errors you’re likely to encounter while using this database, especially with an application or ORM. One is the PG::DiskFull: ERROR: could not resize shared memory segment. It will look something like this.
"PG::DiskFull: ERROR: could not resize shared memory segment "/PostgreSQL.938232807" to 55334241 bytes: No space left on device"
Don’t panic
We see a good amount of support tickets from customers on this topic. If you see this error pass by in your logs. Don’t worry. Seriously. There’s no immediate reason to panic from a single one of these errors.
If you’re seeing them regularly or all the time, or your curious about how these are generated, let’s continue through some troubleshooting.
You aren’t really out of disk
In this case when it's stating "no space left on device" it's not talking about the entire disk, but rather the shared memory device at that exact moment. Segments are created there when a thread is allocating shared buffers for things like hashes, sorts, etc. Parallel workers will also allocate shared buffers. When there are not sufficient shared buffers remaining, the statement terminates with that sort of error.
The ‘disk full’ part of this error message is a bit of a red herring. This is an error that you'll see when your Postgres instance fails to allocate more memory in support of a query. It is not a real disk full message. Sometimes this happens when modest memory consuming queries that execute very slowly will end up tipping you past the available memory. Other times a huge memory-intensive query comes and takes a huge chunk of memory to cause this issue.
Why don’t these spill out to temp, like normally large queries? Well you probably just went over the total memory allocation. Work_mem is allocated for each query node that needs it, rather than once per query or session, meaning that a session can potentially consume many multiples of work_mem. For example, if max_parallel_workers is 8 and work_mem is 384MB, it's possible to use up to 3,072MB of shared buffers even with a single parallel hash join. If your query plan has 5 query nodes that would also allocate work_mem (ie. sorts / hash operations), and four parallel workers, you could be using (384MB x 5 query nodes x 4 workers) = 7.6GB of shared buffers. If you have 7.7 GB available, that’s not going to work.
To the logs we go
To see what’s going on with these errors, let’s get into the logs and see how often we’re seeing these. Search your logs for the resize memory issues.
$ grep -iR "could not resize shared memory" * | sed 's/.log.*//' | uniq -c
1597 postgresql-Fri
587 postgresql-Mon
325 postgresql-Sat
1223 postgresql-Sun
1395 postgresql-Thu
You can also look for the specific process ID mentioned in the OOM error. For
this one, its 5883275
.
Aug 08 16:34:31 4qd4kp2ot5bwlmdnp7566v4owy postgres[5883275]: [36-1] [5883275][client backend][17/20137143][0] [user=application,db=postgres,app=/rails/bin/rails] ERROR: could not resize shared memory segment "/PostgreSQL.2449246800" to 33554432 bytes: No space left on device
To track the error back to the origin, search your logs for that process id. You might see very long queries broken up into smaller sequence numbers like 42-1, 42-2, and 42-3 in this example
Aug 08 16:34:31 4qd4kp2ot5bwlmdnp7566v4owy postgres[5883275]: [42-1] [5883275][client backend][17/20137143][0] [user=application,db=postgres,app=/rails/bin/rails] ERROR: could not resize shared memory segment "/PostgreSQL.2551246800" to 5883275 bytes: No space left on device
Aug 08 16:34:31 4qd4kp2ot5bwlmdnp7566v4owy postgres[5883275]: [42-2] [5883275][client backend][17/20137143][0] [user=application,db=postgres,app=/rails/bin/rails] STATEMENT: SELECT COUNT(*)
FROM trucks t
JOIN truck_locations tl ON t.truck_id = tl.truck_id
JOIN jobs j ON tl.location_id = j.location_id
JOIN job_hiring_locations_trucks_join jhltj ON j.job_id = jhltj.job_id AND t.truck_id = jhltj.truck_id
JOIN drivers d ON j.driver_id = d.driver_id
JOIN driver_certifications dc ON d.driver_id = dc.driver_id
JOIN certifications c ON dc.certification_id = c.certification_id
Aug 08 16:34:31 4qd4kp2ot5bwlmdnp7566v4owy postgres[5883275]: [42-3] "JOIN maintenance_records mr ON t.truck_id = mr.truck_id
JOIN maintenance_types mt ON mr.maintenance_type_id = mt.maintenance_type_id
JOIN job_status js ON j.status_id = js.status_id
JOIN locations l ON tl.location_id = l.location_id
JOIN job_types jt ON j.job_type_id = jt.job_type_id
JOIN job_priorities jp ON j.priority_id = jp.priority_id
JOIN fuel_records fr ON t.truck_id = fr.truck_id
JOIN fuel_stations fs ON fr.fuel_station_id = fs.fuel_station_id
Look for patterns in the logs: Start looking at individual examples in the
errors and look for patterns with the event right before the OOM errors. Are you
seeing the same query? Maybe large sorts, or large JOIN
operations. Are you
seeing secondary process jobs, ie Sidekiq, cron, etc? Large analytics-type
queries? Those could be large or misconfigured.
Common Fixes for Could Not Resize Shared Memory Segment
Decrease reliance on hash tables and add indexes
In what I’ve seen in the wild, hash tables seem to be the main culprit for these
types of errors, so that’s a good place to start. Hash joins are used for very
large joins across tables and Postgres will create an in-memory hash table to
store some of the data. Systems with large amounts of memory or larger
work_mem
settings can favor hash joins over other join methods like nested
loops or merges if the data being joined is small enough to fit into work_mem
but large enough that (or indexed so that) a nested loop is inefficient.
You can see which strategy the query planner is using by looking at the query’s
EXPLAIN
plan, ie:
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*)
FROM trucks t
Finalize Aggregate (cost=238.12..238.13 rows=1 width=8) (actual time=5.276..5.276 rows=1 loops=1)
Buffers: shared hit=29
-> Gather (cost=238.01..238.12 rows=2 width=8) (actual time=5.236..5.272 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=29
-> Partial Aggregate (cost=238.01..238.02 rows=1 width=8) (actual time=5.226..5.227 rows=1 loops=3)
Buffers: shared hit=29
-> HashAggregate (cost=238.00..238.01 rows=1 width=4) (actual time=5.213..5.217 rows=3 loops=3)
Group Key: trucks.id
Buffers: shared hit=29
-> Hash Join (cost=37.75..236.75 rows=500 width=4) (actual time=0.605..4.879 rows=70 loops=3)
Hash Cond: (truck_locations.job_id = trucks.id)
Buffers: shared hit=29
-> Seq Scan on truck_locations (cost=0.00..18.20 rows=820 width=8) (actual time=0.010..0.054 rows=10 loops=3)
Buffers: shared hit=3
-> Hash (cost=27.25..27.25 rows=820 width=4) (actual time=0.575..0.576 rows=10 loops=3)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
Buffers: shared hit=26
Planning Time: 0.256 ms
Execution Time: 36.562 ms
Since the datasets being joined are fairly large, it may be possible to nudge the planner toward merge joins versus hash joins by adding indexes on the join keys of both of the tables. The join keys themselves are already indexed but since there are additional criteria in the queries for filters and other uses, including those columns in indexes can be beneficial.
A good rule of thumb is that if a query has a WHERE filter on column A and joins to another table via column B, a multicolumn index on (A, B) will help by reducing the amount of data being joined.
Decreasing work_mem
It is possible that your work_mem is set too generously and you’re allowing too much memory per worker.
Decreasing max_parallel_workers
You may want to peek at the settings you have for parallel workers. If you have a high work_mem setting, lots of parallel workers, and hash joins, you may be over allocating resources.
Dig into the queries
In a lot of cases, working through a specific query to make it more performant might be the place to go for fixing your OOM issues.
- Adding
WHERE
clauses orLIMIT
s toSELECT *
queries can be a good starting place. - Creating views or materialized views to store table join data could help your database as well.
Add more memory
After you’ve added indexes and done what you can with individual queries, if you continue to see these errors, you might need to add more memory to your machine.
Quick summary
- ERROR: could not resize shared memory segment is probably just a single query or operation that took up all your memory.
- If you just have one of these, its no big deal. If you have a lot of them, its still no big deal. This is Postgres, everything’s fixable. There’s some simple things you can do to optimize and add indexes before you upgrade your instance to larger memory.
- Look in your logs for the queries or processes giving the OOM errors.
- Your #1 place to look is for queries joining large tables where the data being
processed fits within
work_mem
. Adding indexes to strategically limit the amount of data being processed can help.