Fun with PostgreSQL puzzles: Finding shortest paths and travel costs with functions
This article will contain spoilers both on how I solved 2022 Day 16's challenge "Probscidea Volcanium" using SQL, as well as general ideas on how to approach the problem. I recommend trying to solve it yourself first, using your favorite language. This article is delayed from the actual puzzle's release. My solutions may not be the "best" solution, as the goal is to provide a quick solution.
Hands on Tutorial
We've also loaded a tutorial for Day 16's challenge if you want to try it with a pre-loaded data set.
I am not going to lie, this challenge was a tough one. That factor, plus other life obligations, means this post is long overdue. Day 16 was near and dear to me, as it involves elephants, the Postgres mascot! In this scenario, we find ourselves trapped with a herd of elephants in a volcano, amongst a bunch of interconnected tubes, containing pressure release valves. In short, the world's most interesting Travelling Salesman Problem (TSP)!
Amongst the Postgres / SQL items used to solve this are:
- The
file_fdw
extension, to read text files in from the system. - CTEs (Common Table Expressions) to keep our queries readable and avoid ugly sub-selects.
- PL/pgSQL functions and some upserts (i.e.
ON CONFLICT
handlers). - Various string formatting functions such as
substr
,string_to_array
,substring
,left
,right
, andposition
- Array items including the
unnest
function and the array "overlaps" operator&&
. - Unlogged tables
- The
pg_stat_user_functions
view.
Following the link to Day 16 will give a lot more details about the challenge, you may want to read that first. We are provided with an input file that looks like this:
Valve AA has flow rate=0; tunnels lead to valves DD, II, BB
Valve BB has flow rate=13; tunnels lead to valves CC, AA
Valve CC has flow rate=2; tunnels lead to valves DD, BB
Valve DD has flow rate=20; tunnels lead to valves CC, AA, EE
Valve EE has flow rate=3; tunnels lead to valves FF, DD
Valve FF has flow rate=0; tunnels lead to valves EE, GG
Valve GG has flow rate=0; tunnels lead to valves FF, HH
Valve HH has flow rate=22; tunnel leads to valve GG
Valve II has flow rate=0; tunnels lead to valves AA, JJ
Valve JJ has flow rate=21; tunnel leads to valve II
AOC 2022 Day 16 Part One
The goal in Part One is to reduce as much pressure as possible in 30 minutes. Times are cumulative, so opening valve BB releases 13 units of pressure for every minute it is open after that. Travelling from one valve (aka node) to another costs one minute, and opening a valve also costs one minute. The goal is to find the optimal path to release the most pressure in 30 minutes, when beginning at node AA.
We will start with our usual steps to transform the text file above into something more SQL friendly:
CREATE EXTENSION if not exists file_fdw;
CREATE SERVER if not exists aoc2022 FOREIGN DATA WRAPPER file_fdw;
DROP SCHEMA if exists elephant CASCADE;
CREATE SCHEMA elephant;
SET search_path = elephant;
DROP FOREIGN TABLE if exists aoc_day16;
CREATE FOREIGN TABLE aoc_day16 (line text)
SERVER aoc2022 options(filename '/tmp/aoc2022.day16.input');
Let's start by parsing those lines using regular expressions into separate values. In the block below, we are converting our text file into separate rows in the tubes table. First we use some regexes to pull information out of each line into a CTE named y
which contains an array of valves connected to the source one:
WITH x AS (SELECT * FROM aoc_day16)
,y AS (SELECT line
,substr(line,7,2) AS valve
,substring(line from 'rate=(\d+)')::int AS rate
,string_to_array(substring(line from 'valves? (.+)'),', ') AS targets
FROM x)
SELECT valve, rate, targets FROM y LIMIT 10;
valve | rate | targets
-------+------+------------
AA | 0 | {DD,II,BB}
BB | 13 | {CC,AA}
CC | 2 | {DD,BB}
DD | 20 | {CC,AA,EE}
EE | 3 | {FF,DD}
FF | 0 | {EE,GG}
GG | 0 | {FF,HH}
HH | 22 | {GG}
II | 0 | {AA,JJ}
JJ | 21 | {II}
Before we go any further, we need to simplify things. If you notice that some of the valves have a zero rate - they can be expunged from the final list. For example, here are valves EE and FF:
EE | 3 | {FF,DD}
FF | 0 | {EE,GG}
So EE leads to FF and DD, and opening valve EE will release 3 pressure. Valve FF leads to EE and GG, but releases no pressure. So there is no reason to ever stop at FF. In which case, the only reason for its existence is as a way to connect one valve to another. In this case, we can shorten EE -> FF -> GG to the path EE -> GG with an additional travel cost. So what we want to do is break things into simple paths from one valve to another, along with its travel cost. We discard any that end in a valve that has a zero value, as it does not need opening. We need a function that does this by walking through every single valve and finding the shortest path from one valve to another. Before running the function, we want to have a place to store the results, so let's create a simple table, along with a couple of indexes:
CREATE UNLOGGED TABLE tubes (start text, cost smallint, finish text, rate int);
CREATE UNIQUE INDEX path_unique ON tubes (start, finish) INCLUDE (cost, rate);
CREATE INDEX tube_index ON tubes (cost,start,finish) include (rate);
We use the INCLUDE
clause to ensure that while we have a unique index on the start and finish columns, we also store the other two columns as well, so we can use an index-only scan on that index as we need to grab all the columns later on.
The function itself:
CREATE or REPLACE function lesspaths()
returns text language plpgsql as $$
DECLARE
myrec record; myrec2 record; round int = 0; pathmade bool;
BEGIN
-- Populate the first round
WITH x AS (SELECT * FROM aoc_day16)
,y AS (SELECT line
,substr(line,7,2) AS valve
,substring(line from 'rate=(\d+)')::int AS rate
,string_to_array(substring(line from 'valves? (.+)'),', ') AS targets
FROM x)
,z AS (SELECT valve, rate, unnest(targets) AS ending FROM y)
INSERT INTO tubes
SELECT valve, 1, ending,
(select y.rate from y where y.valve = z.ending limit 1) rate
FROM z;
LOOP
round = round + 1;
pathmade = false;
FOR myrec IN SELECT * FROM tubes WHERE cost = round LOOP
FOR myrec2 IN SELECT * FROM tubes WHERE cost=1 LOOP
-- AABB joins with BBCC to become AACC
IF myrec.finish = myrec2.start AND myrec.start <> myrec2.finish
THEN
INSERT INTO tubes
SELECT myrec.start, myrec.cost+1, myrec2.finish, myrec2.rate
ON CONFLICT DO NOTHING;
IF FOUND THEN pathmade = true; END IF;
END IF;
END LOOP; -- myrec2
END LOOP; -- myrec
IF NOT pathmade THEN RAISE NOTICE 'Leaving after % rounds', round; EXIT; END IF;
END LOOP;
DELETE FROM tubes WHERE rate=0;
RETURN 'Done with lesspaths';
END;
$$;
Let's break it down:
CREATE or REPLACE function lesspaths()
returns text language plpgsql as $$
DECLARE
myrec record; myrec2 record; round int = 0; pathmade bool;
BEGIN
This is all basic setup - declare the language, what the function returns, and what variables we are going to use.
-- Populate the first round
WITH x AS (SELECT * FROM aoc_day16)
,y AS (SELECT line
,substr(line,7,2) AS valve
,substring(line from 'rate=(\d+)')::int AS rate
,string_to_array(substring(line from 'valves? (.+)'),', ') AS targets
FROM x)
,z AS (SELECT valve, rate, unnest(targets) AS ending FROM y)
INSERT INTO tubes
SELECT valve, 1, ending,
(select y.rate from y where y.valve = z.ending limit 1) rate
FROM z;
In this block, we are converting our text file into separate rows in the tubes table. First we use some regexes to pull information out of each line into a CTE named y
. Then we turn around and use the unnest
function to break the array we built apart, such that we have a mapping from one valve to one other valve. We then insert that into the tubes
table, using a simple sub-select (select y.rate from y where y.valve = z.ending limit 1)
to record the rate of the "ending" valve for each pair. At this point, the tubes table looks like this:
DD | 1 | CC | 2
DD | 1 | AA | 0
DD | 1 | EE | 3
EE | 1 | FF | 0
EE | 1 | DD | 20
FF | 1 | EE | 3
FF | 1 | GG | 0
Now the tricker bit. At a high level, we are going to walk through every new row, see if we can extend the list of valves to add another on the end, and insert it as a new row:
LOOP
round = round + 1;
pathmade = false;
FOR myrec IN SELECT * FROM tubes WHERE cost = round LOOP
FOR myrec2 IN SELECT * FROM tubes WHERE cost=1 LOOP
We start a generic loop, that keeps going until we break out of it with an EXIT
command. The round
variable is a proxy for the number of steps it takes to get from one valve to another. As seen above, everything in the tubes table is currently one step from each other. Therefore, the first time through we are looping through all rows in the tubes table, and putting each row into the myrec
variable. Then for each of those, we walk through all the initial items, in other words, valves that are one step from each other with a cost of 1.
-- AABB joins with BBCC to become AACC
IF myrec.finish = myrec2.start AND myrec.start <> myrec2.finish
THEN
INSERT INTO tubes
SELECT myrec.start, myrec.cost+1, myrec2.finish, myrec2.rate
ON CONFLICT DO NOTHING;
Next we check if the current path is joinable to another node. In other words, if we have valve A going to valve D, and we know that valve D goes to valve Q, we know that travel is possible from valve A to valve Q. We also add in the myrec.start <> myrec2.finish
qualifier, so we don't try to make a A -> D -> A
path. For each of the new paths we find, we attempt to insert it into our tubes table. Recall that the tubes table has a unique constraint on start, finish
. That allows us to use ON CONFLICT DO NOTHING
to ensure only a single row exists for each pair of valves. Because we are walking them in order of travel cost, we are finding the shortest distance between any two valves! Wrapping up:
IF FOUND THEN pathmade = true; END IF;
END IF;
END LOOP; -- myrec2
END LOOP; -- myrec
IF NOT pathmade THEN
RAISE NOTICE 'Leaving after % rounds', round; EXIT;
END IF;
END LOOP;
We use IF FOUND
to tell if the INSERT
statement above it actually was able to add a row or not. This allows us to keep looping at higher and higher travel costs until there are no new paths. So once a round has no new INSERT
S at all, pathmade
stays false, and we can then leave the entire main loop with the EXIT
command, as we have found all combinations of valves.
DELETE FROM tubes WHERE rate=0;
RETURN 'Done with lesspaths';
END;
$$;
Our final command at the end of the function is to remove any "dead" paths, in which the ending valve has a rate of 0. As you recall, there is no reason to ever end up on these nodes, so we can remove them. At this point, the tubes table looks like this:
start | cost | finish | rate
-------+------+--------+------
AA | 1 | BB | 13
AA | 2 | CC | 2
AA | 1 | DD | 20
AA | 2 | EE | 3
AA | 5 | HH | 22
AA | 2 | JJ | 21
BB | 1 | CC | 2
BB | 2 | DD | 20
BB | 3 | EE | 3
BB | 6 | HH | 22
BB | 3 | JJ | 21
All of this is setup for the real problem. The lesspaths
function only takes about half a second:
SELECT lesspaths();
NOTICE: Leaving after 17 rounds
lesspaths
---------------------
Done with lesspaths
(1 row)
Time: 530.416 ms
The actual input file had 64 lines, which led to 4032 rows in the tubes table before the final deletion, and 945 rows after the deletion. Still a lot of paths, but manageable as we will see in the next function:
CREATE OR REPLACE function walkcost
(startvalve text, timeleft smallint, openvalves text)
returns smallint immutable language plpgsql as $$
DECLARE
myrec record; thiscost smallint; newtime smallint; bestcost smallint = 0;
BEGIN
FOR myrec IN SELECT finish,cost,rate FROM tubes
WHERE start = startvalve AND timeleft > cost+1
AND position(' '||finish||' ' IN openvalves) = 0
LOOP
newtime = timeleft - myrec.cost - 1;
thiscost = (newtime * myrec.rate) + walkcost
(myrec.finish, newtime, openvalves || ' ' || myrec.finish || ' ');
IF thiscost > bestcost THEN bestcost = thiscost; END IF;
END LOOP;
RETURN bestcost;
END;
$$;
A small function, but it does a lot. It starts at a single valve (in our case 'AA'), with a known amount of time left, and a list of which valves are already open. We then walk through all possible valves reachable from the current one, and make a recursive call back to the same function to find the "downstream" cost. For each possible ending valve, we compute which one is the best (has reduced the most pressure), and return that back to the caller. Let's break it down a bit:
CREATE OR REPLACE function walkcost
(startvalve text, timeleft smallint, openvalves text)
returns smallint immutable language plpgsql as $$
DECLARE
myrec record; thiscost smallint; newtime smallint; bestcost smallint = 0;
BEGIN
Standard setup of setting the name of the function, its arguments, and declaring the additional variables to use. The first argument startvalve
is the current valve we are starting from. The second argument timeleft
is the amount of time left in minutes, which decreases as we travel from valve to valve, and also by one minute each time we open a valve. These two costs are combined, as we never travel without also opening a valve. The third argument openvalves
is a simple text string we will use to keep track of which valves are currently open. We also declare four more variables: myrec
stores a row from the tubes table, thiscost
is the total cost for opening a valve and then going to a specific path, newtime
helps adjust the timeleft
, and bestcost
stores the winning value of thiscost
in each loop.
FOR myrec IN SELECT finish,cost,rate FROM tubes
WHERE start = startvalve AND timeleft > cost+1
AND position(' '||finish||' ' IN openvalves) = 0
LOOP
Here we start the main loop. We want to find all rows that start with the current valve, and then we use timeleft > cost+1
to rule out visiting any valves that will cause us to run out of time before opening them. We also want to exclude any previously-opened valves, by searching our passed in string openvalves
. We pad the value with a space on both sides to make sure we have an exact match. For example, the string AA BB FF
will exclude any rows ending in AA, BB, or FF, as the position
function will return a non-zero value. So at this point, we have all possible paths and we can start looping through them one by one.
newtime = timeleft - myrec.cost - 1;
thiscost = (newtime * myrec.rate) + walkcost
(myrec.finish, newtime, openvalves || ' ' || myrec.finish || ' ');
IF thiscost > bestcost THEN bestcost = thiscost; END IF;
END LOOP;
RETURN bestcost;
In this section, we first decrease the time based on the travel cost to the new valve, plus an additional minute to account for the time to open the valve. Next, we compute the total cost of opening this valve with newtime * myrec.rate
. We add to that the cost of opening other valves after this one, by calling the function recursively and passing in our current ending valve as the new starting valve, the newly-decreased time, and the list of open valves with our current ending one appended to the list. The recursion happens until there are no more paths to consider, at which point the "best" values will bubble up and get returned to each level.
Once that has returned, the line IF thiscost > bestcost THEN bestcost = thiscost; END IF;
stores the winning value among all the 'finish' valves in the current loop. Then we return that cost.
This is a crude but very effective search of all possibilities of travel through the tubes, and at the end it will generate the final answer. The test input returns very quick and gives an answer of 1651
. The real input still returns quite fast, given all the work it is doing. Before we run it, let's vacuum the table to increase the odds of getting some index-only scans, and turn on track_functions
to get some stats about the calls:
VACUUM FREEZE tubes;
SET track_functions = 'pl';
\timing
SELECT walkcost('AA', 30::smallint, ' ');
walkcost
----------
1653
(1 row)
Time: 2716.884 ms (00:02.717)
Not too shabby at all! Let's peek at the pg_stat_user_functions
table to see how many times we recursively called this function (confusingly enough this is not a SQL 'recursive function', just a function that has been called recursively!)
SELECT * FROM pg_stat_user_functions WHERE funcname = 'walkcost';
funcid | schemaname | funcname | calls | total_time | self_time
---------+------------+----------+--------+------------+-----------
1054235 | elephant | walkcost | 233941 | 2783.138 | 2783.138
(1 row)
The function ran over 230 thousand times in less than three seconds! Or to put it another way, the function ran 84,000 times per second. Not bad. Part of the AOC challenge is optimizing your solution, both to make sure things run in a reasonable time frame, and because optimizing is fun. I did play with some ideas for optimizing the function above, but the only one that mattered was changing from int to smallint. Things that did not work:
- Using json or arrays to track the open valves - plain old text was faster.
- Doing a short-circuit of valves, by seeing at certain points if we abandon the current path because there would be no way it could beat the current winning path. While this is doable, it turns out the cost of looking that up is not worth it compared to brute forcing our way through the entire collection of paths.
- Transforming the text valves (e.g. 'AA') to simple numbers. I thought this would be a bigger advantage than it was, but it turns out Postgres is very efficient at storing and modifying simple text values.
- Keeping track of known sub-paths. As above, the cost of creating and accessing those was not worth it.
- Using a true recursive function. Way too slow and cumbersome.
AOC 2022 Day 16 Part Two
Part Two, as always, introduces a significant wrinkle to the original challenge. This time, we learn that we can train an elephant to also open valves. So the new goal is to find the maximum pressure released in 26 minutes if you and the elephant are simultaneously going through and opening valves! At first, this caused a little bit of a panic, at the thought of tracking two actors at once, which arrive at valves at different times, and keeping a common state of which valves are open.
However, after some thinking about this, I found a much better approach (this is a big spoiler, so see if you can think of a better solution yourself before going on).
- Because you and the elephant do not interact, and never open the same valves, we are examining two separate disjoint paths.
- We do not need to even run at the same time as the elephant. We both have optimal paths, so we could run a path, then the elephant could. Or vice-versa. Or at the same time. All that matters is the combined total.
- Since all we need is the highest combined value of two runs, we can look at all the possible paths that can occur given 26 minutes, then find the set of two non-overlapping ones that give the highest combined value.
- We know that both ourselves and the elephant will open roughly the same number of valves, which allows for some nice shortcuts.
So rather than a running total of the best time, we are going to need to store all the possible solutions and then pick out the top two such that no valve is opened twice. Our previous solution will need some modification. First, we need a place to store all the paths, so let's create a table for that:
CREATE UNLOGGED TABLE pathscore (
vpath text,
score smallint,
apath text[]
);
As usual, this will be an unlogged table as it is for holding temporary information. In this case, we will store the path in vpath
, the total score
, and later on, a sorted array of the vpath information. Next we need a function to walk through and populate the table:
CREATE OR REPLACE function walk_with_an_elephant
(startvalve text, timeleft smallint, openvalves text, currentcost int)
returns void volatile language plpgsql as $$
DECLARE
myrec RECORD; newtime smallint;
BEGIN
IF timeleft < 7 AND currentcost >= 1000 AND length(openvalves) BETWEEN 20 AND 25 THEN
INSERT INTO pathscore select openvalves, currentcost;
END IF;
FOR myrec IN SELECT finish,cost,rate FROM tubes WHERE start = startvalve AND timeleft > cost+1
AND position(' '||finish||' ' IN openvalves) = 0
LOOP
newtime = timeleft - myrec.cost - 1;
PERFORM walk_with_an_elephant(
myrec.finish,
newtime,
openvalves || ' ' || myrec.finish || ' ',
currentcost + (newtime * myrec.rate)
);
END LOOP;
RETURN;
END;
$$;
This function requires some explanations. At the top, we have:
IF timeleft < 7 AND currentcost >= 1000 AND length(openvalves) BETWEEN 20 AND 25 THEN
INSERT INTO pathscore select openvalves, currentcost;
END IF;
This is where we are storing the current path, represented by the same openvalves
text string, we used before, as well as the current score, which is an additional argument to our function. Because we want to limit the size of our pathscore
table as much as possible, and because each INSERT
slows things down quite a bit, there are some optimizations I made to speed things up:
timeleft < 7
ensures that we only store a path when we are nearing the end of our time. A manual look at the input file shows that all the costs for opening the valves are on the same order of magnitude, so there is no situation where ourselves or the elephant would ever open a few valves and then stop. All these optimizations required some deduction and some trial-and-error; but in a strict critical production situation we would want to loosen them up and be 100% sure we don't miss one of the best paths.currentcost >= 1000
ensures that we don't store any matching paths that are less than 1000 in score. RunningSELECT walkcost('AA', 26::smallint, ' ')
gives an answer of 1199, so it seems pretty safe to assume that there are at least two non-overlapping paths that both have a score of over 1000.BETWEEN 20 and 25
is a shortcut to exclude any paths that are too short or too long. In this case, as each valve is two letters, and surrounded by a space on each side, this is instructing the function to only insert paths that are between 6 and 8 valves long. This makes sense as we know that each valve costs at least one minute to move to, one to open, and there are 26 minutes in total.
The rest of the function is straightforward:
newtime = timeleft - myrec.cost - 1;
PERFORM walk_with_an_elephant(
myrec.finish,
newtime,
openvalves || ' ' || myrec.finish || ' ',
currentcost + (newtime * myrec.rate)
);
END LOOP;
Once again, we compute the new amount of time left to pass recursively into the function. This time, we don't need to return anything back up the stack, so we use PERFORM
to call the function and ignore any result. This is a quirk of PL/pgSQL that requires replacing SELECT
with PERFORM
if the return value is ignored. As before, we call the function with the new starting valve, the new time left, and a list of open valves. We also add in the current cost, as we need to know that as well.
Running it populates our table, and only takes 850ms:
SELECT walk_with_an_elephant('AA', 26::smallint, '', 0);
Even with all our optimizations, the pathscore table contains 28,719 rows! It looks like this:
SELECT * FROM pathscore ORDER BY random() LIMIT 5;
vpath | score | apath
--------------------------+-------+-------
GG DT KV HR NS RI | 1001 | ☃
KV NS HR YA RI | 1099 | ☃
KV DT SU GG NS HR | 1015 | ☃
RI KV YA NS SU GG | 1031 | ☃
RI NS YA HR KV RP | 1036 | ☃
(5 rows)
We need to be able to compare those vpaths and find ones that do not overlap each other. Sounds like a job for arrays! Let's populate the apath
column. We also want to sort the values, as without sorting the comparison is much slower.
UPDATE pathscore SET apath = array(
SELECT unnest(string_to_array(right(left(vpath,-1),-1), ' ')) ORDER BY 1
);
Now our pathscore table looks a little different:
SELECT * FROM pathscore ORDER BY random() LIMIT 5;
vpath | score | apath
--------------------------+-------+---------------------
GG SU DT NS KV RI | 1075 | {DT,GG,KV,NS,RI,SU}
RI KV NS HR SU GG | 1016 | {GG,HR,KV,NS,RI,SU}
SU DT KV NS YA HR | 1044 | {DT,HR,KV,NS,SU,YA}
KV NS YA GG RP | 1095 | {GG,KV,NS,RP,YA}
GG DT KV NS RI VT | 1014 | {DT,GG,KV,NS,RI,VT}
At this point, we are ready to solve the challenge. All we need is to find the two top highest paths which do not intersect each other. Here is some SQL to do that:
WITH x AS (select apath::text[], max(score) AS score FROM pathscore GROUP BY 1)
SELECT max(t1.score+t2.score) AS aoc_2022_day_16_part_two
FROM x t1, x t2
WHERE NOT t1.apath && t2.apath;
First we build the CTE named x
by looking at each possible combination of valves, and getting the top scoring one from each of those sets. In other words, the visiting order does not matter, only which valves opened. Then we query our CTE and join it to itself, such that no two paths have any of the same items - which is what the &&
operator does. Once the WHERE
clause gives us a combined list of all possible non-intersecting paths, we add up their costs and find the highest possible score:
aoc_2022_day_16_part_two
--------------------------
2223
(1 row)
Time: 10.143 ms
Voila! The answer to how much the pressure gets reduced when a human and an elephant work together to solve the TSP! :)
Related Articles
- Postgres Partitioning with a Default Partition
16 min read
- Iceberg ahead! Analyzing Shipping Data in Postgres
8 min read
- PostGIS Day 2024 Summary
8 min read
- Crunchy Data Warehouse: Postgres with Iceberg for High Performance Analytics
8 min read
- Loading the World! OpenStreetMap Import In Under 4 Hours
6 min read