Jean-Paul Argudo
13 min read
Related Articles
- 8 Steps in Writing Analytical SQL Queries
- 4 Ways to Create Date Bins in Postgres: interval, date_trunc, extract, and to_char
- Crunchy Postgres for Kubernetes 5.7: Faster Backups, Automated Snapshots, Postgres 17 and More
- pg_parquet: An Extension to Connect Postgres and Parquet
- Convert JSON into Columns and Rows with JSON_TABLE
How to Cheat at WORDLE with PostgreSQL
What is Wordle?
Wordle became very popular on the internet very quickly. It's a 5 letter word game to guess among all possible words in a given language. The French version is like the English one, except words aren't be written with the accents we have in French, like “é, è, ê, ë, à, ô,” etc. Words with the special character “œ” are written with 2 characters like “oe”.
Prepare the database for some fun
The French language has 7980 5-letter words. At least according to this website.
Yes, I did copy/paste every 21 pages of the words in a text file. With some vim commands I’ve been able to quickly format that file like this:
$ head -5 mots.txt
abaca
abale
abats
abbes
abces
I used a Crunchy Bridge instance to load the data and then I wrote a basic SQL script to create the table and add the words in it like this:
drop table if exists wordle;
create table wordle
(mot char(5) primary key);
\copy wordle from mots.txt
analyze wordle;
While I was trying to find the best ways, and there are many, to achieve this I found that decomposing each word letter by letter was useful for some statistics.
We could alter the table to create columns and store each character of the word, one by one, on each char(1). A long time ago I was taught never to store data in a database when you can have it otherwise. I did create a view like the following:
create view w5 as
select
mot
,substr(mot,1,1)::char(1) as c1
,substr(mot,2,1)::char(1) as c2
,substr(mot,3,1)::char(1) as c3
,substr(mot,4,1)::char(1) as c4
,substr(mot,5,1)::char(1) as c5
from
wordle;
I enforced ::char(1) as I know it will always be char(1) there. If you don’t do that, PostgreSQL will use text instead. I don't think there's any difference in performances there. But I prefer when data is stored as little as it can be and represented in the database in the smallest possible data type. Still, it’s a view, so there’s no extra storage here.
So here is how the data looks:
select * from w5 limit 5;
┌───────┬────┬────┬────┬────┬────┐
│ mot │ c1 │ c2 │ c3 │ c4 │ c5 │
├───────┼────┼────┼────┼────┼────┤
│ abaca │ a │ b │ a │ c │ a │
│ abale │ a │ b │ a │ l │ e │
│ abats │ a │ b │ a │ t │ s │
│ abbes │ a │ b │ b │ e │ s │
│ abces │ a │ b │ c │ e │ s │
└───────┴────┴────┴────┴────┴────┘
(5 rows)
Next step: starting to play
Now we can start to play and enter a first word. Everyone uses their “special first word” to start based on experience and habits.
Since I’m mostly a DBA, I want to optimize my chances. So I’ll find one of the words containing the most often used letters. Since I have a complete set of 5 character words in my database, I have to do some stats.
What are the most often used letters in my table?
The SQL trick here is using the function everyone should know : generate_series()
. This function will create an artificial series of numbers from start to end. Since ascii is a well organized thing, I know "a" is chr(97) and z is chr(122). So I first create an artificial table thanks to “with” syntax, to have all letters from the alphabet somewhere.
Then I count the occurrences of each letter in the list of words. If the letter is found in the word, it will count for one. If the word has multiple occurrences of the letter, like E is present three times in “epees”, it will still count for one. It doesn’t matter much here: the number of words is like 8000, all I’m searching for is statistics.
So the query looks like:
with lettres as
(select
-- chr(97) = 'a'
-- chr(122) = 'z'
chr(generate_series(97,122)) lettre)
select
lettre, count(*)
from
wordle, lettres
where
mot ~ lettre
group by 1
order by 2 desc
limit 5;
And gives this result:
┌────────┬───────┐
│ lettre │ count │
├────────┼───────┤
│ e │ 4338 │
│ a │ 4068 │
│ s │ 2994 │
│ i │ 2781 │
│ r │ 2476 │
└────────┴───────┘
(5 rows)
So letters E, A, S, I and R are the 5 most-used letters in the 5 letters words list.
Now I reuse the previous query to find a set of some candidate words to start with:
with
lettres as
(select
-- chr(97) = 'a'
-- chr(122) = 'z'
chr(generate_series(97,122)) lettre),
top5 as
(select
lettre as l
, count(*)
from
wordle, lettres
where
mot ~ lettre group by 1 order by 2 desc limit 5)
select
mot from
wordle where
mot ~ all(array(select l from top5));
The SQL trick used here is that ~ all(array(...))
.
It means that I want all the words containing that exact set of those 5 letters I qualified as “most-used letters” in the words set. I’ll use that trick all along the article because it is really handy. This gives me a list of 7 words:
┌───────┐
│ mot │
├───────┤
│ aires │
│ arise │
│ raies │
│ reais │
│ serai │
│ seria │
│ siera │
└───────┘
(7 rows)
Since I want to maximize my chances, I have to figure out which of the 7 will help me most.
There is some strategy here. And a lot of combinations. I am sure I could do even better to figure out what are my best chances. But I wondered how many words starts with A, R or S ?
That’s an opportunity to give an example of that marvelous FILTER
we have now in PostgreSQL, and also an example of the SIMILAR TO
usage:
select
'starting with a' as label, count(*) filter (where mot similar to 'a____'),
'starting with r' as label, count(*) filter (where mot similar to 'r____'),
'starting with s' as label, count(*) filter (where mot similar to 's____')
from wordle ;
┌─────────────────┬───────┬─────────────────┬───────┬─────────────────┬───────┐
│ label │ count │ label │ count │ label │ count │
├─────────────────┼───────┼─────────────────┼───────┼─────────────────┼───────┤
│ starting with a │ 531 │ starting with r │ 508 │ starting with s │ 595 │
└─────────────────┴───────┴─────────────────┴───────┴─────────────────┴───────┘
(1 rows)
So if I want to select a most representative word, it has to start with an S.
What about words endings? Would that make a difference to find a better suited word to start?
\x
select
'end with s' as label, count(*) filter (where mot similar to '____s'),
'end with a' as label, count(*) filter (where mot similar to '____a'),
'end with i' as label, count(*) filter (where mot similar to '____i'),
'end with e' as label, count(*) filter (where mot similar to '____e')
from wordle ;
┌─[ RECORD 1 ]───────┐
│ label │ end with s │
│ count │ 2036 │
│ label │ end with a │
│ count │ 932 │
│ label │ end with i │
│ count │ 438 │
│ label │ end with e │
│ count │ 1923 │
└───────┴────────────┘
Wow! So words ending with an S are 2036 out of ~8000. Thats a 1⁄4 the words. So I taught myself that taking one of the word ending with an S would be a best fit.
Thanks to the first query, starting with an A would be even better since it cannot start and end with the same "S" letter. So in that list, there’s only one relevant word here:
│ aires │
│ arise │
│ raies │
│ reais │
│ serai │
│ seria │
│ siera │
It was the first in the list. Pure coincidence.
So the best word to start given those small stats set is “aires”. Again, I am sure one can do even better stats to find a better suited word to start, depends your strategy. My strategy is to eliminate as many words as I can at each round.
First iteration
As per 28th Feb 2022 French wordle
The 1st iteration gives this:
So, there’s no A or S in today’s wordle. But there is at least one I, one R and one E.
Let’s reduce possibilities knowing this.
So the query will look like:
select
count(*) --because on 1st iteration, there will be lots of possibilities
from
wordle
where
-- letters in the word
mot ~ all(array['i','r','e']) and
-- letters not in the word
mot !~ all(array['a','s']) and
-- letters we found and we know they are misplaced
mot not similar to ('_i___|__r__|___e_');
┌───────┐
│ count │
├───────┤
│ 92 │
└───────┘
(1 row)
Using the view makes something different:
select
count(*) --because on 1st iteration, there will be lots of possibilities
from
w5
where
-- letters in the word
mot ~ all(array['i','r','e']) and
-- letters not in the word
mot !~ all(array['a','s']) and
-- letters we found and we know they are misplaced
c2 <> 'i' and
c3 <> 'r' and
c4 <> 'e'
;
┌───────┐
│ count │
├───────┤
│ 92 │
└───────┘
(1 row)
2nd iteration
Now we have to find another “best pick” for a 2nd word. There will be many strategies here, and I don’t pretend mine is best. After all, this article is just to give some tricks in SQL.
It doesn’t have any of A or S, but has at least one I, one R and one E.
Let’s find a word with the 3 next most common found letters in those 92 words.
with lettres as (
select
chr(generate_series(97,122)) lettre),
words as
(select
mot
from
w5
where
mot ~ all(array['i','r','e']) and
mot !~ all(array['a','s']) and
c2 <> 'i' and
c3 <> 'r' and
c4 <> 'e'
)
select
lettre, count(*)
from
words, lettres
where
mot ~ lettre
group by 1
order by 2 desc
limit 10;
┌────────┬───────┐
│ lettre │ count │
├────────┼───────┤
│ e │ 92 │
│ r │ 92 │
│ i │ 92 │
│ t │ 22 │
│ o │ 17 │
│ u │ 14 │
│ m │ 14 │
│ n │ 14 │
│ f │ 13 │
│ g │ 12 │
└────────┴───────┘
(10 rows)
Let’s pick a word containing our I, E, R and also T and O :
select mot from wordle
where mot ~* all(array['i','e','r','t','o']);
┌───────┐
│ mot │
├───────┤
│ ortie │
│ riote │
│ rotie │
└───────┘
(3 rows)
I could do more stats to find which is better to chose.
We see right away 2nd word “RIOTE” can’t be the word we’re searching, since it has “I” in 2nd position we know it’s impossible. So the choice has to be between ORTIE and ROTIE. Let’s choose ORTIE then...
2nd iteration results
So, we found another letter in the word : O.
We find also the right place of “at least one” E in the last position.
Let’s refine the 1st iteration query now we have another letter O. And now we know at least one E has to be in 5th position, and that there’s no A, S or T:
select
mot
from
w5
where
mot ~ all(array['o','i','r','e']) and
mot !~ all(array['a','s','t']) and
c2 not in ('i','r') and
c3 <> 'r' and
c4 not in ('e','i') and
c5 = 'e'
;
┌───────┐
│ mot │
├───────┤
│ boire │
│ foire │
│ moire │
│ noire │
│ poire │
│ roide │
│ voire │
└───────┘
(7 rows)
Now the choice is reduced to 7 possibilities. That’s quite unlucky: we know for sure the solution has to end with “OIRE”… So basically we’ll have to choose one letter from B, F, M, N P, R or V.
Let’s try to keep focused on the method: what’s the most common letter in words that have O, I, R and E in the word (not only ending with OIRE!) in French, without any A, T or S in it?
We will get rid of the “ending with E” to cover more words, so we have better stats.
with lettres as (
select
chr(generate_series(97,122)) lettre),
words as
(select
mot
from
w5
where
mot ~ all(array['o','i','r','e']) and
mot !~ all(array['a','t','s'])
)
select
lettre, count(*)
from
words, lettres
where
mot ~ lettre
group by 1
order by 2 desc
limit 10;
┌────────┬───────┐
│ lettre │ count │
├────────┼───────┤
│ i │ 20 │
│ o │ 20 │
│ e │ 20 │
│ r │ 20 │
│ b │ 5 │
│ v │ 2 │
│ d │ 2 │
│ g │ 2 │
│ n │ 2 │
│ p │ 2 │
└────────┴───────┘
(10 rows)
So that’s B !
Let’s add that B on OIRE, to have the word of 3rd iteration : BOIRE (means “to drink”).
3rd iteration result
So about 7 possibilities, we picked the most often letter in words having any of O, I, R and E, and not having any A, T and S. That letter was B, and so we found the right word on the fly.
Conclusion
This article has only one purpose and it wasn’t about cheating. I encourage you not to because cheating ruins any game.
It’s about showing you a bit of some more advanced techniques about pattern matching in PostgreSQL.
Note that I could have written queries without the need of cutting words in letters using that view I created. It’s your choice. This also shows how versatile PostgreSQL is in giving you the choice of doing what bests fits and what you find more suitable or maintainable, etc.
Here’s a final example of equal results queries. One is using c1..c2 method, the other only deals with the word column:
|
|
Both queries are equivalent in terms of results, but one can figure out PostgreSQL does better with those atomic comparisons with c1..c5 than it does with “regexp” like version in terms of performance.
Obviously, this will be of interest with a huge collection of words or everything else, and doesn’t mean much on a 8000 tuples table. I really hope you had as much fun to read this article than I had writing it. And hope you were able to learn something from it!
Thanks to David Christensen and Greg Sabino Mullane from Crunchy Data who gave me some tricks that I used in this article.
Related Articles
- 8 Steps in Writing Analytical SQL Queries
8 min read
- 4 Ways to Create Date Bins in Postgres: interval, date_trunc, extract, and to_char
8 min read
- Crunchy Postgres for Kubernetes 5.7: Faster Backups, Automated Snapshots, Postgres 17 and More
4 min read
- pg_parquet: An Extension to Connect Postgres and Parquet
4 min read
- Convert JSON into Columns and Rows with JSON_TABLE
5 min read