What is Wordle?
Wordle became very popular on the internet very quickly. It's a 5 letter word game to guess among all possible words in a given language. The French version is like the English one, except words aren't be written with the accents we have in French, like “é, è, ê, ë, à, ô,” etc. Words with the special character “œ” are written with 2 characters like “oe”.
Prepare the database for some fun
The French language has 7980 5-letter words. At least according to this website.
Yes, I did copy/paste every 21 pages of the words in a text file. With some vim commands I’ve been able to quickly format that file like this:
$ head -5 mots.txt abaca abale abats abbes abces
I used a Crunchy Bridge instance to load the data and then I wrote a basic SQL script to create the table and add the words in it like this:
drop table if exists wordle; create table wordle (mot char(5) primary key); \copy wordle from mots.txt analyze wordle;
While I was trying to find the best ways, and there are many, to achieve this I found that decomposing each word letter by letter was useful for some statistics.
We could alter the table to create columns and store each character of the word, one by one, on each char(1). A long time ago I was taught never to store data in a database when you can have it otherwise. I did create a view like the following:
create view w5 as select mot ,substr(mot,1,1)::char(1) as c1 ,substr(mot,2,1)::char(1) as c2 ,substr(mot,3,1)::char(1) as c3 ,substr(mot,4,1)::char(1) as c4 ,substr(mot,5,1)::char(1) as c5 from wordle;
I enforced ::char(1) as I know it will always be char(1) there. If you don’t do that, PostgreSQL will use text instead. I don't think there's any difference in performances there. But I prefer when data is stored as little as it can be and represented in the database in the smallest possible data type. Still, it’s a view, so there’s no extra storage here.
So here is how the data looks:
select * from w5 limit 5; ┌───────┬────┬────┬────┬────┬────┐ │ mot │ c1 │ c2 │ c3 │ c4 │ c5 │ ├───────┼────┼────┼────┼────┼────┤ │ abaca │ a │ b │ a │ c │ a │ │ abale │ a │ b │ a │ l │ e │ │ abats │ a │ b │ a │ t │ s │ │ abbes │ a │ b │ b │ e │ s │ │ abces │ a │ b │ c │ e │ s │ └───────┴────┴────┴────┴────┴────┘ (5 rows)
Next step: starting to play
Now we can start to play and enter a first word. Everyone uses their “special first word” to start based on experience and habits.
Since I’m mostly a DBA, I want to optimize my chances. So I’ll find one of the words containing the most often used letters. Since I have a complete set of 5 character words in my database, I have to do some stats.
What are the most often used letters in my table?
The SQL trick here is using the function everyone should know :
generate_series(). This function will create an artificial series of numbers
from start to end. Since ascii is a well organized thing, I know "a" is chr(97)
and z is chr(122). So I first create an artificial table thanks to “with”
syntax, to have all letters from the alphabet somewhere.
Then I count the occurrences of each letter in the list of words. If the letter is found in the word, it will count for one. If the word has multiple occurrences of the letter, like E is present three times in “epees”, it will still count for one. It doesn’t matter much here: the number of words is like 8000, all I’m searching for is statistics.
So the query looks like:
with lettres as (select -- chr(97) = 'a' -- chr(122) = 'z' chr(generate_series(97,122)) lettre) select lettre, count(*) from wordle, lettres where mot ~ lettre group by 1 order by 2 desc limit 5;
And gives this result:
┌────────┬───────┐ │ lettre │ count │ ├────────┼───────┤ │ e │ 4338 │ │ a │ 4068 │ │ s │ 2994 │ │ i │ 2781 │ │ r │ 2476 │ └────────┴───────┘ (5 rows)
So letters E, A, S, I and R are the 5 most-used letters in the 5 letters words list.
Now I reuse the previous query to find a set of some candidate words to start with:
with lettres as (select -- chr(97) = 'a' -- chr(122) = 'z' chr(generate_series(97,122)) lettre), top5 as (select lettre as l , count(*) from wordle, lettres where mot ~ lettre group by 1 order by 2 desc limit 5) select mot from wordle where mot ~ all(array(select l from top5));
The SQL trick used here is that ~
It means that I want all the words containing that exact set of those 5 letters I qualified as “most-used letters” in the words set. I’ll use that trick all along the article because it is really handy. This gives me a list of 7 words:
┌───────┐ │ mot │ ├───────┤ │ aires │ │ arise │ │ raies │ │ reais │ │ serai │ │ seria │ │ siera │ └───────┘ (7 rows)
Since I want to maximize my chances, I have to figure out which of the 7 will help me most.
There is some strategy here. And a lot of combinations. I am sure I could do even better to figure out what are my best chances. But I wondered how many words starts with A, R or S ?
That’s an opportunity to give an example of that marvelous
FILTER we have now
in PostgreSQL, and also an example of the
SIMILAR TO usage:
select 'starting with a' as label, count(*) filter (where mot similar to 'a____'), 'starting with r' as label, count(*) filter (where mot similar to 'r____'), 'starting with s' as label, count(*) filter (where mot similar to 's____') from wordle ; ┌─────────────────┬───────┬─────────────────┬───────┬─────────────────┬───────┐ │ label │ count │ label │ count │ label │ count │ ├─────────────────┼───────┼─────────────────┼───────┼─────────────────┼───────┤ │ starting with a │ 531 │ starting with r │ 508 │ starting with s │ 595 │ └─────────────────┴───────┴─────────────────┴───────┴─────────────────┴───────┘ (1 rows)
So if I want to select a most representative word, it has to start with an S.
What about words endings? Would that make a difference to find a better suited word to start?
\x select 'end with s' as label, count(*) filter (where mot similar to '____s'), 'end with a' as label, count(*) filter (where mot similar to '____a'), 'end with i' as label, count(*) filter (where mot similar to '____i'), 'end with e' as label, count(*) filter (where mot similar to '____e') from wordle ; ┌─[ RECORD 1 ]───────┐ │ label │ end with s │ │ count │ 2036 │ │ label │ end with a │ │ count │ 932 │ │ label │ end with i │ │ count │ 438 │ │ label │ end with e │ │ count │ 1923 │ └───────┴────────────┘
Wow! So words ending with an S are 2036 out of ~8000. Thats a 1⁄4 the words. So I taught myself that taking one of the word ending with an S would be a best fit.
Thanks to the first query, starting with an A would be even better since it cannot start and end with the same "S" letter. So in that list, there’s only one relevant word here:
│ aires │ │ arise │ │ raies │ │ reais │ │ serai │ │ seria │ │ siera │
It was the first in the list. Pure coincidence.
So the best word to start given those small stats set is “aires”. Again, I am sure one can do even better stats to find a better suited word to start, depends your strategy. My strategy is to eliminate as many words as I can at each round.
As per 28th Feb 2022 French wordle
The 1st iteration gives this:
So, there’s no A or S in today’s wordle. But there is at least one I, one R and one E.
Let’s reduce possibilities knowing this.
So the query will look like:
select count(*) --because on 1st iteration, there will be lots of possibilities from wordle where -- letters in the word mot ~ all(array['i','r','e']) and -- letters not in the word mot !~ all(array['a','s']) and -- letters we found and we know they are misplaced mot not similar to ('_i___|__r__|___e_'); ┌───────┐ │ count │ ├───────┤ │ 92 │ └───────┘ (1 row)
Using the view makes something different:
select count(*) --because on 1st iteration, there will be lots of possibilities from w5 where -- letters in the word mot ~ all(array['i','r','e']) and -- letters not in the word mot !~ all(array['a','s']) and -- letters we found and we know they are misplaced c2 <> 'i' and c3 <> 'r' and c4 <> 'e' ; ┌───────┐ │ count │ ├───────┤ │ 92 │ └───────┘ (1 row)
Now we have to find another “best pick” for a 2nd word. There will be many strategies here, and I don’t pretend mine is best. After all, this article is just to give some tricks in SQL.
It doesn’t have any of A or S, but has at least one I, one R and one E.
Let’s find a word with the 3 next most common found letters in those 92 words.
with lettres as ( select chr(generate_series(97,122)) lettre), words as (select mot from w5 where mot ~ all(array['i','r','e']) and mot !~ all(array['a','s']) and c2 <> 'i' and c3 <> 'r' and c4 <> 'e' ) select lettre, count(*) from words, lettres where mot ~ lettre group by 1 order by 2 desc limit 10; ┌────────┬───────┐ │ lettre │ count │ ├────────┼───────┤ │ e │ 92 │ │ r │ 92 │ │ i │ 92 │ │ t │ 22 │ │ o │ 17 │ │ u │ 14 │ │ m │ 14 │ │ n │ 14 │ │ f │ 13 │ │ g │ 12 │ └────────┴───────┘ (10 rows)
Let’s pick a word containing our I, E, R and also T and O :
select mot from wordle where mot ~* all(array['i','e','r','t','o']); ┌───────┐ │ mot │ ├───────┤ │ ortie │ │ riote │ │ rotie │ └───────┘ (3 rows)
I could do more stats to find which is better to chose.
We see right away 2nd word “RIOTE” can’t be the word we’re searching, since it has “I” in 2nd position we know it’s impossible. So the choice has to be between ORTIE and ROTIE. Let’s choose ORTIE then...
2nd iteration results
So, we found another letter in the word : O.
We find also the right place of “at least one” E in the last position.
Let’s refine the 1st iteration query now we have another letter O. And now we know at least one E has to be in 5th position, and that there’s no A, S or T:
select mot from w5 where mot ~ all(array['o','i','r','e']) and mot !~ all(array['a','s','t']) and c2 not in ('i','r') and c3 <> 'r' and c4 not in ('e','i') and c5 = 'e' ; ┌───────┐ │ mot │ ├───────┤ │ boire │ │ foire │ │ moire │ │ noire │ │ poire │ │ roide │ │ voire │ └───────┘ (7 rows)
Now the choice is reduced to 7 possibilities. That’s quite unlucky: we know for sure the solution has to end with “OIRE”… So basically we’ll have to choose one letter from B, F, M, N P, R or V.
Let’s try to keep focused on the method: what’s the most common letter in words that have O, I, R and E in the word (not only ending with OIRE!) in French, without any A, T or S in it?
We will get rid of the “ending with E” to cover more words, so we have better stats.
with lettres as ( select chr(generate_series(97,122)) lettre), words as (select mot from w5 where mot ~ all(array['o','i','r','e']) and mot !~ all(array['a','t','s']) ) select lettre, count(*) from words, lettres where mot ~ lettre group by 1 order by 2 desc limit 10; ┌────────┬───────┐ │ lettre │ count │ ├────────┼───────┤ │ i │ 20 │ │ o │ 20 │ │ e │ 20 │ │ r │ 20 │ │ b │ 5 │ │ v │ 2 │ │ d │ 2 │ │ g │ 2 │ │ n │ 2 │ │ p │ 2 │ └────────┴───────┘ (10 rows)
So that’s B !
Let’s add that B on OIRE, to have the word of 3rd iteration : BOIRE (means “to drink”).
3rd iteration result
So about 7 possibilities, we picked the most often letter in words having any of O, I, R and E, and not having any A, T and S. That letter was B, and so we found the right word on the fly.
This article has only one purpose and it wasn’t about cheating. I encourage you not to because cheating ruins any game.
It’s about showing you a bit of some more advanced techniques about pattern matching in PostgreSQL.
Note that I could have written queries without the need of cutting words in letters using that view I created. It’s your choice. This also shows how versatile PostgreSQL is in giving you the choice of doing what bests fits and what you find more suitable or maintainable, etc.
Here’s a final example of equal results queries. One is using c1..c2 method, the other only deals with the word column:
Both queries are equivalent in terms of results, but one can figure out PostgreSQL does better with those atomic comparisons with c1..c5 than it does with “regexp” like version in terms of performance.
Obviously, this will be of interest with a huge collection of words or everything else, and doesn’t mean much on a 8000 tuples table. I really hope you had as much fun to read this article than I had writing it. And hope you were able to learn something from it!
March 4, 2022 •More by this author