When Wordle first went viral I was in a group chat at work and had the idea of making a Python tutorial based on Wordle. I never did it, but this post is that introduction. I will likely do another introduction to R post as well.
The NY Times has changed the source code for Wordle (likely in response to security concerns), but the Web Archive still has the old information, including the source code and the JavaScript file. Thanks to this link for the info.
My idea for the tutorial is to do some analysis on the answers and the possible words. Maybe we can figure out some good starting words, find out the letter frequencies, that kind of thing.
First I want to read in the JavaScript file and get the word lists into variables. Somewhat surprisingly, these were always available, sitting right there in the code!
# with open() as f: is the standard context manager pattern to open and then automatically close the file
# This is the path to the file, before we publish
# 'r' to read the file
# f.read() because we are fine with one big string of contents
with open('../../../static/data/main.e65ce0a5.js', 'r') as f:
    js = f.read()
I get an error for an unexpected character:
Traceback (most recent call last):
  File "C:\Users\NATHAN~1\AppData\Local\Temp\Rtmpc1KPgv\chunk-code-7da05d317d7a.txt", line 3, in <module>
    js = f.read()
         ^^^^^^^^
  File "C:\Program Files\Python312\Lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 142756: character maps to <undefined>
When I look up position 142756 in the file, I find that it’s a backspace character, and the enter character is there just afterwards.
var s=a.detail.key;"←"===s||"Backspace"===s?e.removeLetter():"↵"===s||"Enter"===s?e.submitGuess()
Let’s try to open with different encodings. My first guess of utf-8 worked!
On review, this might be because I did not load the reticulate package. This error doesn’t come up any more.
with open('../../../static/data/main.e65ce0a5.js', 'r', encoding='utf-8') as f:
    js = f.read()
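For the curious, the offending byte makes sense: that "←" arrow is U+2190, which UTF-8 encodes as the three bytes E2 86 90, and 0x90 on its own is one of the undefined bytes in cp1252 (Python's default encoding on Windows). A small sketch reproducing the clash:

```python
# the UTF-8 bytes for '"←"' as they appear in the JS file
raw = b'"\xe2\x86\x90"===s'

# cp1252 maps bytes one-to-one, but 0x90 has no character assigned
try:
    raw.decode('cp1252')
except UnicodeDecodeError as e:
    print('cp1252 failed on byte', hex(raw[e.start]))  # cp1252 failed on byte 0x90

# utf-8 recognises E2 86 90 as a single three-byte character
print(raw.decode('utf-8'))  # "←"===s
```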
Now, from looking at the JavaScript file I know that the answers are contained in a variable called La and the possible words in Ta. I can also quickly search and see that La= and Ta= each appear only once in the file, so I can use a regex to extract those strings and then turn them into proper Python lists.
I also know that “shave” is the last Wordle answer, but that is some years away. Spoilers, sorry.
# import the built-in re module for regular expressions
import re
# La= is the identifier for the answers string
# \[ matches a literal open square bracket
# \] matches a literal close square bracket
# (.*?) captures zero or more of anything, non-greedily
# [0] because findall returns a list, so take the first element
answer_string = re.findall(r'La=\[(.*?)\]', js)[0]
possible_string = re.findall(r'Ta=\[(.*?)\]', js)[0]
ans_len = len(answer_string)
poss_len = len(possible_string)
I have long strings: answer_string is 18519 characters long, and possible_string is 85255 characters long. I need to make them lists.
Note: displaying Python variables inline in R Markdown comes from this answer.
# the strings come along with the quotes from the Javascript, replace them with nothing
# split by comma into a list
answer_list = answer_string.replace('"', '').split(',')
possible_list = possible_string.replace('"', '').split(',')
ans_len = len(answer_list)
poss_len = len(possible_list)
Now I have lists to work with: answer_list is 2315 words long, and possible_list is 10657 words long.
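As an aside, the bracketed text is a valid JSON array, so json.loads could do the quote-stripping and comma-splitting in one step. A sketch, using a toy stand-in for the real file contents (assuming the same La=[...] shape):

```python
import json
import re

# a toy stand-in for the JS source (assumption: same La=[...] shape as the real file)
js = 'var La=["cigar","rebut","sissy"],Ta=["aahed","aalii"]'

# re-wrap the captured contents in brackets and let the JSON parser do the rest
answer_list = json.loads('[' + re.findall(r'La=\[(.*?)\]', js)[0] + ']')
print(answer_list)  # ['cigar', 'rebut', 'sissy']
```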
These were the first ten Wordle answers, in case you are interested.
# use [:10] to get the first ten elements from a list
# indexes start at 0, and the slice's end index is not inclusive
answer_list[:10]
## ['cigar', 'rebut', 'sissy', 'humph', 'awake', 'blush', 'focal', 'evade', 'naval', 'serve']
Is my starting word a possible answer? Will I ever get a sacred 1?
'farts' in answer_list
## False
No. That is fair enough.
So, what is a good starting word? Let's count the number of letters in each position, and the number overall.
import collections
import pprint
ans1 = collections.Counter([a[0] for a in answer_list])
ans2 = collections.Counter([a[1] for a in answer_list])
ans3 = collections.Counter([a[2] for a in answer_list])
ans4 = collections.Counter([a[3] for a in answer_list])
ans5 = collections.Counter([a[4] for a in answer_list])
ans_all = collections.Counter(''.join(answer_list))
pprint.pp(ans_all)
## Counter({'e': 1233,
## 'a': 979,
## 'r': 899,
## 'o': 754,
## 't': 729,
## 'l': 719,
## 'i': 671,
## 's': 669,
## 'n': 575,
## 'c': 477,
## 'u': 467,
## 'y': 425,
## 'd': 393,
## 'h': 389,
## 'p': 367,
## 'm': 316,
## 'g': 311,
## 'b': 281,
## 'f': 230,
## 'k': 210,
## 'w': 195,
## 'v': 153,
## 'z': 40,
## 'x': 37,
## 'q': 29,
## 'j': 27})
E, A, R, O, T are the five most popular letters overall. I can't think of any way to combine those into an actual word though.
What are the most popular letters in each position?
import pandas as pd
# we need to sort the dictionaries first
ans1 = dict(sorted(ans1.items(), key=lambda item: item[1], reverse=True))
ans2 = dict(sorted(ans2.items(), key=lambda item: item[1], reverse=True))
ans3 = dict(sorted(ans3.items(), key=lambda item: item[1], reverse=True))
ans4 = dict(sorted(ans4.items(), key=lambda item: item[1], reverse=True))
ans5 = dict(sorted(ans5.items(), key=lambda item: item[1], reverse=True))
# convert dictionary items into columns
position_counts = pd.DataFrame(ans1.items(), columns=['L1', 'C1'])
# There are no words starting with X!
# add a zero entry
position_counts.loc[len(position_counts.index)] = ['x', 0]
# temp dataframes
temp_df = pd.DataFrame(ans2.items(), columns=['L2', 'C2'])
# we need to check the lengths because joining on index assumes equal lengths
len(temp_df)
## 26
# join to the counter dataframe
position_counts = position_counts.join(temp_df[['L2', 'C2']].set_axis(position_counts.index))
# i'm sure this could be more efficient but it's what I've got
temp_df = pd.DataFrame(ans3.items(), columns=['L3', 'C3'])
len(temp_df)
## 26
position_counts = position_counts.join(temp_df[['L3', 'C3']].set_axis(position_counts.index))
# position 4 is missing q
temp_df = pd.DataFrame(ans4.items(), columns=['L4', 'C4'])
len(temp_df)
## 25
temp_df
## L4 C4
## 0 e 318
## 1 n 182
## 2 s 171
## 3 a 163
## 4 l 162
## 5 i 158
## 6 r 152
## 7 c 152
## 8 t 139
## 9 o 132
## 10 u 82
## 11 g 76
## 12 d 69
## 13 m 68
## 14 k 55
## 15 p 50
## 16 v 46
## 17 f 35
## 18 h 28
## 19 w 25
## 20 b 24
## 21 z 20
## 22 x 3
## 23 y 3
## 24 j 2
temp_df.loc[len(temp_df.index)] = ['q', 0]
position_counts = position_counts.join(temp_df[['L4', 'C4']].set_axis(position_counts.index))
# position 5 is missing j, q, v
temp_df = pd.DataFrame(ans5.items(), columns=['L5', 'C5'])
len(temp_df)
## 23
temp_df
## L5 C5
## 0 e 424
## 1 y 364
## 2 t 253
## 3 r 212
## 4 l 156
## 5 h 139
## 6 n 130
## 7 d 118
## 8 k 113
## 9 a 64
## 10 o 58
## 11 p 56
## 12 m 42
## 13 g 41
## 14 s 36
## 15 c 31
## 16 f 26
## 17 w 17
## 18 b 11
## 19 i 11
## 20 x 8
## 21 z 4
## 22 u 1
temp_df.loc[len(temp_df.index)] = ['j', 0]
temp_df.loc[len(temp_df.index)] = ['q', 0]
temp_df.loc[len(temp_df.index)] = ['v', 0]
position_counts = position_counts.join(temp_df[['L5', 'C5']].set_axis(position_counts.index))
# there's probably a way to start with the letters and count them with collections
position_counts
## L1 C1 L2 C2 L3 C3 L4 C4 L5 C5
## 0 s 366 a 304 a 307 e 318 e 424
## 1 c 198 o 279 i 266 n 182 y 364
## 2 b 173 r 267 o 244 s 171 t 253
## 3 t 149 e 242 e 177 a 163 r 212
## 4 p 142 i 202 u 165 l 162 l 156
## 5 a 141 l 201 r 163 i 158 h 139
## 6 f 136 u 186 n 139 r 152 n 130
## 7 g 115 h 144 l 112 c 152 d 118
## 8 d 111 n 87 t 111 t 139 k 113
## 9 m 107 t 77 s 80 o 132 a 64
## 10 r 105 p 61 d 75 u 82 o 58
## 11 l 88 w 44 g 67 g 76 p 56
## 12 w 83 c 40 m 61 d 69 m 42
## 13 e 72 m 38 p 58 m 68 g 41
## 14 h 69 y 23 b 57 k 55 s 36
## 15 v 43 d 20 c 56 p 50 c 31
## 16 o 41 b 16 v 49 v 46 f 26
## 17 n 37 s 16 y 29 f 35 w 17
## 18 i 34 v 15 w 26 h 28 b 11
## 19 u 33 x 14 f 25 w 25 i 11
## 20 q 23 g 12 x 12 b 24 x 8
## 21 k 20 k 10 k 12 z 20 z 4
## 22 j 20 f 8 z 11 x 3 u 1
## 23 y 6 q 5 h 9 y 3 j 0
## 24 z 3 j 2 j 3 j 2 q 0
## 25 x 0 z 2 q 1 q 0 v 0
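On the comment in the code above about a more direct route: there is one. A Counter returns 0 for any missing key, so iterating over string.ascii_lowercase builds the whole table in one pass, zero rows included, with no manual patching for x, q, j, or v. A sketch with a toy answer list:

```python
import collections
import string

import pandas as pd

# toy stand-in for answer_list
answers = ['cigar', 'rebut', 'sissy', 'humph', 'awake']

# one Counter per letter position, built in a single comprehension
counters = [collections.Counter(w[i] for w in answers) for i in range(5)]

# Counter[letter] is 0 for letters never seen, so no manual zero rows are needed
letters = list(string.ascii_lowercase)
position_counts = pd.DataFrame(
    {f'C{i + 1}': [counters[i][l] for l in letters] for i in range(5)},
    index=letters,
)
print(position_counts.loc[['s', 'x']])
```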
Thoughts from these counts: S dominates position 1, E dominates position 5, and A is strong in the middle. Let's use them to score every answer word, summing the frequency of each letter in its position.
# combine frequencies into one list
list_of_freqs = [ans1, ans2, ans3, ans4, ans5]
# function for calculating score
def create_score(word):
    score = 0
    for w, s in zip(word, list_of_freqs):
        score += s[w]
    return score
# new dictionary with words as keys, and scores as values
word_score = {x: create_score(x) for x in answer_list}
# sort the dictionary
word_score = dict(sorted(word_score.items(), key=lambda item: item[1], reverse=True))
# print out the top 15 words
pprint.pp(list(word_score.items())[:15])
## [('slate', 1437),
## ('sauce', 1411),
## ('slice', 1409),
## ('shale', 1403),
## ('saute', 1398),
## ('share', 1393),
## ('sooty', 1392),
## ('shine', 1382),
## ('suite', 1381),
## ('crane', 1378),
## ('saint', 1371),
## ('soapy', 1366),
## ('shone', 1360),
## ('shire', 1352),
## ('saucy', 1351)]
How do my suggestions hold up? SLATE is number one!
print('stare: ' + str(word_score['stare']))
## stare: 1326
print('store: ' + str(word_score['store']))
## store: 1263
STARE is reasonable but STORE is lower down. I'm also kind of pleased that there aren't many words built purely from the overall frequency leaders; it's not that I can't think of any, but we often have to dip into c, h, and n.
Maybe we could do letter combinations? SH is a popular starting combo. We can't combine these in the same way as individual letters though: the counts for letters one and two, two and three, three and four, and four and five overlap, so they need to be combined in a cohesive way.
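One cohesive way to combine the overlapping pairs is to just sum them anyway, accepting that middle letters get counted twice. A sketch with a toy word list (run it against answer_list for real scores):

```python
import collections

# toy stand-in for answer_list
answers = ['slate', 'shale', 'share', 'stale']

# one Counter per bigram position: letters 1-2, 2-3, 3-4, 4-5
bigram_counts = [collections.Counter(w[i:i + 2] for w in answers) for i in range(4)]

def bigram_score(word):
    # sum the positional frequency of each overlapping bigram;
    # middle letters are double-counted, a deliberate simplification
    return sum(bigram_counts[i][word[i:i + 2]] for i in range(4))

print(bigram_score('slate'))  # 4
print(bigram_score('shale'))  # 8
```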
starter = collections.Counter([a[:2] for a in answer_list])
finisher = collections.Counter([a[3:] for a in answer_list])
print("most common starting combinations")
## most common starting combinations
pprint.pp(starter.most_common(15))
## [('st', 65),
## ('sh', 52),
## ('cr', 45),
## ('sp', 45),
## ('ch', 40),
## ('gr', 38),
## ('re', 36),
## ('fl', 36),
## ('tr', 36),
## ('br', 35),
## ('ma', 34),
## ('bl', 32),
## ('ca', 32),
## ('cl', 31),
## ('mo', 29)]
print("most common ending combinations")
## most common ending combinations
pprint.pp(finisher.most_common(15))
## [('er', 141),
## ('ch', 58),
## ('ly', 56),
## ('se', 52),
## ('al', 50),
## ('ck', 47),
## ('ty', 46),
## ('te', 39),
## ('el', 38),
## ('dy', 38),
## ('ng', 38),
## ('ge', 38),
## ('ve', 37),
## ('nt', 37),
## ('th', 36)]
ER is vastly more popular than any other starting or ending combination. It's interesting to see not-so-popular letters make an appearance at the start, like GR and FL.
Let’s try again with more letters!
starter = collections.Counter([a[:3] for a in answer_list])
finisher = collections.Counter([a[2:] for a in answer_list])
print("most common starting combinations")
## most common starting combinations
pprint.pp(starter.most_common(15))
## [('sta', 19),
## ('sha', 18),
## ('sto', 16),
## ('gra', 15),
## ('cha', 13),
## ('cra', 12),
## ('spi', 12),
## ('sho', 12),
## ('pri', 11),
## ('bra', 11),
## ('cre', 11),
## ('tra', 10),
## ('bri', 10),
## ('ste', 10),
## ('gro', 10)]
print("most common ending combinations")
## most common ending combinations
pprint.pp(finisher.most_common(15))
## [('ing', 23),
## ('lly', 22),
## ('tch', 18),
## ('ter', 16),
## ('ack', 15),
## ('nch', 14),
## ('tty', 14),
## ('ver', 14),
## ('rry', 13),
## ('unt', 13),
## ('ash', 13),
## ('ank', 12),
## ('dge', 12),
## ('ate', 11),
## ('ide', 11)]
starter = collections.Counter([a[:4] for a in answer_list])
finisher = collections.Counter([a[1:] for a in answer_list])
print("most common starting combinations")
## most common starting combinations
pprint.pp(starter.most_common(15))
## [('basi', 4),
## ('stee', 4),
## ('stor', 4),
## ('brin', 4),
## ('shar', 4),
## ('stea', 4),
## ('scal', 4),
## ('cove', 4),
## ('mang', 4),
## ('shee', 4),
## ('spoo', 4),
## ('stin', 3),
## ('stoo', 3),
## ('fort', 3),
## ('star', 3)]
print("most common ending combinations")
## most common ending combinations
pprint.pp(finisher.most_common(15))
## [('ight', 9),
## ('ound', 8),
## ('ower', 7),
## ('atch', 7),
## ('atty', 6),
## ('aunt', 6),
## ('aste', 6),
## ('illy', 6),
## ('usty', 5),
## ('ying', 5),
## ('each', 5),
## ('ater', 5),
## ('ully', 5),
## ('ough', 5),
## ('andy', 5)]
The ending combinations are definitely the ones to watch out for. There’s the infamous IGHT but also OUND, OWER, and ATCH.
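To quantify that worry, a quick sketch counting how many words share one of those four-letter tails (toy list here; run it against answer_list for the real figure):

```python
# the four-letter endings flagged above
risky = {'ight', 'ound', 'ower', 'atch'}

# toy stand-in for answer_list
answers = ['night', 'light', 'found', 'tower', 'slate']

# a five-letter word's last four letters are word[1:]
exposed = sum(a[1:] in risky for a in answers)
print(exposed)  # 4
```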
I wish the Twitter API was still available to see how many fails there are on days where those combinations occur.
Thanks to the Web Archive, I was able to get the old version of the Wordle code and do some exploration to find a good starting word. Python list comprehensions are cool; the collections module was super helpful; I should definitely think about combining data structures instead of having five different dictionaries; functions can be easy; SLATE is a good starting word.