An introduction to Python via Wordle

When Wordle first went viral I was in a group chat at work and had the idea of making a Python tutorial based on Wordle. I never did it, but this post is that introduction. I will likely do another introduction to R post as well.

Intro

The NY Times has changed the source code for Wordle (likely in response to security concerns?), but the Web Archive still has the old information, including the source code and the JavaScript file. Thanks to this link for the info.


My idea for the tutorial is to do some analysis on the answers and the possible words. Maybe we can figure out some good starting words, find out the letter frequencies, that kind of thing.

First I want to read in the javascript file and get the word lists into variables. Somewhat surprisingly these were always available! Sitting right there in the code.


# with open() as f: is the standard context-manager idiom to open and then safely close the file
# This is the path to the file, before we publish
# 'r' to read the file
# f.read() because we are fine with one big string of contents
with open('../../../static/data/main.e65ce0a5.js', 'r') as f:
    js = f.read()

I get an error for an unexpected character:

Traceback (most recent call last):
  File "C:\Users\NATHAN~1\AppData\Local\Temp\Rtmpc1KPgv\chunk-code-7da05d317d7a.txt", line 3, in <module>
    js = f.read()
         ^^^^^^^^
  File "C:\Program Files\Python312\Lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 142756: character maps to <undefined>

When I look up position 142756 in the file, I find that it’s a backspace character, and the enter character is there just afterwards.

var s=a.detail.key;"←"===s||"Backspace"===s?e.removeLetter():"↵"===s||"Enter"===s?e.submitGuess()

Let’s try to open with different encodings. My first guess of utf-8 worked!

On review, this might be because I did not load the reticulate package. This error doesn’t come up any more.
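Before landing on utf-8 I could also have automated the guessing. Here is a minimal sketch of that idea (the `read_with_fallback` helper is my own, not from the original code, and it is demonstrated on a temporary file rather than the real source):

```python
import os
import tempfile

# Try each candidate encoding in turn; return the text and the
# first encoding that decodes the whole file without error.
def read_with_fallback(path, encodings=('utf-8', 'cp1252', 'latin-1')):
    for enc in encodings:
        try:
            with open(path, 'r', encoding=enc) as f:
                return f.read(), enc
        except UnicodeDecodeError:
            continue
    raise ValueError(f'none of {encodings} could decode {path}')

# demo on a temp file holding the arrow characters from the Wordle source;
# the UTF-8 bytes of '←' include 0x90, which cp1252 cannot map
with tempfile.NamedTemporaryFile('w', encoding='utf-8',
                                 suffix='.js', delete=False) as tmp:
    tmp.write('"←"===s||"↵"===s')
    path = tmp.name

text, enc = read_with_fallback(path, encodings=('cp1252', 'utf-8'))
print(enc)  # cp1252 fails on the 0x90 byte, so utf-8 is used
os.remove(path)
```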


with open('../../../static/data/main.e65ce0a5.js', 'r', encoding='utf-8') as f:
    js = f.read()

Now, from looking at the JavaScript file I know that the answers are contained in a variable called La and the possible words in Ta. I can also quickly search and see that La= and Ta= each appear only once in the file, so I can use a regex to extract those strings and then turn them into proper Python lists.

I also know that “shave” is the last Wordle answer, but that is some years away. Spoilers, sorry.

# import the built-in regular expression module
import re

# La= is the identifier for the string
# \[ matches a literal opening square bracket
# \] matches a literal closing square bracket
# (.*?) captures everything in between, non-greedily
# [0] because findall returns a list, so take the first element
answer_string = re.findall(r'La=\[(.*?)\]', js)[0]
possible_string = re.findall(r'Ta=\[(.*?)\]', js)[0]

ans_len = len(answer_string)
poss_len = len(possible_string)

I have long strings: answer_string is 18519 characters long, and possible_string is 85255 characters long. I need to make them lists.

Note: see this answer for displaying Python variables in R Markdown.


# the strings come along with the quotes from the Javascript, replace them with nothing
# split by comma into a list
answer_list = answer_string.replace('"', '').split(',')
possible_list = possible_string.replace('"', '').split(',')

ans_len = len(answer_list)
poss_len = len(possible_list)

Now I have lists to work with: answer_list is 2315 words long, and possible_list is 10657 words long.

Basics

These were the first ten Wordle answers, in case you are interested.

# use [:10] to get the first ten elements of a list
# indexing starts at 0 and the slice end is exclusive, so this is elements 0 through 9
answer_list[:10]
## ['cigar', 'rebut', 'sissy', 'humph', 'awake', 'blush', 'focal', 'evade', 'naval', 'serve']

Is my starting word a possible answer? Will I ever get a sacred 1?

'farts' in answer_list
## False

No. That is fair enough.
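A side note of my own, not in the original: `in` on a list scans every element, so if you plan to test many guesses, converting the list to a set makes each lookup constant time. A quick sketch using the first few answers as a stand-in:

```python
# first few answers, standing in for the full answer_list
answer_sample = ['cigar', 'rebut', 'sissy', 'humph', 'awake']

# sets hash their elements, so membership checks don't scan the whole list
answer_set = set(answer_sample)

print('cigar' in answer_set)  # True
print('farts' in answer_set)  # False
```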

Counting

So, what is a good starting word? Let's count how many times each letter appears in each position, and overall.

import collections
import pprint

ans1 = collections.Counter([a[0] for a in answer_list])
ans2 = collections.Counter([a[1] for a in answer_list])
ans3 = collections.Counter([a[2] for a in answer_list])
ans4 = collections.Counter([a[3] for a in answer_list])
ans5 = collections.Counter([a[4] for a in answer_list])
ans_all = collections.Counter(''.join(answer_list))

pprint.pp(ans_all)
## Counter({'e': 1233,
##          'a': 979,
##          'r': 899,
##          'o': 754,
##          't': 729,
##          'l': 719,
##          'i': 671,
##          's': 669,
##          'n': 575,
##          'c': 477,
##          'u': 467,
##          'y': 425,
##          'd': 393,
##          'h': 389,
##          'p': 367,
##          'm': 316,
##          'g': 311,
##          'b': 281,
##          'f': 230,
##          'k': 210,
##          'w': 195,
##          'v': 153,
##          'z': 40,
##          'x': 37,
##          'q': 29,
##          'j': 27})

E, A, R, O, T are the most popular five letters. I can’t think of any way to combine those into an actual word though.
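We can at least check a word list for such combinations mechanically. A quick sketch, run here on the first ten answers as a stand-in for the full answer_list:

```python
top5 = set('earot')

# first ten answers from above, standing in for the full answer_list
sample = ['cigar', 'rebut', 'sissy', 'humph', 'awake',
          'blush', 'focal', 'evade', 'naval', 'serve']

# set(w) <= top5 is True only when every letter of w is in the top five
only_top5 = [w for w in sample if set(w) <= top5]
print(only_top5)  # none of these ten qualify
```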

What are the most popular letters in each position?

import pandas as pd

# we need to sort the dictionaries first
ans1 = dict(sorted(ans1.items(), key=lambda item: item[1], reverse=True))
ans2 = dict(sorted(ans2.items(), key=lambda item: item[1], reverse=True))
ans3 = dict(sorted(ans3.items(), key=lambda item: item[1], reverse=True))
ans4 = dict(sorted(ans4.items(), key=lambda item: item[1], reverse=True))
ans5 = dict(sorted(ans5.items(), key=lambda item: item[1], reverse=True))

# convert dictionary items into columns
position_counts = pd.DataFrame(ans1.items(), columns=['L1', 'C1'])

# There are no words starting with X!
# add a zero entry
position_counts.loc[len(position_counts.index)] = ['x', 0]

# temp dataframes
temp_df = pd.DataFrame(ans2.items(), columns=['L2', 'C2'])
# we need to check the lengths because joining on index assumes equal lengths
len(temp_df)
## 26
# join to the counter dataframe
position_counts = position_counts.join(temp_df[['L2', 'C2']].set_axis(position_counts.index))

# I'm sure this could be more efficient but it's what I've got
temp_df = pd.DataFrame(ans3.items(), columns=['L3', 'C3'])
len(temp_df)
## 26
position_counts = position_counts.join(temp_df[['L3', 'C3']].set_axis(position_counts.index))

# position 4 is missing q
temp_df = pd.DataFrame(ans4.items(), columns=['L4', 'C4'])
len(temp_df)
## 25
temp_df
##    L4   C4
## 0   e  318
## 1   n  182
## 2   s  171
## 3   a  163
## 4   l  162
## 5   i  158
## 6   r  152
## 7   c  152
## 8   t  139
## 9   o  132
## 10  u   82
## 11  g   76
## 12  d   69
## 13  m   68
## 14  k   55
## 15  p   50
## 16  v   46
## 17  f   35
## 18  h   28
## 19  w   25
## 20  b   24
## 21  z   20
## 22  x    3
## 23  y    3
## 24  j    2
temp_df.loc[len(temp_df.index)] = ['q', 0]
position_counts = position_counts.join(temp_df[['L4', 'C4']].set_axis(position_counts.index))

# position 5 is missing j, q, v
temp_df = pd.DataFrame(ans5.items(), columns=['L5', 'C5'])
len(temp_df)
## 23
temp_df
##    L5   C5
## 0   e  424
## 1   y  364
## 2   t  253
## 3   r  212
## 4   l  156
## 5   h  139
## 6   n  130
## 7   d  118
## 8   k  113
## 9   a   64
## 10  o   58
## 11  p   56
## 12  m   42
## 13  g   41
## 14  s   36
## 15  c   31
## 16  f   26
## 17  w   17
## 18  b   11
## 19  i   11
## 20  x    8
## 21  z    4
## 22  u    1
temp_df.loc[len(temp_df.index)] = ['j', 0]
temp_df.loc[len(temp_df.index)] = ['q', 0]
temp_df.loc[len(temp_df.index)] = ['v', 0]
position_counts = position_counts.join(temp_df[['L5', 'C5']].set_axis(position_counts.index))

# there's probably a way to start with the letters and count them with collections
position_counts
##    L1   C1 L2   C2 L3   C3 L4   C4 L5   C5
## 0   s  366  a  304  a  307  e  318  e  424
## 1   c  198  o  279  i  266  n  182  y  364
## 2   b  173  r  267  o  244  s  171  t  253
## 3   t  149  e  242  e  177  a  163  r  212
## 4   p  142  i  202  u  165  l  162  l  156
## 5   a  141  l  201  r  163  i  158  h  139
## 6   f  136  u  186  n  139  r  152  n  130
## 7   g  115  h  144  l  112  c  152  d  118
## 8   d  111  n   87  t  111  t  139  k  113
## 9   m  107  t   77  s   80  o  132  a   64
## 10  r  105  p   61  d   75  u   82  o   58
## 11  l   88  w   44  g   67  g   76  p   56
## 12  w   83  c   40  m   61  d   69  m   42
## 13  e   72  m   38  p   58  m   68  g   41
## 14  h   69  y   23  b   57  k   55  s   36
## 15  v   43  d   20  c   56  p   50  c   31
## 16  o   41  b   16  v   49  v   46  f   26
## 17  n   37  s   16  y   29  f   35  w   17
## 18  i   34  v   15  w   26  h   28  b   11
## 19  u   33  x   14  f   25  w   25  i   11
## 20  q   23  g   12  x   12  b   24  x    8
## 21  k   20  k   10  k   12  z   20  z    4
## 22  j   20  f    8  z   11  x    3  u    1
## 23  y    6  q    5  h    9  y    3  j    0
## 24  z    3  j    2  j    3  j    2  q    0
## 25  x    0  z    2  q    1  q    0  v    0
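For what it's worth, the five near-identical join blocks could be collapsed into a loop; reindexing over the full alphabet also handles the missing letters automatically. A sketch on a toy word list (the real code would use answer_list):

```python
import collections
import string

import pandas as pd

words = ['cigar', 'rebut', 'sissy', 'humph', 'awake']  # stand-in for answer_list

frames = []
for i in range(5):
    counts = collections.Counter(w[i] for w in words)
    # reindex over the full alphabet so missing letters get an explicit zero
    col = pd.Series(counts).reindex(list(string.ascii_lowercase), fill_value=0)
    col = col.sort_values(ascending=False)
    frames.append(pd.DataFrame({f'L{i+1}': col.index, f'C{i+1}': col.values}))

# the frames share a default 0-25 index, so joining column-wise lines them up
position_counts = pd.concat(frames, axis=1)
print(position_counts.head())
```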

Thoughts from these counts:

  • Not that many words start with E, and it's only the fourth most popular letter in the second and third positions, but it comes in strong for the fourth and fifth.
  • Vowels take all the top spots for the third letter, but only one word ends in U.
  • It is not possible to just take the most popular letter in each position ("saaee" is not a word), and I'm not sure why I thought that might be a possibility.
  • Overall frequency suggests E, A, R, O, T, L, I, S. Combine that with S being the only consonant to top any position, and something like SLATE, STARE, or STORE is a good starting option.
  • There are many ways to weigh up a good starting word, but I'm sure there is some way of deciding the optimal one.
  • AUDIO is a good starting word for identifying vowels; follow it up with a word containing E.
  • Maybe we can loop through all the possible answers and rank them by a score: the sum of the frequencies of each letter in its position.

# combine frequencies into one list
list_of_freqs = [ans1, ans2, ans3, ans4, ans5]

# function for calculating a word's score
def create_score(word):
    score = 0
    # zip pairs each letter with the frequency counter for its position
    for letter, freq in zip(word, list_of_freqs):
        score += freq[letter]
    return score

# new dictionary with words as keys, and scores as values
word_score = {x: create_score(x) for x in answer_list}

# sort the dictionary
word_score = dict(sorted(word_score.items(), key=lambda item: item[1], reverse=True))

# print out the top 15 words
pprint.pp(list(word_score.items())[:15])
## [('slate', 1437),
##  ('sauce', 1411),
##  ('slice', 1409),
##  ('shale', 1403),
##  ('saute', 1398),
##  ('share', 1393),
##  ('sooty', 1392),
##  ('shine', 1382),
##  ('suite', 1381),
##  ('crane', 1378),
##  ('saint', 1371),
##  ('soapy', 1366),
##  ('shone', 1360),
##  ('shire', 1352),
##  ('saucy', 1351)]

How do my suggestions hold up? SLATE is number one!

print('stare: ' + str(word_score['stare']))
## stare: 1326
print('store: ' + str(word_score['store']))
## store: 1263

STARE holds up reasonably well but STORE is lower down. I'm also kind of pleased that the top words don't simply reuse the overall frequency leaders: we often have to dip into C, H, and N.
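One tweak I'd consider (my own variation, not in the original scoring): repeated letters inflate the positional score while revealing less information, so we could count each distinct letter only once, at its best position. A sketch on a toy list standing in for answer_list:

```python
import collections

# toy stand-in for answer_list; the real lists come from the parsing above
words = ['slate', 'sooty', 'stare', 'eerie']

freqs = [collections.Counter(w[i] for w in words) for i in range(5)]

def create_score(word):
    # positional score, as in the original function
    return sum(f[ch] for ch, f in zip(word, freqs))

def unique_score(word):
    # count each distinct letter once, at its best position,
    # so repeated letters aren't rewarded
    best = {}
    for ch, f in zip(word, freqs):
        best[ch] = max(best.get(ch, 0), f[ch])
    return sum(best.values())

for w in words:
    print(w, create_score(w), unique_score(w))
```

On this toy list, "sooty" and "eerie" drop noticeably under the unique-letter score while "slate" is unaffected, which is the behaviour we want from a first guess.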

Combos

Maybe we could look at letter combinations? SH is a popular starting pair. We can't combine these the same way as individual letters though: the first-and-second pair overlaps the second-and-third pair, which overlaps the third-and-fourth, and so on, so they would need to be combined in a cohesive way.

starter = collections.Counter([a[:2] for a in answer_list])
finisher = collections.Counter([a[3:] for a in answer_list])

print("most common starting combinations")
## most common starting combinations
pprint.pp(starter.most_common(15))
## [('st', 65),
##  ('sh', 52),
##  ('cr', 45),
##  ('sp', 45),
##  ('ch', 40),
##  ('gr', 38),
##  ('re', 36),
##  ('fl', 36),
##  ('tr', 36),
##  ('br', 35),
##  ('ma', 34),
##  ('bl', 32),
##  ('ca', 32),
##  ('cl', 31),
##  ('mo', 29)]
print("most common ending combinations")
## most common ending combinations
pprint.pp(finisher.most_common(15))
## [('er', 141),
##  ('ch', 58),
##  ('ly', 56),
##  ('se', 52),
##  ('al', 50),
##  ('ck', 47),
##  ('ty', 46),
##  ('te', 39),
##  ('el', 38),
##  ('dy', 38),
##  ('ng', 38),
##  ('ge', 38),
##  ('ve', 37),
##  ('nt', 37),
##  ('th', 36)]

ER is vastly more popular than any other starting or ending combination. It's interesting to see some less popular letters make an appearance at the start, like GR and FL.

Let’s try again with more letters!

starter = collections.Counter([a[:3] for a in answer_list])
finisher = collections.Counter([a[2:] for a in answer_list])

print("most common starting combinations")
## most common starting combinations
pprint.pp(starter.most_common(15))
## [('sta', 19),
##  ('sha', 18),
##  ('sto', 16),
##  ('gra', 15),
##  ('cha', 13),
##  ('cra', 12),
##  ('spi', 12),
##  ('sho', 12),
##  ('pri', 11),
##  ('bra', 11),
##  ('cre', 11),
##  ('tra', 10),
##  ('bri', 10),
##  ('ste', 10),
##  ('gro', 10)]
print("most common ending combinations")
## most common ending combinations
pprint.pp(finisher.most_common(15))
## [('ing', 23),
##  ('lly', 22),
##  ('tch', 18),
##  ('ter', 16),
##  ('ack', 15),
##  ('nch', 14),
##  ('tty', 14),
##  ('ver', 14),
##  ('rry', 13),
##  ('unt', 13),
##  ('ash', 13),
##  ('ank', 12),
##  ('dge', 12),
##  ('ate', 11),
##  ('ide', 11)]
starter = collections.Counter([a[:4] for a in answer_list])
finisher = collections.Counter([a[1:] for a in answer_list])

print("most common starting combinations")
## most common starting combinations
pprint.pp(starter.most_common(15))
## [('basi', 4),
##  ('stee', 4),
##  ('stor', 4),
##  ('brin', 4),
##  ('shar', 4),
##  ('stea', 4),
##  ('scal', 4),
##  ('cove', 4),
##  ('mang', 4),
##  ('shee', 4),
##  ('spoo', 4),
##  ('stin', 3),
##  ('stoo', 3),
##  ('fort', 3),
##  ('star', 3)]
print("most common ending combinations")
## most common ending combinations
pprint.pp(finisher.most_common(15))
## [('ight', 9),
##  ('ound', 8),
##  ('ower', 7),
##  ('atch', 7),
##  ('atty', 6),
##  ('aunt', 6),
##  ('aste', 6),
##  ('illy', 6),
##  ('usty', 5),
##  ('ying', 5),
##  ('each', 5),
##  ('ater', 5),
##  ('ully', 5),
##  ('ough', 5),
##  ('andy', 5)]

The ending combinations are definitely the ones to watch out for. There’s the infamous IGHT but also OUND, OWER, and ATCH.
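To quantify how crowded an ending is, we can pull out every answer that shares it (the table above puts the IGHT family at nine answers). A sketch on a toy list:

```python
# toy stand-in for answer_list
sample = ['light', 'night', 'fight', 'sound', 'found', 'tower']

def ending_family(words, suffix):
    # the words still in play once the last letters are all green
    return [w for w in words if w.endswith(suffix)]

print(ending_family(sample, 'ight'))
print(ending_family(sample, 'ound'))
```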

I wish the Twitter API was still available to see how many fails there are on days where those combinations occur.

Conclusions

Thanks to the Web Archive, I was able to get the old version of the Wordle code and do some exploration to try to find a good starting word. Python list comprehensions are cool; the collections module was super helpful; I should definitely think about combining data structures instead of keeping five separate dictionaries; functions can be easy; and SLATE is a good starting word.