!wc -l /usr/share/dict/words
!head /usr/share/dict/words
!tail /usr/share/dict/words
Having a list of most English words sitting on your computer is useful!
Often my mom calls me on Sundays to tell me the weekly National Public Radio Puzzle (almost always a word puzzle), and often I can solve it with a quick script using the words file.
It would be even cooler if there were a file that not only let your computer know which words exist, but also what they mean.
If you download some word embeddings then you'll have that magical file.
They're a way of turning words into numerical vectors, which is useful because computers are generally better at working with numbers than with words.
Two recent popular word embeddings are word2vec and GloVe (Global Vectors).
The two sets of vectors are computed in completely different ways but end up behaving very similarly, so for this post I downloaded some GloVe vectors.
Let's see how many words it contains, and what some look like:
!wc -l glove.6B/glove.6B.300d.txt
!head glove.6B/glove.6B.300d.txt | cut -c -100
!tail glove.6B/glove.6B.300d.txt | cut -c -100
So it's a lot like /usr/share/dict/words, except that every word is followed by 300 numbers.
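As a quick sanity check (a minimal sketch of my own, assuming the fields are whitespace-separated), each line should split into one word plus 300 numbers:
with open('glove.6B/glove.6B.300d.txt') as f:
    print(len(f.readline().split()))  # expect 1 word + 300 numbers = 301 fields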
We can load all of these vectors into memory like so:
import numpy as np
from tqdm import tqdm

vectors = {}
with open('glove.6B/glove.6B.300d.txt') as f:
    for line in tqdm(f, total=400000):
        # each line is "word x1 x2 ... x300"
        word, vector = line.split(maxsplit=1)
        v = np.fromstring(vector, sep=' ', dtype='float32')
        # normalize to unit length so that a dot product is a cosine similarity
        vectors[word] = v / np.linalg.norm(v)
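Since each vector was normalized to unit length as it was loaded, the dot product of two word vectors is their cosine similarity. As a quick illustration (my own addition, with hedged expectations), a related pair of words should score higher than an unrelated pair:
print(vectors['kitten'].dot(vectors['puppy']))  # related pair: expect a relatively high similarity
print(vectors['kitten'].dot(vectors['piano']))  # unrelated pair: expect a lower similarity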
Let's see what a vector looks like. Every word is represented as 300 numbers:
vectors['kitten']
Of course, we can't see 300 dimensions, so let's plot some words in 2 dimensions, squashing them down from 300 using PCA (Principal Component Analysis):
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
%matplotlib inline
plt.rcParams["figure.figsize"] = (18, 10)

def plot_words(*words, lines=False):
    # squash the 300-dimensional vectors down to 2 dimensions
    pca = PCA(n_components=2)
    xys = pca.fit_transform([vectors[w] for w in words])
    if lines:
        # treat the words as consecutive pairs and draw a line segment for each pair
        for i in range(0, len(words), 2):
            plt.plot(xys[i:i+2, 0], xys[i:i+2, 1])
    else:
        plt.scatter(*xys.T)
    for word, xy in zip(words, xys):
        plt.annotate(word, xy, fontsize=20)
    # return the fitted PCA so other vectors can be projected into the same plane
    return pca
When we plot the words, similar words end up in similar locations, which suggests the vectors have captured something about English. Cool!
plt.title('similar words have similar vectors', fontsize=20)
plot_words('stream', 'euro', 'baseball', 'mountain', 'computer', 'lake', 'yen',
'monkey', 'dog', 'basketball', 'cat', 'river', 'piano')
We can search all of the vectors to find the most similar words to a given word. (A fun way to do this is using a heap.)
import heapq

def most_similar(v, *ignore, N=1):
    # keep the N most similar words seen so far in a min-heap
    similar = []
    for word, u in vectors.items():
        if word in ignore:
            continue
        # unit vectors, so the dot product is the cosine similarity
        similarity = u.dot(v)
        if len(similar) < N:
            heapq.heappush(similar, (similarity, word))
        else:
            # add the candidate and drop the least similar of the N+1
            heapq.heappushpop(similar, (similarity, word))
    return sorted(similar, reverse=True)
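As an aside, here's a vectorized sketch of the same search (the matrix and most_similar_np names are my own, not from the original): stack every word vector into one matrix, and a single matrix-vector product computes all the similarities at once:
words = list(vectors)
matrix = np.stack([vectors[w] for w in words])  # shape: (number of words, 300)
def most_similar_np(v, *ignore, N=1):
    sims = matrix @ v           # all cosine similarities in one matrix-vector product
    order = np.argsort(-sims)   # indices from most to least similar
    results = [(float(sims[i]), words[i]) for i in order if words[i] not in ignore]
    return results[:N]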
The 10 most similar words to "piano" seem pretty reasonable. The top word is of course "piano" itself:
most_similar(vectors['piano'], N=10)
Here's where things start to get interesting. Since every word is a vector, we can do things like add and subtract them.
If we subtract "france" from "paris", we now have a vector that sort of represents "capital of".
Here are a bunch of countries and capitals plotted in 2 dimensions. The vector from each country to its capital is approximately the same "capital of" vector:
plt.title('similar relationships have similar vectors - "capital of"', fontsize=20)
plot_words('china', 'beijing',
'japan', 'tokyo',
'france', 'paris',
'russia', 'moscow',
'italy', 'rome',
'spain', 'madrid',
'greece', 'athens',
'turkey', 'ankara',
'portugal', 'lisbon', lines=True)
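We can also quantify that consistency with a quick check of my own (not in the original): the cosine similarity between two different country-to-capital offsets should be fairly high.
offset_france = vectors['paris'] - vectors['france']
offset_japan = vectors['tokyo'] - vectors['japan']
offset_france.dot(offset_japan) / (np.linalg.norm(offset_france) * np.linalg.norm(offset_japan))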
We can see a similar consistency when visualizing the vector from a comparative adjective to its superlative:
plt.title('similar relationships have similar vectors - "comparative -> superlative"', fontsize=20)
plot_words('larger', 'largest',
'smarter', 'smartest',
'happier', 'happiest',
'dumber', 'dumbest',
'angrier', 'angriest', lines=True)
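We can also turn this into an analogy query (my own addition, with a hedged expectation): add the comparative-to-superlative offset from one pair to another comparative, and its superlative should show up near the top:
most_similar(vectors['largest'] - vectors['larger'] + vectors['happier'], 'happier', N=5)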
Since the "capital of" vector seems to be so consistent, we could use it to find a capital we don't know.
Let's say we've forgotten what the capital of Poland is, but we remember that Paris is the capital of France.
Then we can just solve for x = Paris - France + Poland.
Sure enough, the most similar vector is x ≈ Warsaw!
most_similar(vectors['paris'] - vectors['france'] + vectors['poland'])
We can also visualize this vector arithmetic by taking the Paris - France vector and moving it over to Poland:
plt.title('using "capital of" relationship vector from France to find capital of Poland', fontsize=20)
pca = plot_words('china', 'beijing',
'japan', 'tokyo',
'france', 'paris',
'russia', 'moscow',
'italy', 'rome',
'spain', 'madrid',
'greece', 'athens',
'turkey', 'ankara',
'portugal', 'lisbon',
'poland', 'warsaw')
# project the relevant word vectors into the same 2-D plane as the plot above
paris, france, poland = pca.transform([vectors[x] for x in ('paris', 'france', 'poland')])
capital_of = paris - france
# draw the "capital of" vector starting at France (it ends at Paris)...
capital_of_france = capital_of + france
plt.plot([france[0], capital_of_france[0]], [france[1], capital_of_france[1]], 'b')
# ...and the same vector starting at Poland (it ends near Warsaw)
capital_of_poland = capital_of + poland
plt.plot([poland[0], capital_of_poland[0]], [poland[1], capital_of_poland[1]], 'b')
We can do the same thing to see that Queen - Woman + Man ≈ King:
most_similar(vectors['queen'] - vectors['woman'] + vectors['man'], 'queen')
So if we let Regal = Queen - Woman, then Regal + Man ≈ King.
But vector addition is commutative, so it's equivalent to let Masculine = Man - Woman, and then Masculine + Queen ≈ King.
These are equivalent by the commutativity and associativity of vector addition:
Regal + Man = (Queen - Woman) + Man = (Man - Woman) + Queen = Masculine + Queen
…which you can visualize as the two ways of traversing the parallelogram in the plot below.
This is related to the concept of a "commutative diagram" in algebra:
plt.title('commutativity: "regal man" = (queen - woman) + man = (man - woman) + queen = "masculine queen"', fontsize=20)
pca = plot_words('man', 'king',
'woman', 'queen',
'boy', 'prince',
'girl', 'princess')
# project the relevant word vectors into the same 2-D plane as the plot
queen, woman, man = pca.transform([vectors[x] for x in ('queen', 'woman', 'man')])
regal = queen - woman      # blue edges: the "regal" direction
masculine = man - woman    # green edges: the "masculine" direction
# draw both directions starting from woman...
regal_woman = regal + woman
plt.plot([woman[0], regal_woman[0]], [woman[1], regal_woman[1]], 'b')
masculine_woman = masculine + woman
plt.plot([woman[0], masculine_woman[0]], [woman[1], masculine_woman[1]], 'g')
# ...and again from man and from queen, completing the parallelogram
regal_man = regal + man
plt.plot([man[0], regal_man[0]], [man[1], regal_man[1]], 'b')
masculine_queen = masculine + queen
plt.plot([queen[0], masculine_queen[0]], [queen[1], masculine_queen[1]], 'g')
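We can also confirm the identity numerically (a quick sanity check of my own, not in the original): both orders of addition produce the same vector up to floating-point rounding.
regal_man_300d = (vectors['queen'] - vectors['woman']) + vectors['man']
masculine_queen_300d = (vectors['man'] - vectors['woman']) + vectors['queen']
np.allclose(regal_man_300d, masculine_queen_300d)  # expect True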
Another example while we're at it: Kitten - Cat + Dog ≈ Puppy:
most_similar(vectors['kitten'] - vectors['cat'] + vectors['dog'], 'kitten')
Does each of the 300 dimensions correspond to some human-friendly spectrum, such as "big <-> small" or "dark <-> light"?
We can look at example words that are evenly spaced along a dimension, but it usually ends up being pretty hard (for me at least) to tell what the dimension "means".
To keep the scan manageable, the code below uses the smaller 50-dimensional vectors, restricted to the first 10,000 words in the file.
It looks like dimension #2 is kind of like "traditional <-> modern"? But it's a bit of a stretch:
import itertools as it

N = 10000
top_vectors = {}
# the pretrained GloVe files are ordered by word frequency,
# so the first N lines are the N most common words
with open('glove.6B/glove.6B.50d.txt') as f:
    for line in tqdm(it.islice(f, N), total=N):
        word, vector = line.split(maxsplit=1)
        v = np.fromstring(vector, sep=' ', dtype='float32')
        top_vectors[word] = v / np.linalg.norm(v)
def print_dimension_examples(i, N=10):
    # the value of dimension i for each of the top words
    xs = {w: v[i] for w, v in top_vectors.items()}
    # for N evenly spaced values along that dimension, print the closest word
    for x in np.linspace(min(xs.values()), max(xs.values()), N):
        print(*min(xs.items(), key=lambda w_x: abs(x - w_x[1])))

print_dimension_examples(2, N=10)
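To go hunting for other interpretable dimensions, you could scan a few at a time (purely exploratory, my own addition; most dimensions won't have an obvious meaning):
for i in range(5):
    print('--- dimension', i, '---')
    print_dimension_examples(i, N=5)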