Fooling Around with Word Embeddings

/usr/share/dict/words

There's a file on Unix-like systems called words, which is just a big list of English words, one per line.

Let's see how many words it contains, and look at some examples from the beginning and end:

In [3]:
!wc -l /usr/share/dict/words
!head /usr/share/dict/words
!tail /usr/share/dict/words
  235886 /usr/share/dict/words
A
a
aa
aal
aalii
aam
Aani
aardvark
aardwolf
Aaron
zymotoxic
zymurgy
Zyrenian
Zyrian
Zyryan
zythem
Zythia
zythum
Zyzomys
Zyzzogeton

Having a list of most English words sitting on your computer is useful!

My mom often calls me on Sundays to tell me the weekly National Public Radio puzzle (almost always a word puzzle), and I can often solve it with a quick script over the words file, like the sketch below.
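For example, take a made-up puzzle in the NPR style (not an actual puzzle): find the words whose vowels, read left to right, spell exactly "aeiou", like "abstemious". A few lines of Python over the words file answer it:

# Hypothetical puzzle: words whose vowels, read left to right,
# spell exactly "aeiou" (e.g. "abstemious", "facetious").
with open('/usr/share/dict/words') as f:
    for word in f:
        word = word.strip().lower()
        if ''.join(c for c in word if c in 'aeiou') == 'aeiou':
            print(word)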

/usr/share/dict/vectors (should be a thing)

It would be even cooler if there were a file that not only let your computer know which words exist, but also what they mean.

If you download some word embeddings, then you'll have that magical file.

They're a way of turning words into numerical vectors, which is useful because computers are much better with numbers than with words.

Two popular recent word embeddings are word2vec and GloVe (Global Vectors for Word Representation).

The two methods compute their vectors in completely different ways, but the resulting vectors behave very similarly, so for this post I downloaded some pretrained GloVe vectors.
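If you want to follow along, the pretrained vectors are distributed from the GloVe project page at https://nlp.stanford.edu/projects/glove/. Assuming the glove.6B.zip archive is still hosted at the usual path, something like this fetches and unpacks it:

!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip -d glove.6B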

Let's see how many words it contains, and what some look like:

In [8]:
!wc -l glove.6B/glove.6B.300d.txt
!head glove.6B/glove.6B.300d.txt | cut -c -100
!tail glove.6B/glove.6B.300d.txt | cut -c -100
  400000 glove.6B/glove.6B.300d.txt
the 0.04656 0.21318 -0.0074364 -0.45854 -0.035639 0.23643 -0.28836 0.21521 -0.13486 -1.6413 -0.26091
, -0.25539 -0.25723 0.13169 -0.042688 0.21817 -0.022702 -0.17854 0.10756 0.058936 -1.3854 0.58509 0.
. -0.12559 0.01363 0.10306 -0.10123 0.098128 0.13627 -0.10721 0.23697 0.3287 -1.6785 0.22393 0.12409
of -0.076947 -0.021211 0.21271 -0.72232 -0.13988 -0.12234 -0.17521 0.12137 -0.070866 -1.5721 -0.2246
to -0.25756 -0.057132 -0.6719 -0.38082 -0.36421 -0.082155 -0.010955 -0.082047 0.46056 -1.8477 -0.112
and 0.038466 -0.039792 0.082747 -0.38923 -0.21431 0.1702 -0.025657 0.09578 0.2386 -1.6342 0.14332 -0
in -0.44399 0.12817 -0.25247 -0.18582 -0.16614 0.25909 -0.22678 -0.069229 -0.077204 -1.5814 0.10753
a -0.29712 0.094049 -0.096662 -0.344 -0.18483 -0.12329 -0.11656 -0.099692 0.17265 -1.6386 0.1022 0.0
" 0.6947 0.22184 0.10526 0.012382 -0.2558 -0.32645 -0.48287 0.51755 -0.0872 -2.0289 0.35021 0.045363
's -0.001272 0.36514 -0.077363 -0.26559 0.17987 0.15347 -0.15338 0.43267 -0.13364 -1.716 0.069153 -0
sigarms 0.14649 -0.47266 0.17144 0.26431 -0.13895 -0.20788 0.41624 0.078204 0.10015 1.1079 0.18251 -
katuna -0.030013 0.24626 0.068192 0.089033 -0.19977 -0.92317 0.41307 -0.49583 0.4965 0.38058 -0.4684
aqm 0.46348 -0.42811 0.4575 0.25317 0.58327 -0.3598 0.36049 -0.16522 -0.27769 0.52559 -0.083879 -0.0
1.3775 0.71376 -0.56625 -0.18468 0.30104 -0.56443 -0.0068945 -0.31358 -0.35351 0.40245 0.90999 -0.43
corythosaurus 0.88649 -0.095745 0.18961 0.012919 -0.40925 -0.17462 0.20691 0.038473 0.041227 0.7609
chanty 0.3927 -0.022505 0.30458 0.18799 0.14118 0.72403 -0.25781 -0.13729 -0.016521 0.59596 -0.11014
kronik 0.13679 -0.13909 -0.36089 0.079864 0.32149 0.26387 -0.1099 0.04442 0.083869 0.79133 0.33604 -
rolonda 0.075713 -0.040502 0.18345 0.5123 -0.22856 0.83911 0.17878 -0.71301 0.3269 0.69535 0.19446 -
zsombor 0.81451 -0.36221 0.31186 0.81381 0.18852 -0.3136 0.82784 0.29656 -0.085519 0.47597 0.23528 0
sandberger 0.429191 -0.296897 0.15011 0.245201 -0.00352027 -0.0576971 0.1409 -0.222294 0.221153 0.76

So it's a lot like /usr/share/dict/words except that every word is followed by 300 numbers.

We can load all of these vectors into memory like so, normalizing each one to unit length (so that, later on, a dot product between two vectors gives their cosine similarity):

In [1]:
import numpy as np
from tqdm import tqdm

vectors = {}
with open('glove.6B/glove.6B.300d.txt') as f:
    for line in tqdm(f, total=400000):
        # Each line is a word followed by 300 space-separated floats.
        word, vector = line.split(maxsplit=1)
        v = np.fromstring(vector, sep=' ', dtype='float32')
        # Normalize to unit length, so a dot product is a cosine similarity.
        vectors[word] = v / np.linalg.norm(v)
100%|██████████| 400000/400000 [00:47<00:00, 8356.88it/s]
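As a quick sanity check (a sketch, assuming the load above succeeded), we should have one unit-length, 300-dimensional vector per word:

assert len(vectors) == 400000          # one entry per line of the file
assert vectors['the'].shape == (300,)  # 300 dimensions each
assert abs(np.linalg.norm(vectors['the']) - 1) < 1e-5  # unit length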

Example

Let's see what a vector looks like. Every word is represented as 300 numbers:

In [89]:
vectors['kitten']
Out[89]:
array([-0.03189607,  0.00502735, -0.03142706, -0.09080496, -0.01372096,
       -0.00790456,  0.030317  , -0.07247453, -0.04573111,  0.05332294,
        0.07288674, -0.08122185,  0.00484673, -0.01537791,  0.06188687,
        0.03534793, -0.04661558, -0.01694448, -0.04042754,  0.00050012,
        0.03243811,  0.11682782, -0.02655192,  0.06118741, -0.07047677,
       -0.0037469 ,  0.02992264,  0.03477181,  0.02501019,  0.01804155,
       -0.05208955, -0.09751231, -0.03454461, -0.04255676,  0.07700237,
        0.04495537, -0.05921724,  0.01425699,  0.00727082,  0.08407649,
        0.00414386,  0.0257486 ,  0.0416804 , -0.09127396, -0.06493788,
       -0.01052713,  0.11504428, -0.04301928, -0.04149053, -0.01986242,
       -0.05381305, -0.05122618,  0.00951202,  0.07270174,  0.015696  ,
       -0.00185138, -0.02895541,  0.10035073, -0.01022527, -0.03901888,
       -0.0286357 , -0.00038816, -0.06632058,  0.00219413,  0.04281642,
       -0.10785005,  0.11508809, -0.01601003, -0.01545094, -0.02594334,
       -0.06376292,  0.05213661, -0.10611357, -0.01170696,  0.03170781,
        0.09749608,  0.04564185,  0.03465658,  0.03528626, -0.01317632,
       -0.0658402 ,  0.06708658,  0.02708261, -0.00941676,  0.05004472,
       -0.09964802, -0.06706873,  0.09575149,  0.03073084, -0.0779128 ,
       -0.10904612, -0.00137211,  0.01427436,  0.03249491, -0.08582757,
       -0.06214815,  0.04685901, -0.01529872, -0.02868276, -0.03035758,
        0.01193076,  0.03669654, -0.02238437, -0.00971942, -0.02112015,
       -0.12391331, -0.02551491, -0.02295725, -0.06453866, -0.01971311,
        0.10288567,  0.11489497, -0.0503693 ,  0.0644429 ,  0.02602124,
        0.08316119,  0.08443028,  0.01934309,  0.01912725, -0.03951548,
       -0.05975765,  0.00633588,  0.05523144,  0.10743134,  0.02565285,
        0.0591085 , -0.06582073,  0.01399035,  0.02120291, -0.05888617,
        0.10206449, -0.09069946, -0.0107618 ,  0.10852517, -0.04826605,
       -0.0140061 , -0.02386281, -0.0154154 , -0.08144581, -0.06050904,
        0.09513642, -0.07653173, -0.07035505, -0.06641957,  0.0959511 ,
        0.00210147, -0.0283728 ,  0.0251952 ,  0.02327858,  0.0516822 ,
       -0.06586941,  0.12737978, -0.0497526 ,  0.09444507,  0.02383198,
        0.07142615,  0.00255052, -0.05192077, -0.00423393,  0.09097698,
        0.00370179, -0.09085364, -0.05976252, -0.00505705, -0.03072597,
       -0.01971474,  0.07135637,  0.03769137,  0.0279054 , -0.03008169,
       -0.06055124,  0.03197396, -0.06527382,  0.07851813,  0.00572016,
       -0.0784516 ,  0.02748995, -0.04415205,  0.0130929 , -0.01004708,
        0.08195702, -0.05701986, -0.00528685, -0.02261969, -0.01347201,
       -0.05679915,  0.13558668, -0.04854032,  0.11129056, -0.06001731,
        0.02259372, -0.09510072, -0.0601017 , -0.07179942, -0.13283914,
        0.04055088, -0.04183944, -0.0511499 ,  0.03285681,  0.05362966,
        0.10976992,  0.04941829,  0.06698109,  0.05332943,  0.01387919,
        0.07408281, -0.04666426,  0.00355118, -0.00381263,  0.04533675,
       -0.03874948,  0.07053033, -0.02114125,  0.08540725,  0.00161824,
        0.00835036, -0.01319466, -0.04281155, -0.00771225, -0.00845244,
       -0.10540923,  0.0800761 , -0.05904034,  0.01925546, -0.05956128,
       -0.02953153,  0.08487982, -0.02554736, -0.04823359, -0.0664569 ,
        0.05329373, -0.06990552, -0.03302883, -0.02510107, -0.00953555,
        0.02697387, -0.02970518,  0.01658745, -0.04818328, -0.03224336,
        0.074706  , -0.08186127, -0.09457977,  0.06484376, -0.00644494,
        0.02794598,  0.00331457, -0.02484141, -0.04680546,  0.05409705,
       -0.11592874,  0.02648376, -0.01906234,  0.00095768, -0.02165894,
       -0.05348847,  0.05069712, -0.03122906,  0.01530424, -0.03511424,
        0.02661684,  0.02367618,  0.03573255, -0.01845538, -0.05866708,
       -0.01240951, -0.01477177, -0.06186577,  0.0008141 ,  0.05032061,
       -0.12011739, -0.09979408, -0.10374416, -0.12338588, -0.09950358,
        0.04013867,  0.0424464 ,  0.04548768, -0.10236635,  0.08622194,
       -0.0280823 ,  0.14800008, -0.03903186, -0.0107777 , -0.07649115,
       -0.03742846,  0.03051662,  0.01065777, -0.07335252, -0.10409634,
       -0.03014498,  0.06561625,  0.07241936, -0.01217062,  0.00568624,
       -0.02907225,  0.05303569,  0.01673026,  0.08879907, -0.02625818], dtype=float32)

Plotting in 2 dimensions

Of course, we can't see 300 dimensions, so let's plot some words in 2 dimensions, squashing them down from 300 using PCA (Principal Component Analysis):

In [2]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

%matplotlib inline
plt.rcParams["figure.figsize"] = (18, 10)

def plot_words(*words, lines=False):
    # Squash the 300-dimensional vectors down to 2 dimensions.
    pca = PCA(n_components=2)
    xys = pca.fit_transform([vectors[w] for w in words])

    if lines:
        # Draw a segment between each consecutive pair of words.
        for i in range(0, len(words), 2):
            plt.plot(xys[i:i+2, 0], xys[i:i+2, 1])
    else:
        plt.scatter(*xys.T)

    # Label every point with its word.
    for word, xy in zip(words, xys):
        plt.annotate(word, xy, fontsize=20)

    # Return the fitted PCA so callers can project more vectors into the same plane.
    return pca

When we plot the words, similar words end up in similar locations, which suggests the vectors really have captured something about English. Cool!

In [17]:
plt.title('similar words have similar vectors', fontsize=20)

plot_words('stream', 'euro', 'baseball', 'mountain', 'computer', 'lake', 'yen',
           'monkey', 'dog', 'basketball', 'cat', 'river', 'piano')
Out[17]:
PCA(copy=True, n_components=2, whiten=False)

Similar words

We can search all of the vectors to find the most similar words to a given word. Since we normalized every vector to unit length, the dot product of two vectors is their cosine similarity, so "most similar" just means "largest dot product". (A fun way to keep the running top N is with a heap.)

In [92]:
import heapq

def most_similar(v, *ignore, N=1):
    # Min-heap holding the N largest (similarity, word) pairs seen so far.
    similar = []
    for word, u in vectors.items():
        if word in ignore: continue
        similarity = u.dot(v)  # cosine similarity, since all vectors are unit length
        if len(similar) < N:
            heapq.heappush(similar, (similarity, word))
        else:
            # Push the new pair and pop the smallest, keeping the heap at size N.
            heapq.heappushpop(similar, (similarity, word))
    return sorted(similar, reverse=True)

The 10 most similar words to "piano" seem pretty reasonable. The top word is of course "piano" itself:

In [93]:
most_similar(vectors['piano'], N=10)
Out[93]:
[(0.99999994, 'piano'),
 (0.81736529, 'violin'),
 (0.75439155, 'cello'),
 (0.71672165, 'guitar'),
 (0.66841698, 'concerto'),
 (0.64464086, 'clarinet'),
 (0.63440984, 'saxophone'),
 (0.63233328, 'pianist'),
 (0.6304934, 'harpsichord'),
 (0.62700754, 'orchestral')]
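Note the *ignore parameter: it lets us exclude words from the search, which is handy when the query vector was built from words we don't want back. For example, to get piano's neighbours without piano itself (a usage sketch, output omitted):

most_similar(vectors['piano'], 'piano', N=5)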

Relationship vectors

Here's where things start to get interesting. Since every word is a vector, we can do things like add and subtract them.

If we subtract "france" from "paris", we now have a vector that sort of represents "capital of".

Here are a bunch of countries and capitals plotted in 2 dimensions. The vectors from each country to its capital are all approximately the same "capital of" vector:

In [19]:
plt.title('similar relationships have similar vectors - "capital of"', fontsize=20)

plot_words('china', 'beijing',
           'japan', 'tokyo',
           'france', 'paris',
           'russia', 'moscow',
           'italy', 'rome',
           'spain', 'madrid',
           'greece', 'athens',
           'turkey', 'ankara',
           'portugal', 'lisbon', lines=True)
Out[19]:
PCA(copy=True, n_components=2, whiten=False)

We can see a similar consistency when visualizing the vector from a comparative adjective to its superlative:

In [21]:
plt.title('similar relationships have similar vectors - "comparative -> superlative"', fontsize=20)

plot_words('larger', 'largest',
           'smarter', 'smartest',
           'happier', 'happiest',
           'dumber', 'dumbest',
           'angrier', 'angriest', lines=True)
Out[21]:
PCA(copy=True, n_components=2, whiten=False)

Relationship vectors can solve analogies

Since the "capital of" vector seems to be so consistent, we could use it to find a capital we don't know.

Let's say we've forgotten what the capital of Poland is, but we remember that Paris is the capital of France.

Then we can just solve for x = Paris - France + Poland.

Sure enough, the most similar vector is x ≈ Warsaw!

In [96]:
most_similar(vectors['paris'] - vectors['france'] + vectors['poland'])
Out[96]:
[(0.85991102, 'warsaw')]

We can also visualize this vector arithmetic by taking the Paris - France vector and moving it over to Poland:

In [23]:
plt.title('using "capital of" relationship vector from France to find capital of Poland', fontsize=20)

pca = plot_words('china', 'beijing',
                 'japan', 'tokyo',
                 'france', 'paris',
                 'russia', 'moscow',
                 'italy', 'rome',
                 'spain', 'madrid',
                 'greece', 'athens',
                 'turkey', 'ankara',
                 'portugal', 'lisbon',
                 'poland', 'warsaw')
# Project the three relevant words into the same 2-D plane as the plot.
paris, france, poland = pca.transform([vectors[x] for x in ('paris', 'france', 'poland')])
capital_of = paris - france

# Draw the "capital of" vector starting at France (it lands on Paris)...
capital_of_france = capital_of + france
plt.plot([france[0], capital_of_france[0]], [france[1], capital_of_france[1]], 'b')

# ...and the same vector moved over to start at Poland (it lands near Warsaw).
capital_of_poland = capital_of + poland
plt.plot([poland[0], capital_of_poland[0]], [poland[1], capital_of_poland[1]], 'b')
Out[23]:
[<matplotlib.lines.Line2D at 0x11b544860>]

We can do the same thing to see that Queen - Woman + Man ≈ King:

In [97]:
most_similar(vectors['queen'] - vectors['woman'] + vectors['man'], 'queen')
Out[97]:
[(0.74458629, 'king')]

So if we let Regal = Queen - Woman, then Regal + Man ≈ King.

But vector addition is commutative, so it's equivalent to let Masculine = Man - Woman, and then Masculine + Queen ≈ King.

These are equivalent because by commutativity and associativity of vector addition:

Regal + Man = (Queen - Woman) + Man = Queen + Man - Woman = (Man - Woman) + Queen = Masculine + Queen

…which you can visualize as the two ways of traversing the parallelogram in the below plot.
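We can sanity-check the equality numerically; both routes produce the same vector up to floating-point rounding (assuming the vectors dict from earlier is loaded):

regal_man = (vectors['queen'] - vectors['woman']) + vectors['man']
masculine_queen = (vectors['man'] - vectors['woman']) + vectors['queen']
assert np.allclose(regal_man, masculine_queen)  # same point, either way around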

This is related to the concept of a "commutative diagram" in algebra:

In [26]:
plt.title('commutativity: "regal man" = (queen - woman) + man = (man - woman) + queen = "masculine queen"', fontsize=20)

pca = plot_words('man', 'king',
                 'woman', 'queen',
                 'boy', 'prince',
                 'girl', 'princess')
# Project the words into the same 2-D plane as the plot.
queen, woman, man = pca.transform([vectors[x] for x in ('queen', 'woman', 'man')])
regal = queen - woman
masculine = man - woman

# From woman: the "regal" vector (blue) and the "masculine" vector (green).
regal_woman = regal + woman
plt.plot([woman[0], regal_woman[0]], [woman[1], regal_woman[1]], 'b')
masculine_woman = masculine + woman
plt.plot([woman[0], masculine_woman[0]], [woman[1], masculine_woman[1]], 'g')

# The two routes around the parallelogram: regal + man and masculine + queen.
regal_man = regal + man
plt.plot([man[0], regal_man[0]], [man[1], regal_man[1]], 'b')
masculine_queen = masculine + queen
plt.plot([queen[0], masculine_queen[0]], [queen[1], masculine_queen[1]], 'g')
Out[26]:
[<matplotlib.lines.Line2D at 0x11be05940>]

One more while we're at it, this time showing that Kitten - Cat + Dog ≈ Puppy:

In [98]:
most_similar(vectors['kitten'] - vectors['cat'] + vectors['dog'], 'kitten')
Out[98]:
[(0.73260075, 'puppy')]

What are the 300 dimensions?

Does each of the 300 dimensions correspond to some human-friendly spectrum, such as "big <-> small" or "dark <-> light"?

We can look at example words that are evenly spaced along a dimension, but it usually ends up being pretty hard (for me at least) to tell what the dimension "means". To keep the scan quick, the code below switches to the 50-dimensional vectors and only the 10,000 most common words.

It looks like dimension #2 is kind of like "traditional <-> modern"? But it's a bit of a stretch:

In [162]:
import itertools as it

# Load just the first (most frequent) 10,000 words of the smaller
# 50-dimensional vectors; plenty for eyeballing a single dimension.
N = 10000
top_vectors = {}
with open('glove.6B/glove.6B.50d.txt') as f:
    for line in tqdm(it.islice(f, N), total=N):
        word, vector = line.split(maxsplit=1)
        v = np.fromstring(vector, sep=' ', dtype='float32')
        top_vectors[word] = v / np.linalg.norm(v)

def print_dimension_examples(i, N=10):
    # For N evenly spaced values spanning dimension i, print the word
    # whose component along that dimension is closest to each value.
    xs = {w: v[i] for w, v in top_vectors.items()}
    for x in np.linspace(min(xs.values()), max(xs.values()), N):
        print(*min(xs.items(), key=lambda w_x: abs(x - w_x[1])))
100%|██████████| 10000/10000 [00:00<00:00, 34847.09it/s]
In [168]:
print_dimension_examples(2, N=10)
ritual -0.473578
victorian -0.373939
influences -0.27019
examination -0.168541
created -0.066805
republics 0.0348972
interstate 0.136595
numbers 0.238045
x 0.339924
plug 0.441698
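To hunt for other interpretable dimensions, we can loop the same check over several of them (a sketch, outputs omitted):

for i in range(10):
    print('--- dimension', i, '---')
    print_dimension_examples(i, N=5)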