Introduction to Word Embeddings


The early work on representing words in a vector space is exploited in Information Retrieval as a side effect, where the feature vectors of words is learned based on the word-document co-occurrence(LSI)[1]. Then Y. Bengio et. al. proposed a neural network based model(NNLM, [2]) that learns the word vectors and the language model at the same time. Mikolov propose the CBOW and Skip-Gram model[3] which simplifies the NNLM by removing the hidden layer, and accelerate the training by Hierarchical SoftMax or Negative Sampling[4]. Finally J. Pennington propose the GloVe[5] which combines the global matrix factorization methods(LSI) and local context window methods(Skip-Gram).

Latent Semantic Indexing

LSI first arise in the information retrieval literature to match documents for each query, it does a matrix factorization based on the word-document co-occurrence matrixX:

X X = T_0 \cdot S_0 \cdot D_0' \tag{1}

where X \in \mathbb{R}^{v \cdot d}, T_0 \in \mathbb{R}^{v \cdot v}, S_0 \in \mathbb{R}^{v \cdot d}, D_0 \in \mathbb{R}^{d \cdot d}, and S_0 is a rectangular diagonal matrix.

If we only keep the k largest value of the S_0 matrix, then we can get:

X \approx \tilde X = T \cdot S \cdot D' \tag{2}

Where T \in \mathbb{R}^{v \cdot k}, S \in \mathbb{R}^{k \cdot k}, D \in \mathbb{R}^{d \cdot k}

So each row of the T is the word embedding vector for each word.

Neural Net Language Model(NNLM)

Y. Bengio et. al. propose a probabilistic feedforward neural network language model, which learns the language model as well as word embeddings at the same time. The network consists of input, hidden and output layers, as below:

The NNLM learns the language model as well as word embedding by go through the training data using a window of size N, at each time the NNLM predict the last word in the window based on the first N-1 words. So the input is a (N-1) \times D-dimension vector, which is the concatenation of the word vectors of the first N-1 words. The hidden layer consists of H hidden nodes, with sigmoid or tanh activation function. The output layer is a softmax layer with dimension V, e.g., the number of unique words.

Skip-Gram & CBow

While the NNLM can learn meaningful word embeddings, it's computational complexity is quite high, e.g. for each training sample(window), the model must learn Q = (N - 1) \times D + (N - 1) \times D \times H + H \times V parameters. T. Mikolov propose two more efficient neural network architectures which can learn the word embedding much faster: CBoW and Skip-Gram.

The main difference between these two new methods and the NNLM is that, there is no hidden layer any more. The COBW model predict a word based on the context words(words around the target word), while the Skip-Gram predict the context words based on the current word, as below:


The GloVe looks very similar with the LSA, but is explicitly designed to model semantic and syntactic regularities. Instead of factorize the document-word co-occurrence, the GloVe models the logarithm of the co-occurrence of a word and its context words as:

w_i^T\tilde w_k + b_i + \tilde b_k = log(X_{ij})\tag{3}

where w_i and \tilde w_k is the word embeddings of a word and its context word, while b_i and \tilde b_k are the corresponding biases. The X_{ik} is their co-occurrence of word k appears in the context of word i.

So the loss function of the GloVe is:

J = \sum_{i, j=1}^Vf(X_{ij})(w_i^T \tilde w_j + b_i+ \tilde b_j - log X_{ij})^2\tag{4}

where f is a weighting function:

f(x) = \begin{cases} (\frac{x}{x_{max}})^{\alpha} & \text{if } x \lt x_{max} \\ 1 \text{otherwise}\end{cases}\tag{5}

Why does it works?

J. Pennington et. at have done some interesting analysis of word embedding algorithms. They classify these algorithms to Global Matrix Factorization Methods such as LSA and Local Context Window Methods such as NNLM, CBOW and Skip-Gram, while at last they propose the GloVe algorithm which combines the advantage of both.

Global Matrix Factorization Methods utilize the global statistics information of the corpus. For example, the LSA does the matrix factorization on the document-word co-occurrence matrix to get the word vectors. However, LSA performs poor on the word analogy task.

Local Context Window Methods are trained on the local context windows instead of the statistics of the whole corpus and perform good on the word analogy task. However, P. Pennington proved that the Skip-Gram is just an on-line version of the GloVe.

The Skip-Gram models the probability that word j appears in the context of word i as

Q_{ij} = \frac{exp(w_i^T\tilde w_j)}{\sum_{k=1}^Vexp(w_i^T\tilde w_k)}\tag{6}

Skip-Gram is trained in an on-line, stochastic way, but the implied global objective function can be written as

J = -\sum_{i\in corpus, j\in context(i)}logQ_{ij}\tag{7}

The above equation can be evaluated much more efficiently if we group those terms that have the same values for i and j:

J = -\sum_{i=1}^Vsum_{j=1}^V X_{ij}logQ_{ij}\tag{8}

Since X_i = \sum_k X_{ik} and P_{ij} = \frac{X_{ij}}{X_i}, so J can be rewritten as

J = -\sum_{i=1}^V X_i \sum_{j=1}^VP_{ij}log Q_{ij} = \sum_{i=1}^VX_iH(P_i, Q_i)\tag{9}

We can find that the above equation (9) is very similar with (4). While the Skip-Gram uses the cross entropy error while GloVe use the lease square error.

Intuitive Interpretation

So why does the above word embedding methods can perform so good on the word analogy task? Is there any intuitive interpretation?

Absolutely yes, here we take Skip-Gram as example, since the other methods are very similar with it. Say that the word pear and orange is very similar. For two words to be similar, we mean that they often appears in the same context. For the pear-orange case, a very common context can be She eat some pear/orange, or I want a glass of pear/orange juice. Since they often share the same context, so in training, their word vectors should be very similar with each other so that they can predict their common context words well.

Another very interesting property of the word embedding is that the offset of the word vectors is also meaning full, for example, x_{queen} - x_{woman} \approx x_{king} - x_{man}. An intuitive interpretation is that, we can think each word as a compose of concepts, e.g. x_{king} = x_{man} + x_{ruler} + x_{royalty} + x_{human} + x_{adult} + \cdots. So x_{queen} - [\text{concept of woman}] + [\text{concept of man}] \approx x_{king}


[1]. Indexing by latent semantic analysis. S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, R. Harshman. JASIS. 1990.
[2]. A Neural Probabilistic Language Model. Y. Bengio, R. Ducharme, P. Voncent, C. Jauvin. JMLR. 2003.
[3]. Efficient Estimation of Word Representation in Vector Space. T. Mikolov, K. Chen, G. Corrado, J. Dean. arXiv:1301.3781v3
[4]. Distributed Representations of Words and Phrases and their Compositionality. T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean. arXiv:1310.4546v1.
[5]. GloVe: Global Vectors for Word Representation. J. Pennington, R. Socher, C. D. Manning.