# Matrix Factorization

In recommender systems, Matrix Factorization maps both users and items to a joint latent factor space of dimension $k$, such that user-item interactions can be modeled as inner products in that space. That is, we map each user $u$ to a vector $\mathbf{p}_u\in \mathbb{R}^k$ and each item $i$ to a vector $\mathbf{q}_i \in \mathbb{R}^k$. For movie recommendation, each dimension of the latent factor space can be interpreted as a topic, say comedy vs. drama, or as another feature such as the amount of action or the orientation to children.
Given the latent vectors for user $u$ and item $i$, we can predict the interaction between them as

$\hat{r}_{ui} = \mathbf{q}_i^T\mathbf{p}_u\tag{1}$

The major challenge is to compute the latent vector for each user and item. This is quite similar to Singular Value Decomposition (SVD), a well-established technique for identifying latent semantic factors in information retrieval. Applying SVD in the collaborative filtering domain requires factoring the user-item rating matrix, which raises difficulties due to the high portion of missing values in the sparse rating matrix. One approach is to rely on imputation to fill in the missing ratings and make the matrix dense. However, imputation significantly increases the amount of data, and inaccurate imputation might distort it.

Matrix Factorization is a method that fits only the observed ratings, while avoiding overfitting through regularization. Here is the cost function for matrix factorization:

$\min_{\mathbf{p},\mathbf{q}}\sum_{(u, i)\in \kappa}(r_{ui}-\mathbf{q}_i^T\mathbf{p}_u)^2 + \lambda (||\mathbf{q}_i||^2 + ||\mathbf{p}_u||^2)\tag{2}$

Here $\kappa$ is the set of $(u, i)$ pairs for which $r_{ui}$ is known in the training set and $\lambda (||\mathbf{q}_i||^2 + ||\mathbf{p}_u||^2)$ is the regularization term.

Several learners have been proposed to optimize Matrix Factorization, including SGD, Coordinate Descent and MCMC.

## Optimization by SGD

SGD loops through all ratings in the training set. For each rating $r_{ui}$ in the training set, let $e_{ui} \stackrel{\text{def}}{=} r_{ui} - \mathbf{q}_i^T \mathbf{p}_u$. SGD then moves the parameters in the opposite direction of the gradient, yielding:

$\mathbf{q}_i \gets \mathbf{q}_i + \gamma(e_{ui} \mathbf{p}_u -\lambda \mathbf{q}_i)$
$\mathbf{p}_u \gets \mathbf{p}_u + \gamma (e_{ui}\mathbf{q}_i -\lambda \mathbf{p}_u)$
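These updates are easy to sketch in NumPy. The following is a minimal illustration, not a reference implementation; the function name `sgd_epoch` and the hyperparameter values are hypothetical:

```python
import numpy as np

def sgd_epoch(ratings, P, Q, gamma=0.01, lam=0.02):
    """One SGD pass over the observed ratings.
    ratings: iterable of (u, i, r_ui); P[u], Q[i]: k-dim latent vectors."""
    for u, i, r_ui in ratings:
        e_ui = r_ui - Q[i] @ P[u]          # prediction error
        q_old = Q[i].copy()                # update both vectors from the same state
        Q[i] += gamma * (e_ui * P[u] - lam * Q[i])
        P[u] += gamma * (e_ui * q_old - lam * P[u])
```

Repeating such epochs until the training RMSE stops improving is the usual stopping rule.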

## Optimization by ALS

Alternating Least Squares (ALS) is a popular approach to optimizing regression models such as MF. It works by iteratively optimizing one block of parameters while leaving the others fixed. The prerequisite of ALS is that each optimization sub-problem can be solved analytically.

First, minimizing $J$ with respect to the user latent vector $\mathbf{p}_u$ is equivalent to minimizing the following, where the rows of $\mathbf{Q}$ are the latent vectors of the items rated by $u$ and $\mathbf{r}_u$ collects the corresponding ratings:

$J_u = ||\mathbf{r}_u-\mathbf{Qp}_u||^2 + \lambda ||\mathbf{p}_u||^2$

The minimum is where the first-order derivative is 0:

$\frac{\partial J_u}{\partial \mathbf{p}_u} = 2\mathbf{Q}^T\mathbf{Q}\mathbf{p}_u - 2\mathbf{Q}^T\mathbf{r}_u + 2\lambda \mathbf{p}_u = 0$
$\Rightarrow \mathbf{p}_u = (\mathbf{Q}^T\mathbf{Q} + \lambda \mathbf{I})^{-1}\mathbf{Q}^T\mathbf{r}_u$

Following the same process, we can get the solution for $\mathbf{q}_i$.
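The closed-form step above is easy to check numerically. A minimal NumPy sketch (the function name and shapes are illustrative):

```python
import numpy as np

def als_user_step(Q, r_u, lam):
    """Closed-form ALS update: p_u = (Q^T Q + lam I)^{-1} Q^T r_u.
    Q: (m, k) latent vectors of the items rated by u; r_u: (m,) their ratings."""
    k = Q.shape[1]
    return np.linalg.solve(Q.T @ Q + lam * np.eye(k), Q.T @ r_u)
```

The item update is symmetric, with the roles of $\mathbf{p}$ and $\mathbf{q}$ swapped.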

Equation $(2)$ interprets the rating $r_{ui}$ only as an interaction between user $u$ and item $i$, but in fact the rating values can also be due to effects associated with either users or items alone. For example, some users tend to give higher ratings than others, and some items are in general rated higher than others. To capture such effects, we can add bias terms to $(1)$:

$\hat{r}_{ui} = \mu + b_i + b_u + \mathbf{q}_i^T\mathbf{p}_u\tag{3}$

and the corresponding cost function is

$\min_{\mathbf{p},\mathbf{q},b}\sum_{(u, i)\in \kappa}(r_{ui}-\mu-b_u-b_i-\mathbf{p}_u^T\mathbf{q}_i)^2 + \lambda(||\mathbf{p}_u||^2 + ||\mathbf{q}_i||^2 + b_u^2 + b_i^2)\tag{4}$

## Handling user and item features

Matrix factorization can also incorporate user and item features besides the rating information, for example implicit feedback such as purchase or browsing history, as well as user attributes.
Denote by $N(u)$ the set of items for which user $u$ has expressed implicit feedback, where each item $j$ is associated with a factor vector $\mathbf{x}_j \in \mathbb{R}^f$. A user who showed a preference for the items in $N(u)$ is then characterized by

$|N(u)|^{-0.5}\sum_{j\in N(u)}\mathbf{x}_j\tag{5}$

Denote by $A(u)$ the set of attributes of a user $u$; a factor vector $\mathbf{y}^a \in \mathbb{R}^f$ corresponds to each attribute, describing a user through the set of user-associated attributes:

$\sum_{a\in A(u)}\mathbf{y}^a\tag{6}$

Then the predicted rating can be modeled by

$\hat{r}_{ui} = \mu + b_i + b_u + \mathbf{q}_i^T\left[\mathbf{p}_u + |N(u)|^{-0.5}\sum_{j \in N(u)}\mathbf{x}_j + \sum_{a\in A(u)}\mathbf{y}^a\right]\tag{7}$

## Factorization Machines

Although Matrix Factorization can model this kind of implicit feedback and user or item attribute features, the Factorization Machine (FM) handles such features more directly. For each rating sample with user features (e.g. age, gender, rating history), item features (e.g. genre, price) and context features (e.g. time), we can generate a feature vector and then build a factorization machine of degree $d=2$ to model the interactions between the features:

$\hat{y}(x) = w_0 + \sum_{i=1}^n w_i x_i + \sum_{i=1}^n\sum_{j=i+1}^n\langle \mathbf{v}_i, \mathbf{v}_j\rangle x_i x_j\tag{8}$

where $w_0 \in \mathbb{R}$, $\mathbf{w}\in \mathbb{R}^n$, $\mathbf{V}\in \mathbb{R}^{n\times k}$ are the model parameters to be estimated. A 2-way FM captures all single and pairwise interactions between variables:

- $w_0$ is the global bias.
- $w_i$ models the strength of the $i$-th variable.
- $\hat{w}_{i,j} = \langle \mathbf{v}_i, \mathbf{v}_j \rangle$ models the interaction between the $i$-th and $j$-th variables.

Instead of using one independent model parameter $w_{ij}\in \mathbb{R}$ for each interaction, the FM models the interaction by factorizing it. This is the key point that allows high-quality parameter estimates of interactions under sparsity.

As we can see, instead of projecting only the user and the item into the latent vector space as matrix factorization does, a factorization machine projects each feature into the latent vector space, which better handles the various features associated with the user, item and context.
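Rendle also shows that the pairwise term in $(8)$ need not be computed as a double loop: it equals $\frac{1}{2}\sum_{f=1}^k\big[(\sum_i v_{i,f} x_i)^2 - \sum_i v_{i,f}^2 x_i^2\big]$, which costs $O(kn)$. A minimal NumPy sketch (function name is illustrative):

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """2-way FM prediction, eq. (8), with the pairwise sum computed in O(kn):
    sum_{i<j} <v_i, v_j> x_i x_j
      = 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2]."""
    linear = w0 + w @ x
    s = V.T @ x                                   # (k,): sum_i v_if x_i
    pairwise = 0.5 * np.sum(s ** 2 - (V ** 2).T @ (x ** 2))
    return linear + pairwise
```

This reformulation is what makes FM training linear in both $n$ and $k$.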

## Field-Aware Factorization Machine

A Factorization Machine can effectively model the interaction between user and item, as well as the user-side and item-side features. But what if there are three or more dimensions? For example, in CTR prediction for computational advertising we may have User, Advertisement and Publisher. There are interactions not only between User and Advertisement, or between User and Publisher, but among all three dimensions. A 2-way Factorization Machine uses a single latent vector per feature for all of its interactions, whereas the Field-aware Factorization Machine (FFM) learns a separate latent vector for each (feature, field) pair, so every pairwise interaction is modeled with field-specific factors.
The formulation of FFM is

$\min_\mathbf{w} \sum_{i=1}^L\log(1 + \exp(-y_i \phi(\mathbf{w}, \mathbf{x}_i))) + \frac{\lambda}{2}||\mathbf{w}||^2$

where

$\phi(\mathbf{w}, \mathbf{x}) = \sum_{(j_1, j_2) \in C_2}\langle \mathbf{w}_{j_1, f_2}, \mathbf{w}_{j_2, f_1} \rangle x_{j_1}x_{j_2}$

where $f_1$ and $f_2$ are the fields of features $j_1$ and $j_2$ respectively, and $\mathbf{w}_{j_1, f_2}$ and $\mathbf{w}_{j_2, f_1}$ are two vectors of length $k$. Here a field means a dimension: for a recommender system there are typically two fields, User and Item; for CTR prediction there are typically three fields, User, Advertisement and Publisher. In fact, we can build one field for each ID feature, or even for each categorical feature.
Here is a concrete example: say there is a sample with 4 fields (User, Movie, Genre, Price) and 5 features (User-YuChin, Movie-3Idiots, Genre-Comedy, Genre-Drama, Price).

| User (Us) | Movie (Mo) | Genre (Ge) | Price (Pr) |
|---|---|---|---|
| YuChin (Yu) | 3Idiots (3I) | Comedy, Drama (Co, Dr) | \$9.99 |

FFM then learns one latent vector per (feature, field) pair, and the $\phi(\mathbf{w}, \mathbf{x})$ for this sample contains $\binom{5}{2} = 10$ interaction terms:
$\langle w_{Us\text{-}Yu, Mo}, w_{Mo\text{-}3I, Us}\rangle x_{Us\text{-}Yu} x_{Mo\text{-}3I}$
$+ \langle w_{Us\text{-}Yu, Ge}, w_{Ge\text{-}Co, Us}\rangle x_{Us\text{-}Yu} x_{Ge\text{-}Co}$
$+ \langle w_{Us\text{-}Yu, Ge}, w_{Ge\text{-}Dr, Us}\rangle x_{Us\text{-}Yu} x_{Ge\text{-}Dr}$
$+ \langle w_{Us\text{-}Yu, Pr}, w_{Pr, Us}\rangle x_{Us\text{-}Yu} x_{Pr}$
$+ \langle w_{Mo\text{-}3I, Ge}, w_{Ge\text{-}Co, Mo}\rangle x_{Mo\text{-}3I} x_{Ge\text{-}Co}$
$+ \langle w_{Mo\text{-}3I, Ge}, w_{Ge\text{-}Dr, Mo}\rangle x_{Mo\text{-}3I} x_{Ge\text{-}Dr}$
$+ \langle w_{Mo\text{-}3I, Pr}, w_{Pr, Mo}\rangle x_{Mo\text{-}3I} x_{Pr}$
$+ \langle w_{Ge\text{-}Co, Ge}, w_{Ge\text{-}Dr, Ge}\rangle x_{Ge\text{-}Co} x_{Ge\text{-}Dr}$
$+ \langle w_{Ge\text{-}Co, Pr}, w_{Pr, Ge}\rangle x_{Ge\text{-}Co} x_{Pr}$
$+ \langle w_{Ge\text{-}Dr, Pr}, w_{Pr, Ge}\rangle x_{Ge\text{-}Dr} x_{Pr}$
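The same expansion can be written generically. The following is a minimal sketch with hypothetical names: `field[j]` gives the field index of feature `j`, and `W[j][f]` is the latent vector of feature `j` toward field `f`:

```python
import numpy as np

def ffm_phi(x, field, W):
    """phi(w, x) for FFM: for each feature pair (j1, j2), the latent vector of
    j1 toward the field of j2 is paired with the latent vector of j2 toward
    the field of j1."""
    n = len(x)
    total = 0.0
    for j1 in range(n):
        for j2 in range(j1 + 1, n):
            total += W[j1][field[j2]] @ W[j2][field[j1]] * x[j1] * x[j2]
    return total
```

Note that unlike FM, this double loop has no known $O(kn)$ reformulation, since each feature uses a different vector per field.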

FFM has shown very promising results in CTR prediction (1st place, KDD Cup 2012).

# Conclusion

In conclusion, Matrix Factorization projects users and items into a latent vector space to model the interactions between user and item. The Factorization Machine projects each feature into a latent vector space to support interactions among the various features associated with user, item and context. Finally, the Field-aware Factorization Machine projects each (feature, field) pair into a latent vector space to model interactions across multiple fields.

# References:

1. Matrix Factorization Techniques for Recommender Systems. Y. Koren, et al.
2. Matrix Factorization and Factorization Machine for Recommender Systems. Chih-Jen Lin.
3. Factorization Machines. Steffen Rendle.
4. 3 Idiots' Approach for Display Advertising Challenge. Yu-Chin Juan, Yong Zhuang, Wei-Sheng Chin.
5. Field-aware Factorization Machines. Yu-Chin Juan, Yong Zhuang, Wei-Sheng Chin.
6. Pairwise Interaction Tensor Factorization for Personalized Tag Recommendation. Steffen Rendle, Lars Schmidt-Thieme.


## Introduction to Question Answering and Machine Reading

Question answering has always been a challenge in natural language processing. There has been much work on question answering based on information retrieval, semantic parsing and neural networks (MemNN). Information retrieval systems first retrieve a broad set of candidate answers by querying the search API of knowledge bases with a transformation of the question into a valid query, and then use fine-grained detection heuristics to identify the exact answer. Semantic parsing methods focus on correctly interpreting the meaning of a question with a semantic parsing system. Recently, J. Weston et al. proposed training a neural network with a memory (MemNN) to answer various kinds of questions, with quite promising results.

# Memory Networks (MemNN)

MemNN takes a set of inputs $x_1,\ldots,x_n$ that are to be stored in the memory and a question $q$, and outputs an answer $a$. Each of $x_i, q, a$ contains words from a vocabulary $V$.
MemNN consists of 3 modules: the attention module (input memory representation), the reading module (output memory representation), and the response generation module.

## Attention Module

Each input $x_i$ will be projected into a continuous input vector $m_i$, in the simplest case using an embedding matrix $A$ (initialized, e.g., with SkipGram or GloVe vectors). The question $q$ will also be projected into a continuous vector $u$ by another matrix $B$. The MemNN then generates a weight $p_i$ for each memory $m_i$; these weights are used in the reading module to retrieve the input: $p_i=\text{Softmax}(u^Tm_i)$

## Reading Module

Each input $x_i$ will also be projected into a continuous output vector $c_i$ using another embedding matrix $C$. The reading module then outputs the sum of all output vectors $c_i$ weighted by $p_i$: $o=\sum_i p_i c_i$

## Response Generation Module

This module generates a prediction based on the question vector $u$ as well as the output of the reading module $o$: $\hat{a}=\text{Softmax}(W(o+u))$.

The overall model is shown in Fig. 1(a). During training, all three embedding matrices $A, B$ and $C$, as well as $W$, are jointly learned by minimizing a cross-entropy loss between $\hat{a}$ and the true label $a$.

We can also stack multiple MemNN layers into a multi-layer MemNN by using the sum of the question input and the output of the current layer, $u^{k+1}=u^k+o^k$, as the input to the layer above, as shown in Fig. 1(b).
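A single hop of this model is just a few matrix products; here is a toy NumPy sketch on bag-of-words vectors with random matrices (the function names and shapes are illustrative, not from the paper):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def memnn_hop(X, q, A, B, C, W):
    """One hop of an end-to-end MemNN on bag-of-words vectors.
    X: (n, V) memory inputs; q: (V,) question;
    A, B, C: (V, d) embedding matrices; W: (V, d) output projection."""
    m = X @ A                      # input memory representations m_i
    c = X @ C                      # output memory representations c_i
    u = q @ B                      # question representation u
    p = softmax(m @ u)             # attention weights p_i = Softmax(u^T m_i)
    o = p @ c                      # weighted memory readout
    return softmax(W @ (o + u))    # answer distribution a = Softmax(W(o + u))
```

Stacking hops amounts to feeding $u + o$ back in as the next question vector.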

There are several improvements on MemNN. A. Sordoni et al. propose an iterative alternating attention method which allows a finer-grained exploration of both the question and the document (the memory); A. Miller et al. propose the Key-Value MemNN, which makes reading documents more viable by using different encodings for lookup and for reading.

## Iterative Alternating Attention

The iterative alternating attention can be viewed as an inference chain which focuses on different parts of the question and the document at each search step, where the chain is modeled by a GRU. In particular, at search step $t$: (1) it first performs an attentive read of the question, resulting in a question glimpse $q_t$; and (2) given the query glimpse, it extracts a conditional document glimpse $d_t$, representing the part of the document that is relevant to the current query glimpse. Both attentive reads are based on the previous hidden state of the inference GRU, $s_{t-1}$, which summarizes the information gathered so far up to search step $t$.

### Question Attentive Reading

Denote by $\tilde{q}_i$ the encoding of word $i$ in the question $q$ and by $q_{i,t}$ the question attention weights; the query glimpse $q_t$ at step $t$ can then be calculated by:

$q_{i, t} = \mathop{softmax}_{i = 1,\cdots,|Q|}\tilde{q}_i^T(A_q s_{t-1} + a_q)$
$q_t = \sum_i q_{i, t}\tilde{q}_i$

where $A_q\in \mathbb{R}^{1\times h}$, $a_q\in \mathbb{R}$, and $s_{t-1}$ is the previous hidden state of the GRU.

### Document Attentive Reading

The notation is very similar to that of the question attentive reading; note that the document attention weights are also conditioned on the query glimpse $q_t$. The document glimpse $d_t$ at step $t$ can be calculated by:

$d_{i, t}= \mathop{softmax}_{i=1,\cdots,|D|}\tilde{d}_i^T(A_d[s_{t-1}, q_t] + a_d)$
$d_t=\sum_i d_{i, t}\tilde{d}_i$

## Key-Value MemNN

Key-Value MemNN is a generalization of MemNN in the way the documents are stored in memory. In MemNN, attention and reading are based on different vectors of the same document, obtained with different embedding matrices. Key-Value MemNN generalizes this by using a key vector for attention and a value vector for reading, where the key and the value can be quite different. For example, we can split the document into windows of $W$ words in which the center word is an entity, and use the whole window as the key but only the center entity as the value. The memory attention and reading are then performed as follows.

### Memory Attention

$p_{m_i}= \text{Softmax}(A\Phi_Q(q) \cdot A\Phi_K(k_{m_i}))$
where the $\Phi$ are feature maps of dimension $D$, $A\in \mathbb{R}^{d\times D}$, $q$ is the question, and $k_{m_i}$ is the key vector of memory $i$.

$o = \sum_i p_{m_i}A\Phi_V (v_{m_i})$
where $v_{m_i}$ is the value vector of memory $i$.

## References:

[1]. J. Weston, S. Chopra, A. Bordes. Memory Networks. arXiv, 2015.
[2]. S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus. End-to-End Memory Networks. arXiv, 2015.
[3]. A. Sordoni, P. Bachman, Y. Bengio. Iterative Alternating Neural Attention for Machine Reading. arXiv, 2016.
[4]. A. Miller, A. Fisch, J. Dodge, A. Karimi, A. Bordes, J. Weston. Key-Value Memory Networks for Directly Reading Documents. arXiv, 2016.
[5]. A. Bordes, J. Weston, S. Chopra. Question Answering with Subgraph Embeddings. arXiv, 2014.

# Smoothing Techniques for Language Modeling

A language model predicts the probability that a given string $s$ is a real sentence. There are two widely-used families of methods: $n$-gram language models and neural-network based language models. Here we focus on $n$-gram language models, in particular on $n=2$, the bigram model.

Given a string $s$, which is a sequence of words $w_1,\cdots,w_l$, its probability can be expressed as

$p(s) = p(w_1)p(w_2|w_1)p(w_3|w_1w_2) \cdots p(w_l|w_1, \cdots w_{l-1}) = \prod_{i=1}^lp(w_i|w_1\cdots w_{i-1})\tag{1}$

In bigram models, we make the approximation that the probability of a word depends only on the immediately preceding word:

$p(s) = \prod_{i=1}^lp(w_i | w_1 \cdots w_{i-1}) \approx \prod_{i=1}^l p(w_i | w_{i-1})\tag{2}$

To estimate $p(w_i | w_{i-1})$, we can take

$p(w_i | w_{i-1})=\frac{c(w_{i-1}w_i)}{\sum_{w_i}c(w_{i-1}w_i)}\tag{3}$

The above equation is called the maximum likelihood estimate of $p(w_i | w_{i-1})$, because this assignment of probabilities yields the bigram model that assigns the highest probability to the training data among all possible bigram models.

# Smoothing

However, many bigrams do not occur in the training data, so the estimate $p(w_i | w_{i-1})=\frac{c(w_{i-1}w_i)}{\sum_{w_i}c(w_{i-1}w_i)}$ becomes 0, which in turn makes the probability of the whole sentence zero. This certainly doesn't make sense, and it is why we introduce smoothing methods for $n$-gram models.

The simplest smoothing method is to pretend each bigram occurs once more than it actually does, that is:

$p(w_i | w_{i-1}) = \frac{1 + c(w_{i-1}w_i)}{\sum_{w_i}[1 + c(w_{i-1}w_i)]} = \frac{1 + c(w_{i-1}w_i)}{|V| + \sum_{w_i}c(w_{i-1}w_i)}\tag{4}$

where $V$ is the vocabulary.

The above method can be generalized: instead of pretending each $n$-gram occurs once more than it does, we pretend it occurs $\delta$ times more:

$p_{add}(w_i|w_{i-n+1}^{i-1})=\frac{\delta + c(w_{i-n+1}^i)}{\delta|V| + \sum_{w_i}c(w_{i-n+1}^i)}\tag{5}$
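For the bigram case ($n = 2$), additive smoothing is a few lines of counting. A minimal sketch (the function name is illustrative):

```python
from collections import Counter

def additive_bigram(words, vocab, delta=1.0):
    """Return p(w | w_prev) under additive smoothing, eq. (5) with n = 2."""
    bigrams = Counter(zip(words, words[1:]))
    context = Counter(words[:-1])          # sum_w c(w_prev, w)
    def prob(w_prev, w):
        return (delta + bigrams[(w_prev, w)]) / (delta * len(vocab) + context[w_prev])
    return prob
```

For an unseen history the estimate degrades gracefully to the uniform distribution $1/|V|$.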

## Good-Turing Estimate

The Good-Turing estimate underlies many other smoothing methods, such as Katz smoothing. The basic idea is that for any $n$-gram that occurs $r$ times, we should pretend that it occurs $r^*$ times, where:

$r^* = (r+1)\frac{n_{r+1}}{n_r}\tag{6}$

where $n_r$ is the number of n-grams that occur exactly $r$ times in the training data. The count can be converted to a probability by

$p_{GT}(\alpha) = \frac{r^*}{N}\tag{7}$

where $\alpha$ is an n-gram with $r$ counts, and $N=\sum_{r=0}^{\infty}n_r r^*$
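Equation $(6)$ translates directly into code; a minimal sketch (note the caveat in the comment, which is why practical systems smooth the $n_r$ first):

```python
from collections import Counter

def good_turing_counts(counts):
    """Adjusted counts r* = (r + 1) * n_{r+1} / n_r (eq. 6).
    counts: ngram -> raw count r.  Caveat: for the largest observed r,
    n_{r+1} = 0 makes r* = 0, so in practice the n_r are smoothed first."""
    n_r = Counter(counts.values())
    return {g: (r + 1) * n_r[r + 1] / n_r[r] for g, r in counts.items()}
```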

## Jelinek-Mercer Smoothing

Both additive smoothing and the Good-Turing estimate give the same probability to every $n$-gram that occurs 0 times in the training data. For example, if we have $c(\text{BURNISH THE}) =0$ and $c(\text{BURNISH THOU})=0$, then we have

$p(THE|BURNISH) = p(THOU|BURNISH)$

according to both Additive Smoothing and Good-Turing Estimate. However, by intuition, we should have

$p(THE|BURNISH) > p(THOU|BURNISH)$

because the word THE is much more common than the word THOU. To capture this behavior, we can interpolate the bigram model with a unigram model:

$p_{interp}(w_i|w_{i-1})=\lambda p_{ML}(w_i|w_{i-1}) + (1 - \lambda)p_{ML}(w_i)\tag{8}$

In general, it's useful to interpolate higher-order $n$-gram models with lower-order $n$-gram models, because when there is insufficient data to estimate the probability in the higher-order model, the lower-order model can often provide useful information. An elegant way of performing this interpolation is:

$p_{interp}(w_i | w_{i-n+1}^{i-1}) = \lambda_{w_{i-n+1}^{i-1}} p_{ML}(w_i | w_{i-n+1}^{i-1}) + (1 - \lambda_{w_{i-n+1}^{i-1}}) p_{interp}(w_i |w_{i-n+2}^{i-1})\tag{9}$

To end the recursion, we can take the smoothed 1st-order model to be the maximum likelihood distribution, or take the smoothed 0th-order model to be the uniform distribution:

$p_{unif}(w_i)=\frac{1}{|V|}\tag{10}$
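The recursion in $(9)$, terminated at the uniform distribution $(10)$, can be sketched as follows. This is a simplified illustration: it uses a single fixed $\lambda$ rather than history-dependent ones, and backs off fully when a history is unseen so the distribution still normalizes:

```python
def interp_prob(w, history, counts, vocab, lam=0.7):
    """Recursive Jelinek-Mercer interpolation (eq. 9), ending at the
    uniform distribution (eq. 10).
    counts: dict mapping word tuples of every order to their counts."""
    if not history:                                   # unigram base case
        total = sum(c for g, c in counts.items() if len(g) == 1)
        return lam * counts.get((w,), 0) / total + (1 - lam) / len(vocab)
    h = tuple(history)
    ctx = sum(c for g, c in counts.items()
              if len(g) == len(h) + 1 and g[:-1] == h)
    if ctx == 0:                                      # unseen history: back off fully
        return interp_prob(w, history[1:], counts, vocab, lam)
    p_ml = counts.get(h + (w,), 0) / ctx
    return lam * p_ml + (1 - lam) * interp_prob(w, history[1:], counts, vocab, lam)
```

In practice the $\lambda_{w_{i-n+1}^{i-1}}$ are estimated on held-out data, typically bucketed by history count.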

## Katz Smoothing

Katz smoothing extends the Good-Turing estimate by interpolating higher-order models with lower-order models. Katz smoothing for bigram models is:

$c_{katz}(w_{i-1}^i)= \begin{cases} d_r r & \text{if } r \gt 0 \\ \alpha(w_{i-1})p_{ML}(w_i) & \text{if } r = 0\end{cases}\tag{11}$

All bigrams with a nonzero count $r$ are discounted according to a discount ratio $d_r$, which is approximately $\frac{r^*}{r}$. The counts subtracted from the nonzero counts are then distributed among the zero-count bigrams according to the next lower-order distribution. The $\alpha(w_{i-1})$ is chosen so that the total number of counts in the distribution $\sum_{w_i}c_{katz}(w_{i-1}^i)$ is unchanged, i.e., $\sum_{w_i}c_{katz}(w_{i-1}^i)=\sum_{w_i}c(w_{i-1}^i)$. The appropriate value for $\alpha(w_{i-1})$ is

$\alpha(w_{i-1})=\frac{1-\sum_{w_i:c(w_{i-1}^i)\gt 0}p_{katz}(w_i | w_{i-1})}{\sum_{w_i: c(w_{i-1}^i)=0}p_{ML}(w_i)} = \frac{1-\sum_{w_i:c(w_{i-1}^i)\gt 0}p_{katz}(w_i | w_{i-1})}{1 - \sum_{w_i: c(w_{i-1}^i)\gt0}p_{ML}(w_i)}\tag{12}$

The probability can then be obtained by normalization:

$p_{katz}(w_i | w_{i-1})=\frac{c_{katz}(w_{i-1}^i)}{\sum_{w_i}c_{katz}(w_{i-1}^i)}\tag{13}$

The $d_r$ are calculated so that: 1) large counts $r > k$ are considered reliable and are not discounted, where Katz suggests $k = 5$; 2) for small counts $r \le k$, the discounts are derived from the Good-Turing estimate. The $d_r$ for small counts are chosen such that the resulting discounts are proportional to the discounts predicted by the Good-Turing estimate, while at the same time the total number of counts discounted in the global bigram distribution equals the total number of counts that should be assigned to bigrams with zero counts.

The first constraint results in

$1 - d_r = \mu (1 - \frac{r^*}{r})\tag{14}$

for $r \in {1, ..., k}$. The Good-Turing estimate predicts that the total number of counts that should be assigned to bigrams with zero counts is $n_0 0^* = n_0 \frac{n_1}{n_0} = n_1$, so the second constraint is

$\sum_{r=1}^k n_r(1 - d_r)r = n_1\tag{15}$

The unique solution to these equations is

$d_r = \frac{\frac{r^*}{r}-\frac{(k+1)n_{k+1}}{n_1}}{1 - \frac{(k+1)n_{k+1}}{n_1}}\tag{16}$
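As a sanity check, $(16)$ can be computed directly and verified against constraint $(15)$; a small sketch (the function name is illustrative, and `n` maps a count $r$ to $n_r$):

```python
def katz_discount(r, n, k=5):
    """Discount ratio d_r from eq. (16); n maps a count r to n_r,
    the number of bigrams occurring exactly r times."""
    r_star = (r + 1) * n.get(r + 1, 0) / n[r]    # Good-Turing adjusted count
    a = (k + 1) * n.get(k + 1, 0) / n[1]
    return (r_star / r - a) / (1 - a)
```

Summing $n_r (1 - d_r) r$ over $r = 1,\ldots,k$ recovers exactly $n_1$, the Good-Turing mass reserved for unseen bigrams.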

## Witten-Bell Smoothing

Witten-Bell smoothing can be considered an instance of Jelinek-Mercer smoothing:

$p_{WB}(w_i|w_{i-n+1}^{i-1})=\lambda_{w_{i-n+1}^{i-1}}p_{ML}(w_i | w_{i-n+1}^{i-1}) + (1-\lambda_{w_{i-n+1}^{i-1}})p_{WB}(w_i | w_{i-n+2}^{i-1})\tag{17}$

where $1-\lambda_{w_{i-n+1}^{i-1}}=\frac{N_{1+}(w_{i-n+1}^{i-1}\bullet)}{N_{1+}(w_{i-n+1}^{i-1}\bullet)+\sum_{w_i}c(w_{i-n+1}^i)}$

where $N_{1+}(w_{i-n+1}^{i-1}\bullet)=|\{w_i:c(w_{i-n+1}^{i}) \gt 0\}|$ is the number of distinct words that follow the history $w_{i-n+1}^{i-1}$ in the training data.

The $1-\lambda_{w_{i-n+1}^{i-1}}$ can be regarded as the probability that a word not observed in the training data occurs after that history.

## Absolute Discounting

Absolute discounting also involves interpolation of higher- and lower-order models. However, instead of multiplying the higher-order maximum-likelihood distribution by a factor $\lambda_{w_{i-n+1}^{i-1}}$, it subtracts a fixed discount $D \le 1$ from each nonzero count:

$p_{abs}(w_i | w_{i-n+1}^{i-1})=\frac{\max\{c(w_{i-n+1}^{i}) - D, 0\}}{\sum_{w_i}c(w_{i-n+1}^i)} + (1- \lambda_{w_{i-n+1}^{i-1}})p_{abs}(w_i | w_{i-n+2}^{i-1})\tag{18}$

where

$1- \lambda_{w_{i-n+1}^{i-1}}=\frac{D}{\sum_{w_i} c(w_{i-n+1}^i)}N_{1+}(w_{i-n+1}^{i-1}\bullet)$

The value of $D$ can be set to $D=\frac{n_1}{n_1 + 2n_2}$.

## Kneser-Ney Smoothing

Consider a conventional bigram model of a phrase such as $P_{Katz}(\text{Francisco}|\text{on})$. Since the phrase San Francisco is quite common, the conventional unigram probability $\frac{C(\text{Francisco})}{\sum_w C(w)}$ is quite high. Then for another phrase containing Francisco, if we use Katz smoothing, the probability

$P_{Katz}(\text{Francisco}|\text{on}) = \begin{cases} \frac{disc(C(\text{on Francisco}))}{C(\text{on})} & \text{if } C(\text{on Francisco}) \gt 0 \\ \alpha(\text{on}) \times P_{Katz}(\text{Francisco}) & \text{otherwise} \end{cases} = \alpha(\text{on}) \times P_{Katz}(\text{Francisco})\tag{19}$

will also be quite high, even though the word Francisco seldom occurs after on. Kneser-Ney smoothing uses a modified backoff distribution based on the number of contexts each word occurs in, rather than the word's number of occurrences:

$P_{KN}(w_i | w_{i-1}) = \begin{cases} \frac{C(w_{i-1}w_i) - D}{C(w_{i-1})} & \text{if } C(w_{i-1}w_i) \gt 0 \\ \alpha(w_{i-1})\frac{|\{v \mid C(vw_i) \gt 0\}|}{\sum_w|\{v \mid C(vw) \gt 0\}|} & \text{otherwise}\end{cases}\tag{20}$

where $\alpha$ is a normalization constant such that the probabilities sum to 1.

Chen and Goodman also propose an interpolated Kneser-Ney smoothing:

$P_{IKN}(w_i | w_{i-1}) = \frac{C(w_{i-1}w_i) - D}{C(w_{i-1})} + \lambda(w_{i-1})\frac{|\{v \mid C(vw_i) \gt 0\}|}{\sum_w |\{v \mid C(vw)\gt 0\}|}\tag{21}$

where $\lambda(w_{i-1})$ is a normalization constant such that the probabilities sum to 1.
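The key ingredient is the continuation count, the number of distinct left contexts of a word, in place of its raw frequency. A minimal bigram sketch of $(21)$ (function name is illustrative; $\lambda(w_{i-1}) = \frac{D \cdot N_{1+}(w_{i-1}\bullet)}{C(w_{i-1})}$ is used so the distribution normalizes):

```python
from collections import Counter

def ikn_prob(w_prev, w, bigrams, D=0.75):
    """Interpolated Kneser-Ney for bigrams (eq. 21).  bigrams: Counter of
    (v, w) pairs.  The lower-order term counts distinct left contexts of w
    (continuation counts), not raw unigram frequency."""
    c_hist = sum(c for (v, _), c in bigrams.items() if v == w_prev)
    n_follow = sum(1 for (v, _) in bigrams if v == w_prev)   # distinct continuations
    lam = D * n_follow / c_hist
    p_cont = sum(1 for (_, x) in bigrams if x == w) / len(bigrams)
    return max(bigrams[(w_prev, w)] - D, 0) / c_hist + lam * p_cont
```

Under this scheme Francisco gets a small continuation probability despite its high raw frequency, because it follows very few distinct words.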

# References:

[1]. An Empirical Study of Smoothing Techniques for Language Modeling. S. Chen, J. Goodman. Technical Report, 1998.
[2]. A Bit of Progress in Language Modeling, Extended Version. J. Goodman. Technical Report, 2001.

# Introduction

Early work on representing words in a vector space arose in information retrieval as a side effect, where feature vectors for words are learned from word-document co-occurrence (LSI) [1]. Y. Bengio et al. then proposed a neural-network based model (NNLM) [2] that learns the word vectors and the language model at the same time. Mikolov proposed the CBOW and Skip-Gram models [3], which simplify the NNLM by removing the hidden layer and accelerate training with hierarchical softmax or negative sampling [4]. Finally, J. Pennington proposed GloVe [5], which combines global matrix factorization methods (LSI) and local context window methods (Skip-Gram).

## Latent Semantic Indexing

LSI first arose in the information retrieval literature for matching documents to queries; it performs a matrix factorization of the word-document co-occurrence matrix $X$:

$X = T_0 \cdot S_0 \cdot D_0' \tag{1}$

where $X \in \mathbb{R}^{v \times d}, T_0 \in \mathbb{R}^{v \times v}, S_0 \in \mathbb{R}^{v \times d}, D_0 \in \mathbb{R}^{d \times d}$, and $S_0$ is a rectangular diagonal matrix.

If we keep only the $k$ largest values of the $S_0$ matrix, then we get:

$X \approx \tilde X = T \cdot S \cdot D' \tag{2}$

where $T \in \mathbb{R}^{v \times k}, S \in \mathbb{R}^{k \times k}, D \in \mathbb{R}^{d \times k}$.

So each row of $T$ is the embedding vector of a word.
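The truncated decomposition $(2)$ is directly available from `numpy.linalg.svd`; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def lsi(X, k):
    """Rank-k LSI via truncated SVD of the v x d word-document matrix X.
    Returns (T, S, D); the rows of T are the k-dimensional word vectors."""
    T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
    return T0[:, :k], np.diag(s0[:k]), D0t[:k].T
```

By the Eckart-Young theorem, $T S D'$ is the best rank-$k$ approximation of $X$ in the Frobenius norm.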

## Neural Net Language Model(NNLM)

Y. Bengio et al. proposed a probabilistic feedforward neural network language model, which learns the language model as well as word embeddings at the same time. The network consists of input, hidden and output layers, as below:

The NNLM learns the language model and the word embeddings by going through the training data with a window of size $N$; at each step it predicts the last word in the window based on the first $N-1$ words. The input is thus an $(N-1) \times D$-dimensional vector, the concatenation of the word vectors of the first $N-1$ words. The hidden layer consists of $H$ hidden nodes with sigmoid or tanh activation. The output layer is a softmax layer of dimension $V$, the number of unique words.

## Skip-Gram & CBOW

While the NNLM can learn meaningful word embeddings, its computational complexity is quite high: each training sample (window) costs $Q = (N - 1) \times D + (N - 1) \times D \times H + H \times V$ operations. T. Mikolov proposed two more efficient neural network architectures that learn word embeddings much faster: CBOW and Skip-Gram.

The main difference between these two methods and the NNLM is that there is no hidden layer any more. The CBOW model predicts a word based on its context words (the words around the target word), while Skip-Gram predicts the context words based on the current word, as below:

## GloVe

GloVe looks very similar to LSA, but is explicitly designed to model semantic and syntactic regularities. Instead of factorizing the document-word co-occurrence matrix, GloVe models the logarithm of the co-occurrence count of a word and its context words:

$w_i^T\tilde w_k + b_i + \tilde b_k = \log(X_{ik})\tag{3}$

where $w_i$ and $\tilde w_k$ are the word embeddings of a word and its context word, and $b_i$ and $\tilde b_k$ are the corresponding biases. $X_{ik}$ is the number of times word $k$ appears in the context of word $i$.

So the loss function of the GloVe is:

$J = \sum_{i, j=1}^Vf(X_{ij})(w_i^T \tilde w_j + b_i+ \tilde b_j - \log X_{ij})^2\tag{4}$

where $f$ is a weighting function:

$f(x) = \begin{cases} (\frac{x}{x_{max}})^{\alpha} & \text{if } x \lt x_{max} \\ 1 & \text{otherwise}\end{cases}\tag{5}$
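Equations $(4)$ and $(5)$ can be transcribed directly; this is a minimal sketch over a dense co-occurrence matrix with illustrative names (a real implementation would use sparse counts and stochastic updates):

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x) of eq. (5)."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(X, W, Wc, b, bc):
    """Loss of eq. (4), summed over the nonzero co-occurrence counts only,
    so log(0) never occurs.  W, Wc: word and context embeddings; b, bc: biases."""
    J = 0.0
    for i, j in zip(*np.nonzero(X)):
        diff = W[i] @ Wc[j] + b[i] + bc[j] - np.log(X[i, j])
        J += glove_weight(X[i, j]) * diff ** 2
    return J
```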

# Why does it work?

J. Pennington et al. have done some interesting analysis of word embedding algorithms. They classify these algorithms into global matrix factorization methods such as LSA and local context window methods such as NNLM, CBOW and Skip-Gram, and propose the GloVe algorithm, which combines the advantages of both.

Global matrix factorization methods utilize the global statistics of the corpus. For example, LSA factorizes the document-word co-occurrence matrix to get the word vectors. However, LSA performs poorly on the word analogy task.

Local context window methods are trained on local context windows instead of the statistics of the whole corpus, and perform well on the word analogy task. However, J. Pennington showed that Skip-Gram is essentially an online version of GloVe.

The Skip-Gram models the probability that word $j$ appears in the context of word $i$ as

$Q_{ij} = \frac{\exp(w_i^T\tilde w_j)}{\sum_{k=1}^V\exp(w_i^T\tilde w_k)}\tag{6}$

Skip-Gram is trained in an on-line, stochastic way, but the implied global objective function can be written as

$J = -\sum_{i\in corpus, j\in context(i)}\log Q_{ij}\tag{7}$

The above equation can be evaluated much more efficiently if we group those terms that have the same values for $i$ and $j$:

$J = -\sum_{i=1}^V\sum_{j=1}^V X_{ij}\log Q_{ij}\tag{8}$

Since $X_i = \sum_k X_{ik}$ and $P_{ij} = \frac{X_{ij}}{X_i}$, $J$ can be rewritten as

$J = -\sum_{i=1}^V X_i \sum_{j=1}^VP_{ij}\log Q_{ij} = \sum_{i=1}^VX_iH(P_i, Q_i)\tag{9}$

We can see that equation $(9)$ is very similar to $(4)$: Skip-Gram uses a cross-entropy error, while GloVe uses a weighted least squares error.

## Intuitive Interpretation

So why do the above word embedding methods perform so well on the word analogy task? Is there an intuitive interpretation?

Absolutely yes. Take Skip-Gram as an example, since the other methods are very similar to it. Say that the words pear and orange are very similar: for two words to be similar, we mean that they often appear in the same contexts. For the pear-orange case, a very common context could be She ate some pear/orange, or I want a glass of pear/orange juice. Since they often share the same contexts, during training their word vectors should become very similar to each other, so that each can predict their common context words well.

Another very interesting property of word embeddings is that the offsets of the word vectors are also meaningful, for example $x_{queen} - x_{woman} \approx x_{king} - x_{man}$. An intuitive interpretation is that we can think of each word as a composition of concepts, e.g. $x_{king} = x_{man} + x_{ruler} + x_{royalty} + x_{human} + x_{adult} + \cdots$. So $x_{queen} - [\text{concept of woman}] + [\text{concept of man}] \approx x_{king}$.

# References:

[1]. Indexing by latent semantic analysis. S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, R. Harshman. JASIS. 1990.
[2]. A Neural Probabilistic Language Model. Y. Bengio, R. Ducharme, P. Vincent, C. Jauvin. JMLR. 2003.
[3]. Efficient Estimation of Word Representations in Vector Space. T. Mikolov, K. Chen, G. Corrado, J. Dean. arXiv:1301.3781v3.
[4]. Distributed Representations of Words and Phrases and their Compositionality. T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean. arXiv:1310.4546v1.
[5]. GloVe: Global Vectors for Word Representation. J. Pennington, R. Socher, C. D. Manning.