From Matrix Factorization to Factorization Machines

Matrix Factorization

In Recommender System, Matrix Factorization maps both the users and items to a joint latent factor space of dimension k, such that the user-item interaction can be modeled as the inner product in this space. That is to say, we map each user u to a vector \mathbf{p}_u\in \mathbb{R}^k, and each item i to a vector \mathbf{q}_i \in \mathbb{R}^k. For movie recommendation, each dimension of the latent factor space can be explained as a topic, say comedy v.s. drama, or other features such as amount of action, orientation to children and so on.
Given the latent vector for user u and item i, we can predict the interaction between them as


The major challenge is to compute the latent vector for each user and item. This is quite similar with Singular Value Decomposition(SVD), a well-established techniques for identifying latent semantic factors in information retrieval. Applying SVD in the collaborative filtering domain requires factoring the user-item rating matrix. This often raises difficulties due to the high portion of missing values caused by the sparseness in the user-item rating matrix. One approach is to rely on imputation to fill in missing ratings to make the matrix dense. However, this will significantly increase the amount of data, and the inaccurate imputation might distort the data.

Matrix Factorization is a method which focus only on the observed ratings only, while avoid overfitting by regularization. Here is the cost function for matrix factorization

\min_{\mathbf{q}^*, \mathbf{p}^*}\sum_{(u, i) \in \kappa}(r_{ui} - \mathbf{q}_i^T\mathbf{p}_u)^2 + \lambda (||\mathbf{q}_i||^2 + ||\mathbf{p}_u||^2)\tag{2}

Here \kappa is the set of (u, i) pairs for which r_{ui} is known in the training set and  \lambda (||\mathbf{q}_i||^2 + ||\mathbf{p}_u||^2) is the regularization term.

Several learners have been proposed to optimize Matrix Factorization, including SGD, Coordinate Descent and MCMC.

Optimization by SGD

SGD loops through all ratings in the training set. For each rating r_{ui} in the training set, let  e_{ui} \stackrel{\text{def}}{=} r_{ui} - \mathbf{q}_i^T \mathbf{p}_u . Then it modifies the parameter in the opposite direction of the gradient yielding:

\mathbf{q}_i \gets \mathbf{q}_i + \gamma(e_{ui} \mathbf{p}_u -\lambda \mathbf{q}_i)
\mathbf{p}_u \gets \mathbf{p}_u + \gamma (e_{ui}\mathbf{q}_i -\lambda \mathbf{p}_u)

Optimization by ALS

Alternating Least Square is a popular approach to optimize regression models such as MF. It works by iteratively optimizing one parameter, while leaving others fixed. The prerequisite of ALS is that the optimization sub-problem can be analytically solved.

First, minimize J with respect to user latent vector \mathbf{p}_u is equivalent to minimizing:

J_u = ||(r_u-\mathbf{Qp}_u)||^2 + \lambda ||\mathbf{p}_u||^2

The minimum is where the first order derivative is 0:

\frac{\partial J_u}{\partial \mathbf{p}_u} = 2\mathbf{Q}^T\mathbf{Qp}_u - 2\mathbf{Q}^Tr_u + 2\lambda \mathbf{p}_u = 0
\Rightarrow \mathbf{p}_u = (\mathbf{Q}^T\mathbf{Q} + \lambda \mathbf{I})^{-1}\mathbf{Q}^Tr_u

Following the same process, we can get the solution for \mathbf{q}_i.

Adding Biases

The (2) only interpret the rating r_{ui} as an interaction between the user u and item i, but in the fact, the rating values can also due to effects associated with either users or items. For example, in the recommender system, some user tend to give higher rating than others, or some items is in general better than others. To consider all such effects, we can add a bias term to the (1)

\hat{r}_{ui} = \mu + b_i + b_u + \mathbf{q}_i^T\mathbf{p}_u\tag{3}

and the corresponding cost function is

\min_{\mathbf{p},\mathbf{q},b}\sum_{(u, i)\in \kappa}(r_{ui}-\mu-b_u-b_i-\mathbf{p}_u^T\mathbf{q}_i)^2 + \lambda(||\mathbf{p}_u||^2 + ||\mathbf{q}_i||^2 + b_u^2 + b_i^2)\tag{4}

Handle user and item features

Matrix factorization can also handle user and item features beside the rating information. For example, if we also have the implicit feedback such as the purchase or browsing history, as well as some user attributes.
We denote N(u) as the sets of items for which user u has expressed an implicit feedback, where each item i is associated with \mathbf{x}_i \in \mathbb{R}^f. So a user who showed a preference for items in N(u) is characterized by

|N(u)|^{-0.5}\sum_{i\in N(u)}\mathbf{x}_i\tag{5}

Denote A(u) as the set of attributes of a user u, and a factor vector \mathbf{y}^a \in \mathbb{R}^f corresponds to each attribute to describe a user through the set of user-associated attributes:

\sum_{a\in A(u)}\mathbf{y}^a\tag{6}

Then the predicted rating can be modeled by

\hat{r}_{ui} = \mu + b_i + b_u + \mathbf{q}_i^T[\mathbf{p}_u + |N(u)|^{-0.5}\sum_{i \in N(u)}\mathbf{x}_i + \sum_{a\in A(u)}\mathbf{y}^a]\tag{7}

Factorization Machines

Although Matrix Factorization can model such kind of implicit feedback and user or item attribute features, Factorization Machine can handle such kind of features more directly. For each rating sample with user features(e.g. age, gender, rating history), item features(e.g. genders, price) as well as context features(e.g. time) , we can generate a feature vector for each of these features and then build a factorization machine of degree d=2 to model the interactions between them:

\hat{y}(x) = w_0 + \sum_{i=1}^n \mathbf{w}_i \mathbf{x}_i + \sum_{i=1}^n\sum_{j=i+1}^n\langle \mathbf{v}_i, \mathbf{v}_j\rangle \mathbf{x}_i\mathbf{x}_j\tag{8}

where w_0 \in \mathbb{R}, \mathbf{w}\in \mathbf{R}^n, \mathbf{V}\in \mathbf{R}^{n\times k} are the model parameters that have to be estimated. A 2-way FM captures all single and pairwise interactions between variables:

  • w_0 is the global bias.
  • w_i models the strength of the i-th variable.
  • \hat{w}_{i,j} = \langle \mathbf{v}_i, \mathbf{v}_j \rangle models the interaction between the i-th and j-th variable.

Instead of using one model parameter w_{ij}\in \mathbf{R} for each interaction, the FM models the interaction by factorizing it. This is the key point which allows high quality parameter estimates of higher-order interactions under sparsity

As we can see, instead of project the user and item into the latent vector space by matrix factorization, factorization machine project each feature into the latent vector space to better handle various features associated with the user, item and context.

Field-Aware Factorization Machine

Factorization Machine can effectively model the interaction between user and item, as well as the user side and item side features. But what if there are more than 3 dimension? For example, in the CTR prediction for computational advertising, we may have User, Advertisement as well as Publisher. There is interaction not only between the User and Advertisement, or between User and Publisher, but also interaction among all 3 dimensions. As shown above, the 2-way Factorization Machine can only handle interactions between two features. However, the Field-Aware Factorization Machine can handle all such interactions.
The formulation of FFM is

\min_\mathbf{w} \sum_{i=1}^L(log(1 + exp(-y_i \phi(\mathbf{w}, \mathbf{x}_i))) + \frac{\lambda}{2}||\mathbf{w}||^2


\phi(\mathbf{w}, \mathbf{x}) = \sum_{j_1, j_2 \in C_2}\langle \mathbf{w}_{j_1, f_2}, \mathbf{w}_{j_2, f_1} \rangle \mathbf{x}_{j_1}\mathbf{x}_{j_2}

where f_1, f_2 are respectively the field of j_1 and j_2 , and w_{j_1, f_2} and w_{j_2, f_1} are two vectors with length k . Here the field means the dimension, for the recommender system, typically there are 2 dimensions: User and Item. For CTR prediction, there are typically 3 dimensions: User, Advertisement and Publisher. But actually we can build one dimension for each ID features, or even category features.
Here is a concrete example, say there is a sample with 4 features(User, Movie, Gender, Price) and 5 fields(User-Yu, Movie-3I, Gender-Comedy, Gender-Drama, Price).

User(Us) Movie(Mo) Gender(Ge) Price(Pr)
YuChin(Yu) 3Idiots(3I) Comedy, Drama(Co, Dr) $9.99

Then FFM will create 10 vectors, each corresponds to one (field, feature) pair. The \phi(w, x) in FFM is
\langle w_{Us-Yu ,Mo} ,w_{ Mo-3I, Us}\rangle x_{Us-Yu} x_{Mo-3I}
+ \langle w_{Us-Yu ,Ge} ,w_{ Ge-Co, Us}\rangle x_{Us-Yu} x_{Ge-Co}
 + \langle w_{ Us-Yu,Ge}, w_{Ge-Dr, Us}\rangle x_{Us-Yu} x_{Ge-Dr}
 + \langle w_{ Us-Yu, Pr}, w_{ Pr, Us}\rangle x_{Us-Yu} x_{Pr}
 + \langle w_{Mo-3I , Ge} ,w_{ Ge-Co, Mo}\rangle x_{Mo-3I} x_{Ge-Co}
 + \langle w_{Mo-3I ,Ge} ,w_{Ge-Dr , Mo}\rangle x_{Mo-3I} x_{Ge-Dr}
 + \langle w_{Mo-3I ,Pr} ,w_{ Pr, Mo}\rangle x_{Mo-3I} x_{Pr}
 + \langle w_{Ge-Co ,Ge} ,w_{Ge-Dr , Ge}\rangle x_{Ge-Co} x_{Ge-Dr}
 + \langle w_{Ge-Co ,Pr} ,w_{ Pr, Ge}\rangle x_{Ge-Co} x_{Pr}
 + \langle w_{Ge-Dr ,Pr} ,w_{Pr , Ge}\rangle x_{Ge-Dr} x_{Pr}

FFM has shown very promising results in CTR prediction(KDD CUP 2012, 1st).


In conclusion, Matrix Factorization projects the user and item into a latent vector space to handle interactions between user and item. Factorization Machine projects each feature into a latent vector space to support interactions between various features associated with user, item and context. At last, Field-aware Factorization Machine projects each feature(field) pair into a latent vector space to handle the interactions between more than 3 dimensions.


  1. Matrix Factorization Techniques for Recommender System. Y. Koren, et. at.
  2. Matrix Factorization and Factorization Machine for Recommender Systems. Chih-Jen Lin.
  3. Factorization Machines. Steffen Rendle.
  4. 3 Idiots' Approach for Display Advertising Challenge Yu-Chin Juan, Yong Zhuang, and Wei-Sheng Chin.
  5. Field-aware Factorization Machine Yu-Chin Juan, Yong Zhuang, and Wei-Sheng Chin.
  6. Pairwise interaction tensor factorization for personalized tag recommendation. Steffen Rendle, Lars Schmidt-Thieme.

Please comment here or send email to pandevirus@gmail for any suggestion.

Introduction to Question Answering and Machine Reading

Question Answering has always been a challenge in natural language processing. There has been lots of works on question answering based on Information Retrieval, Semantic Parsing and Neural Networks(MemNN). Information Retrieval systems first retrieve a broad set of candidate answers by querying the search API of KBs with a transformation of the question into a valid query and then use fine-grained detection heuristics to identify the exact answer. Semantic parsing methods focus on the correct interpretation of the meaning of a question by a semantic parsing system. Recently, J. Weston et. al. propose to train a neural network(MemNN) with a memory to answer various kinds of questions, and got quite promising results.

Memory Neural Network(MemNN)

MemNN takes a set of inputs x_1,...,x_n that are to be stored in the memory, a question q, and outputs an answer a. Each of the x_i,q,a contains words from a vocabulary V.
MemNN consists of 3 modules: attention module(input memory representation), reading module (output memory representation) and response generation module:

Attention Module

Each input x_iwill be projected into a continuous input vector m_i, in the simplest way using embedding matrix A generated by SkipGram or Glove. The question q will also be projected into a continuous vector u by another matrix B. Then the MemNN will generate a weight p_ifor each memory m_i. These weights will be used in the reading module to retrieve the input: p_i=Softmax(u^Tm_i)

Reading Module

Each input x_i will be projected into a continuous output vector c_i using another embedding matrix C. Then the reading module will output a sum of all output vectors c_i weighted by p_i: o=\sum_i p_i c_i

Response Generation Module

This module will generate a prediction based on the question vector u as well as the output of the reading module o: a=Softmax(W(o+u)).

The overall model is showed in Fig. 1(a).During training, all three embedding matrices A,Band C as well as W are jointly learning by minimizing a cross-entropy loss between \hat{a} and the true label a.

We can also stack multiple MemNN to get a multiple-layer MemNN by using the sum of the question input and the output of the current layer: u^{k+1}=u^k+o^k as the input to the upper layer, as shown in the Fig 1(b).
During training, all embedding matrices A,B,C as well as W are jointly learned by minimizing the cross-entropy loss between \hat{a} and the true label a.

There are several improvements based on MemNN. A. Sordoni et. al. propose a new iterative alternating attention methods which allows a fined-grained exploration of both the question and the document(memory), A. Miller et. al. propose a Key-Value MemNN which makes reading document more viable by utilizing different encoding methods for lookup and reading respectively.

Iterative Alternating Attention

The iterative alternating attention can be considered as an inference chain which focus on different part of the question and documents at each search step, where the inference chain is modeled by a GRU. In particular, at search step t: (1) It first perform an attentive read on the question, resulting a question glimpse q_t, and (2) given the query glimpse, it extract a conditional document glimpse d_t, representing the part of document that is relevant the current query glimpse. Both attentive reads are based on the previous hidden state of the inference GRUs_{t-1}, summarizing the information that has been gathered so far at search step t.

Question Attentive Reading

We denote q_ias the encodings for the word i in the question q, q_{i,t}as the question attention weights, then the query glimpseq_t at step t can be calculated by:

q_{i, t} = softmax+{i = 1,\cdots,|Q|}\tilde{q_i}^T(A_q s_{t-1} + a_q)
q_t = \sum_i q_{i, t}\tilde{q_i}

where A_q\in R^{1\times h}, a_q\in R^{1}, s_{t-1} as the previous hidden state of GRU.

Document Attentive Reading

The notation is very similar with the question attentive reading, note that the document reading weight is also conditioned on the query weight q_t, then the document glimpse d_t at step t can be calculated by:

d{i, t}= softmax_{i=1,\cdots,|D|}\tilde{d_i}^T(A_d[s_{t-1}, q_t] + a_d)
d_t=\sum_i d_{i, t}\tilde{d_i}

Key-Value MemNN

Key-Value MemNN is a generation of MemNN of the way that the documents are stored in memory. In MemNN, the attention and reading are based on different vectors of the same document using different embedding matrices. MemNN generalizes this by using a key vector for attention and a value vector for reading, while the key and the value can be quite different, for example, we can split the document to windows of W words where the center is an entity. We can use the whole window as key while only the center entity as the value. Then the memory attention and reading can be performed as following.

Memory Attention

p_{m_i}= Softmax(A\Phi_Q(q) \cdot A\Phi_K(k_{h_i}))
Where \Phi are feature maps of dimension D, A\in R^{d\times D}, qis the question vectors, k_{m_i}is the key vector of the memory i

Memory Reading

o = \sum_i p_{m_i}A\Phi_V (v_{m_i})
Where v_{m_i}is the value vector of memory i


[1]. J. Weston, S. Chopra, A. Bordes. Memory Networks. arXiv, 2015.
[2]. S. SukhbaatarM A. Szlam, J. Weston, R. Fergus. End-to-End Memory Networks. arXiv, 2015.
[2]. A. Sordoni, P. Bachman, Y. Bengio. Iterative Alternating Neural Attention for Machine Reading.. arXiv, 2016.
[3]. A. Miller, A. Fisch, J. Dodge, A. Karimi, A. Bordes, J. Weston. Key-Value Memory Networks for Directly Reading Documents. arXiv, 2016.
[4]. A. Bordes, J. Weston, S. Chopra. Question Answering with Subgraph Embedding. arXiv, 2014.

Smoothing Techniques for Language Modeling

Smoothing Techniques for Language Modeling

Language Model predicts the probability that a given string s is a real sentence. There are two most widely-used methods: n-gram language models and neural-network based language models. Here we focus on n-gram language models, in particular, we focus on n=2, that is, the bigram model.

Given a string s, which is a sequence of words w_1,\cdots,w_l, its probability can be expressed as

p(s) = p(w_1)p(w_2|w_1)p(w_3|w_1w_2) \cdots p(w_l|w_1, \cdots w_{l-1}) = \prod_{i=1}^lp(w_i|w_1\cdots w_{i-1})\tag{1}

In bigram models, we make the approximation that the probability of a word depends only on immediately preceding word:

p(s) = \prod_{i=1}^lp(w_i | w_1 \cdots w_{l-1}) \approx \prod_{i=1}^l p(w_i | w_{i-1})\tag{2}

The estimate p(w_i | w_{i-1}), we can take

p(w_i | w_{i-1})=\frac{c(w_{i-1}w_i)}{\sum_{w_i}c(w_{i-1}w_i)}\tag{3}

The above equation is called the maximum likelihood estimate of p(w_i | w_{i-1}), because the assignment of probabilities yields the bigram model that assigns the highest probability to the training data of all possible bigram models.


However, there are many bigrams that doesn't occur in the training data, then calculating the probability using p(w_i | w_{i-1})=\frac{c(w_{i-1}w_i)}{\sum_{w_i}c(w_{i-1}w_i)} will get 0, which will further result in the zero probability of the whole sentence. This certainly doesn't make sense. This is why we introduce the smoothing methods for n-gram models.

Additive(Laplace) Smoothing

The simplest smoothing method would be to pretend each bigram occurs once more that it actually does, that is:

p(w_i | w_{i-1}) = \frac{1 + c(w_{i-1}w_i)}{\sum_{w_i}[1 + c(w_{i-1}w_i)]} = \frac{1 + c(w_{i-1}w_i)}{|V| + \sum_{w_i}c(w_{i-1}w_i)}\tag{4}

where V is the vocabulary.

The above method can be further generalized, instead of pretending each n-gram occurs once more that it does, we pretend it occurs \sigma times more:

p_{add}(w_i|w_{i-n+1}^{i-1})=\frac{\delta + c(w_{i-n+1}^i)}{\delta|V| + \sum_{w_i}c(w_{i-n+1}^i)}\tag{5}

Good-Turing Estimate

Good-Turing estimate is used for many other smoothing methods, such as Katz Smoothing. The basic idea is that for any n-gram occurs r times, we should pretent that it occurs r^* times, where:

r^* = (r+1)\frac{n_{r+1}}{n_r}\tag{6}

where n_r is the number of n-grams that occur exactly r times in the training data. The count can be converted to a probability by

p_{GT}(\alpha) = \frac{r^*}{N}\tag{7}

where \alpha is an n-gram with r counts, and N=\sum_{r=0}^{\infty}n_r r^*

Jelinek-Mercer Smoothing

Both the additive smoothing as well as Good-Turing estimate will give the same probability to an n-gram that occurs 0 times in the training data. For example, if we have c(\text{BURNISH THE}) =0 and c(\text{BURNISH THOU})=0, then we have


according to both Additive Smoothing and Good-Turing Estimate. However, by intuition, we should have

p(THE|BURNISH)  data-recalc-dims= p(THOU|BURNISH)" />

because the word THE is much more common than the word THOU. To capture this behavior, we can interpolate the bigram model with a unigram modelL

p_{interp}(w_i|w_{i-1})=\lambda p_{ML}(w_i|w_{i-1}) + (1 - \lambda)p_{ML}(w_i)\tag{8}

In general, it's useful to interpolate higher-order n-gram model with lower-order n-gram models, because when there is insufficient data to estimate the probability of higher-order model, the lower-order model can often provide useful information. An elegant way of performing this interpolation is:

p_{interp}(w_i | w_{i-n+1}^{i-1}) = \lambda_{w_{i-n+1}^{i-1}} p_{ML}(w_i | w_{i-n+1}^{i-1}) + (1 - \lambda_{w_{i-n+1}^{i-1}}) p_{interp}(w_i |w_{i-n+2}^{i-1})\tag{9}

To end the recursion, we can take the smoothed 1st-order model to be the maximum likelihood distribution, or we can take the smoothed 0th-order model to be the uniform distrobution:


Katz Smoothing

Katz Smoothing extends the Good-Turing estimate by interpolating higher-order models with lower-order models, the Katz Smoothing for bigram models is:

c_{katz}(w_{i-1}^i)= \begin{cases} d_r r & \text{if} r \gt 0 \\ \alpha(w_{i-1})p_{ML}(w_i) & \text{if} r = 0\end{cases}\tag{11}

All bigrams with a nonzero count r are discounted according to a discount ratio d_r, which is approximately \frac{r^*}{r}. The counts subtracted from the nonzero counts are then distributed among the zero-count bigrams according to the next lower-order distribution. The \alpha(w_{i-1}) is chosen so that the total number of counts in the distribution \sum_{w_i}c_{katz}(w_{i-1}^i) is unchanged, i.e., \sum_{w_i}c_{katz}(w_{i-1}^i)=\sum_{w_i}c(w_{i-1}^i). The appropriate value for \alpha(w_{i-1}) is

\alpha(w_{i-1})=\frac{1-\sum_{w_i:c(w_{i-1}^i\gt 0)}p_{katz}(w_i | w_{i-1})}{\sum_{w_i: c(w_{i-1}^i=0)}p_{ML}(w_i)} = \frac{1-\sum_{w_i:c(w_{i-1}^i\gt 0)}p_{katz}(w_i | w_{i-1})}{1 - \sum_{w_i: c(w_{i-1}^i\gt0)}p_{ML}(w_i)}\tag{12}.

The probability can be got by normalization:

p_{katz}(w_i | w_{i-1})=\frac{c_{katz}(w_{i-1}^i)}{\sum_{w_i}c_{katz}(w_{i-1}^i)}\tag{13}

The d_r is calculated so that: 1) Large counts r  data-recalc-dims= k" /> should be considered reliable, so that they should not be discounted, where Katz suggest k = 5. 2) Small counts r \le k are derived from the Good-Turing estimate. The d_r for small counts is chosen such that the resulting discounts are proportional to the discounts predicted by Good-Turing Estimate, while at the same time the total number of counts discounted in the global bigram distribution is equal to the total number of counts that should be assigned to bigrams with zero counts.

The first constraint results in

1 - d_r = \mu (1 - \frac{r^*}{r})\tag{14}

for r \in {1, ..., k}. The Good-Turing estimate predicts that the total number of counts that should be assigned to bigrams with zero counts is n_0 0^* = n_0 \frac{n_1}{n_0} = n_1, so the second constraint is

\sum_{r=1}^k n_r(1 - d_r)r = n_1\tag{15}

The unique solution to these equations is

d_r = \frac{\frac{r^*}{r}-\frac{(k+1)n_{k+1}}{n_1}}{1 - \frac{(k+1)n_{k+1}}{n_1}}\tag{16}

Witten-Bell Smoothing

Witten-Bell can be considered as an instance of Jelinek-Mercer smoothing:

p_{WB}(w_i|w_{i-n+1}^{i-1})=\lambda_{w_{i-n+1}^{i-1}}p_{ML}(w_i, w_{i-n+1}^{i-1}) + (1-\lambda_{w_{i-n+1}^{i-1}})p_{WB}(w_i | w_{i-n+2}^{i-1})\tag{17}

where 1-\lambda_{w_{i-n+1}^{i-1}}=\frac{N_{1+}(w_{i-n+1}^{i-1}\bullet)}{N_{1+}(w_{i-n+1}^{i-1}\bullet)+\sum_{w_i}c(w_{i-n+1}^i)}

where N_{1+}(w_{i-n+1}^{i-1}\bullet)=|{w_i:c(w_{i-n+1}^{i-1}) \gt 0}|

The 1-\lambda_{w_{i-n+1}^{i-1}} can be regarded as the probability that a word not observed in the training data occurs after that history.

Absolute Discounting

Absolute Discounting also involves interpolation of higher- and lower- order models. However, instead of multiplying the higher-order maximum-likelihood distribution by a factor \lambda_{w_{i-n_+1}^{i-1}}, the discount is done by subtracting a fixed discount D \le 1 for each nonzero count:

p_{abs}(w_i | w_{i-n+1}^{i-1})=\frac{max{c(w_{i-n+1}^{i-1}) - D, 0}}{\sum_{w_i}c(w_{i-n+1}^i}) + (1- \lambda_{i-n+1}^{i-1})p_{abs}(w_i | w_{i-n+2}^{i-1})\tag{18}


1- \lambda_{i-n+1}^{i-1}=\frac{D}{\sum_{w_i} c(w_{i-n+1}^i)}N_{1+}(w_{i-n+1}^{i-1}\bullet)

The value of D could be D=\frac{n_1}{n_1 + 2n_2}

Kneser-Ney Smoothing

Consider a conventional bigram model of a phrase such as P_{Katz}(Francisco|on), since the phrase San Francisco is quite common, so the conventional unigram probability \frac{C(Francisco)}{\sum_w C(w)} is quite high. Then for another phrase consists of Francisco, if we use Katz Smoothing, then the probability

P_{Katz}(\text{on Francisco}) = \begin{cases} \frac{disc(C(\text{on Francisco}))}{c(on)} & \text{if} C(\text{on Francisco}) \gt 0 \\ \alpha(on) \times P_{Katz}(Francisco) & \text{otherwise} \end{cases} = \alpha(on) \times P_{Katz}(Francisco)\tag{19}

will also be quite high, while the word Francisco seldom occurs behind on. The Kneser-Ney smoothing use a modified backoff distribution based on the number of context each word occurs in, rather than the number of occurrences of the word:

P_{KN}(w_i | w_{i-1}) = \begin{cases} \frac{C(w_{i-1} - D)}{C(w_{i-1})} & \text{if } C(w_{i-1}w_i) \gt 0 \\ \alpha(w_{i-1})\frac{|{v|C(vw_i) \gt 0}|}{\sum_w|{v|C(vw) | gt 0}|} & \text{otherwise}\end{cases}\tag{20}

where \alpha is a normalization constant such that the probability sum to 1.

Goodman also propose an Interpolated Kneser-Ney smoothing:

P_{IKN}(w_i | w_{i-1}) = \frac{C(w_{i-1}w_i) - D}{C(w_{i-1})} + \lambda(w_{i-1})\frac{|{v|C(vw_i) \gt 0}|}{\sum_w |{v|C(vw)\gt 0}|}\tag{21}

where \lambda(w_{i-1}) is a normalization constant such that the probabilities sum to 1.

[1]. An Empirical Study of Smoothing Techniques for Language Modeling. S. Chen, J. Goodman. Technical Report. 1998.
[2]. A Bit of Progress in Language Modeling Extended Version. J. Goodman. Technical Report. 2001.