Smoothing Techniques for Language Modeling

A language model predicts the probability that a given string s is a real sentence. The two most widely used approaches are n-gram language models and neural-network-based language models. Here we focus on n-gram language models, and in particular on n = 2, the bigram model.

Given a string s, which is a sequence of words w_1,\cdots,w_l, its probability can be expressed as

p(s) = p(w_1)p(w_2|w_1)p(w_3|w_1w_2) \cdots p(w_l|w_1, \cdots w_{l-1}) = \prod_{i=1}^lp(w_i|w_1\cdots w_{i-1})\tag{1}

In bigram models, we make the approximation that the probability of a word depends only on the immediately preceding word:

p(s) = \prod_{i=1}^lp(w_i | w_1 \cdots w_{i-1}) \approx \prod_{i=1}^l p(w_i | w_{i-1})\tag{2}

To estimate p(w_i | w_{i-1}), we can take

p(w_i | w_{i-1})=\frac{c(w_{i-1}w_i)}{\sum_{w_i}c(w_{i-1}w_i)}\tag{3}

The above equation is called the maximum likelihood estimate of p(w_i | w_{i-1}), because this assignment of probabilities yields the bigram model that assigns the highest probability to the training data among all possible bigram models.
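
To make this concrete, here is a minimal sketch in Python of how bigram counts can be collected and the maximum likelihood estimate of equation (3) computed. The toy corpus and all variable names are hypothetical and only for illustration; the count structures are reused in the sketches that follow.

```python
from collections import Counter

# Hypothetical toy corpus: a list of tokenized sentences.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

unigram_counts = Counter()   # c(w), used later for lower-order models
history_counts = Counter()   # sum over w_i of c(w_{i-1} w_i), the denominator of eq. (3)
bigram_counts = Counter()    # c(w_{i-1} w_i)

for sentence in corpus:
    unigram_counts.update(sentence)
    for w1, w2 in zip(sentence[:-1], sentence[1:]):
        history_counts[w1] += 1
        bigram_counts[(w1, w2)] += 1

total_words = sum(unigram_counts.values())
vocab = set(unigram_counts)

def p_ml(w, prev):
    """Maximum likelihood estimate p(w | prev) = c(prev w) / sum_w c(prev w), eq. (3)."""
    if history_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, w)] / history_counts[prev]

print(p_ml("cat", "the"))  # 2/3: "the cat" occurs twice among the three bigrams starting with "the"
print(p_ml("sat", "dog"))  # 1/1 = 1.0
```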

Smoothing

However, many bigrams do not occur in the training data. For such bigrams, p(w_i | w_{i-1})=\frac{c(w_{i-1}w_i)}{\sum_{w_i}c(w_{i-1}w_i)} evaluates to 0, which in turn makes the probability of the whole sentence zero. This certainly doesn't make sense, and it is why we introduce smoothing methods for n-gram models.

Additive(Laplace) Smoothing

The simplest smoothing method is to pretend that each bigram occurs once more than it actually does, that is:

p(w_i | w_{i-1}) = \frac{1 + c(w_{i-1}w_i)}{\sum_{w_i}[1 + c(w_{i-1}w_i)]} = \frac{1 + c(w_{i-1}w_i)}{|V| + \sum_{w_i}c(w_{i-1}w_i)}\tag{4}

where V is the vocabulary and |V| is its size.

The above method can be further generalized: instead of pretending each n-gram occurs once more than it does, we pretend it occurs \delta times more:

p_{add}(w_i|w_{i-n+1}^{i-1})=\frac{\delta + c(w_{i-n+1}^i)}{\delta|V| + \sum_{w_i}c(w_{i-n+1}^i)}\tag{5}
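
A minimal sketch of equation (5) for the bigram case (n = 2), continuing the toy example above; the function name and the default \delta are illustrative assumptions:

```python
def p_additive(w, prev, delta=1.0):
    """Additive smoothing, eq. (5); delta = 1 gives Laplace smoothing, eq. (4)."""
    return (delta + bigram_counts[(prev, w)]) / (delta * len(vocab) + history_counts[prev])

print(p_additive("ran", "dog"))  # c(dog ran) = 0, yet the estimate is nonzero: 1/6
```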

Good-Turing Estimate

The Good-Turing estimate forms the basis of many other smoothing methods, such as Katz smoothing. The basic idea is that for any n-gram that occurs r times, we should pretend that it occurs r^* times, where:

r^* = (r+1)\frac{n_{r+1}}{n_r}\tag{6}

where n_r is the number of n-grams that occur exactly r times in the training data. The count can be converted to a probability by

p_{GT}(\alpha) = \frac{r^*}{N}\tag{7}

where \alpha is an n-gram with r counts, and N=\sum_{r=0}^{\infty}n_r r^*
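
The adjusted counts of equation (6) can be computed directly from the count-of-counts n_r. The sketch below is illustrative only: it keeps the raw count whenever n_{r+1} = 0, which is an assumption of this sketch; in practice the estimate is usually applied only to small r, or the n_r are smoothed first.

```python
from collections import Counter

def good_turing_adjusted_counts(ngram_counts):
    """Compute r* = (r + 1) * n_{r+1} / n_r (eq. 6) for every observed count r."""
    count_of_counts = Counter(ngram_counts.values())  # n_r
    adjusted = {}
    for r, n_r in sorted(count_of_counts.items()):
        n_r_plus_1 = count_of_counts.get(r + 1, 0)
        # Fall back to the raw count when n_{r+1} = 0 (an assumption of this sketch).
        adjusted[r] = (r + 1) * n_r_plus_1 / n_r if n_r_plus_1 > 0 else float(r)
    return adjusted

print(good_turing_adjusted_counts(bigram_counts))  # e.g. {1: 0.5, 2: 2.0} for the toy corpus
```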

Jelinek-Mercer Smoothing

Both additive smoothing and the Good-Turing estimate assign the same probability to every n-gram that occurs 0 times in the training data. For example, if we have c(\text{BURNISH THE}) = 0 and c(\text{BURNISH THOU}) = 0, then we have

p(THE|BURNISH) = p(THOU|BURNISH)

according to both additive smoothing and the Good-Turing estimate. Intuitively, however, we should have

p(THE|BURNISH) \gt p(THOU|BURNISH)

because the word THE is much more common than the word THOU. To capture this behavior, we can interpolate the bigram model with a unigram model:

p_{interp}(w_i|w_{i-1})=\lambda p_{ML}(w_i|w_{i-1}) + (1 - \lambda)p_{ML}(w_i)\tag{8}

In general, it is useful to interpolate a higher-order n-gram model with lower-order n-gram models: when there is insufficient data to estimate a probability under the higher-order model, the lower-order models can often provide useful information. An elegant way of performing this interpolation is:

p_{interp}(w_i | w_{i-n+1}^{i-1}) = \lambda_{w_{i-n+1}^{i-1}} p_{ML}(w_i | w_{i-n+1}^{i-1}) + (1 - \lambda_{w_{i-n+1}^{i-1}}) p_{interp}(w_i |w_{i-n+2}^{i-1})\tag{9}

To end the recursion, we can take the smoothed 1st-order model to be the maximum likelihood distribution, or we can take the smoothed 0th-order model to be the uniform distribution:

p_{unif}(w_i)=\frac{1}{|V|}\tag{10}
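
Continuing the toy example, here is a minimal sketch of equation (8) with a single fixed weight \lambda. In practice the \lambda's are bucketed by history count and tuned on held-out data; this sketch omits that and simply interpolates the bigram MLE with the unigram MLE:

```python
def p_jelinek_mercer(w, prev, lam=0.7):
    """Interpolate the bigram MLE with the unigram MLE, eq. (8)."""
    p_bigram = p_ml(w, prev)                     # higher-order model
    p_unigram = unigram_counts[w] / total_words  # lower-order model
    return lam * p_bigram + (1.0 - lam) * p_unigram

print(p_jelinek_mercer("the", "sat"))  # unseen bigram, but "the" is common, so the value is larger
print(p_jelinek_mercer("ran", "sat"))  # unseen bigram with a rarer word, so the value is smaller
```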

Katz Smoothing

Katz smoothing extends the Good-Turing estimate by combining higher-order models with lower-order models. For bigram models, it is defined as:

c_{katz}(w_{i-1}^i)= \begin{cases} d_r r & \text{if } r \gt 0 \\ \alpha(w_{i-1})p_{ML}(w_i) & \text{if } r = 0\end{cases}\tag{11}

All bigrams with a nonzero count r are discounted according to a discount ratio d_r, which is approximately \frac{r^*}{r}. The counts subtracted from the nonzero counts are then distributed among the zero-count bigrams according to the next lower-order distribution. The factor \alpha(w_{i-1}) is chosen so that the total number of counts in the distribution, \sum_{w_i}c_{katz}(w_{i-1}^i), is unchanged, i.e., \sum_{w_i}c_{katz}(w_{i-1}^i)=\sum_{w_i}c(w_{i-1}^i). The appropriate value of \alpha(w_{i-1}) is

\alpha(w_{i-1})=\frac{1-\sum_{w_i:c(w_{i-1}^i)\gt 0}p_{katz}(w_i | w_{i-1})}{\sum_{w_i: c(w_{i-1}^i)=0}p_{ML}(w_i)} = \frac{1-\sum_{w_i:c(w_{i-1}^i)\gt 0}p_{katz}(w_i | w_{i-1})}{1 - \sum_{w_i: c(w_{i-1}^i)\gt 0}p_{ML}(w_i)}\tag{12}

The probability is then obtained by normalization:

p_{katz}(w_i | w_{i-1})=\frac{c_{katz}(w_{i-1}^i)}{\sum_{w_i}c_{katz}(w_{i-1}^i)}\tag{13}

The d_r are calculated so that: 1) Large counts r \gt k are considered reliable and are not discounted, where Katz suggests k = 5. 2) Discounts for small counts r \le k are derived from the Good-Turing estimate: d_r is chosen such that the resulting discounts are proportional to the discounts predicted by the Good-Turing estimate, while at the same time the total number of counts discounted in the global bigram distribution equals the total number of counts that should be assigned to bigrams with zero counts.

The proportionality constraint results in

1 - d_r = \mu (1 - \frac{r^*}{r})\tag{14}

for r \in \{1, \ldots, k\}. The Good-Turing estimate predicts that the total number of counts that should be assigned to bigrams with zero counts is n_0 0^* = n_0 \frac{n_1}{n_0} = n_1, so the constraint on the total counts is

\sum_{r=1}^k n_r(1 - d_r)r = n_1\tag{15}

The unique solution to these equations is

d_r = \frac{\frac{r^*}{r}-\frac{(k+1)n_{k+1}}{n_1}}{1 - \frac{(k+1)n_{k+1}}{n_1}}\tag{16}
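
As an illustration of equation (16), the discount ratios d_r for r = 1, \ldots, k can be computed from the count-of-counts alone. The sketch below assumes n_r \gt 0 for all r \le k + 1, which holds for reasonably large corpora; the function name is illustrative:

```python
def katz_discount_ratios(count_of_counts, k=5):
    """Discount ratios d_r for r = 1..k, eq. (16).

    count_of_counts maps r to n_r, the number of bigrams seen exactly r times.
    Assumes n_r > 0 for r = 1..k+1 (an assumption that holds for typical corpora).
    """
    n1 = count_of_counts[1]
    common = (k + 1) * count_of_counts[k + 1] / n1  # (k+1) n_{k+1} / n_1
    ratios = {}
    for r in range(1, k + 1):
        r_star = (r + 1) * count_of_counts[r + 1] / count_of_counts[r]  # eq. (6)
        ratios[r] = (r_star / r - common) / (1.0 - common)
    return ratios
```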

Witten-Bell Smoothing

Witten-Bell smoothing can be considered an instance of Jelinek-Mercer smoothing:

p_{WB}(w_i|w_{i-n+1}^{i-1})=\lambda_{w_{i-n+1}^{i-1}}p_{ML}(w_i | w_{i-n+1}^{i-1}) + (1-\lambda_{w_{i-n+1}^{i-1}})p_{WB}(w_i | w_{i-n+2}^{i-1})\tag{17}

where 1-\lambda_{w_{i-n+1}^{i-1}}=\frac{N_{1+}(w_{i-n+1}^{i-1}\bullet)}{N_{1+}(w_{i-n+1}^{i-1}\bullet)+\sum_{w_i}c(w_{i-n+1}^i)}

where N_{1+}(w_{i-n+1}^{i-1}\bullet)=|\{w_i:c(w_{i-n+1}^i) \gt 0\}| is the number of distinct words that follow the history w_{i-n+1}^{i-1} in the training data.

The weight 1-\lambda_{w_{i-n+1}^{i-1}} can be regarded as the probability that a word not observed in the training data occurs after that history.
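
Continuing the toy bigram example, a minimal sketch of this weight for a given history in the bigram case (the function name is illustrative):

```python
def witten_bell_backoff_weight(prev):
    """1 - lambda for history prev: N_1+(prev .) / (N_1+(prev .) + sum_w c(prev w))."""
    distinct_followers = len({w for (h, w) in bigram_counts if h == prev})  # N_1+(prev .)
    total = history_counts[prev]
    if distinct_followers + total == 0:
        return 1.0  # unseen history: give all mass to the lower-order model (an assumption)
    return distinct_followers / (distinct_followers + total)

print(witten_bell_backoff_weight("the"))  # 2 distinct followers, 3 bigram tokens -> 2/5
print(witten_bell_backoff_weight("dog"))  # 1 distinct follower, 1 bigram token -> 1/2
```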

Absolute Discounting

Absolute discounting also involves the interpolation of higher- and lower-order models. However, instead of multiplying the higher-order maximum-likelihood distribution by a factor \lambda_{w_{i-n+1}^{i-1}}, the discounting is done by subtracting a fixed discount D \le 1 from each nonzero count:

p_{abs}(w_i | w_{i-n+1}^{i-1})=\frac{\max\{c(w_{i-n+1}^i) - D, 0\}}{\sum_{w_i}c(w_{i-n+1}^i)} + (1- \lambda_{w_{i-n+1}^{i-1}})p_{abs}(w_i | w_{i-n+2}^{i-1})\tag{18}

where

1- \lambda_{w_{i-n+1}^{i-1}}=\frac{D}{\sum_{w_i} c(w_{i-n+1}^i)}N_{1+}(w_{i-n+1}^{i-1}\bullet)

The value of D can be set to D=\frac{n_1}{n_1 + 2n_2}.
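
For the bigram case, the recursion in equation (18) bottoms out in a unigram model. The sketch below continues the toy example and uses the unigram MLE as that lower-order model, which is an assumption made for this illustration:

```python
def p_absolute_discount(w, prev, D=0.75):
    """Absolute discounting for bigrams, eq. (18), backing off to the unigram MLE."""
    total = history_counts[prev]
    p_unigram = unigram_counts[w] / total_words
    if total == 0:
        return p_unigram  # unseen history: use the lower-order model alone (an assumption)
    distinct_followers = len({v for (h, v) in bigram_counts if h == prev})  # N_1+(prev .)
    discounted = max(bigram_counts[(prev, w)] - D, 0) / total
    backoff_weight = D * distinct_followers / total  # 1 - lambda
    return discounted + backoff_weight * p_unigram

print(p_absolute_discount("cat", "the"))  # discounted count plus a share of the unigram mass
```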

Kneser-Ney Smoothing

Consider a conventional bigram model and a probability such as P_{Katz}(\text{Francisco}|\text{on}). Since the phrase San Francisco is quite common, the conventional unigram probability \frac{C(\text{Francisco})}{\sum_w C(w)} is quite high. Then, under Katz smoothing, the probability

P_{Katz}(\text{Francisco}|\text{on}) = \begin{cases} \frac{disc(C(\text{on Francisco}))}{C(\text{on})} & \text{if } C(\text{on Francisco}) \gt 0 \\ \alpha(\text{on}) \times P_{Katz}(\text{Francisco}) & \text{otherwise} \end{cases} = \alpha(\text{on}) \times P_{Katz}(\text{Francisco})\tag{19}

will also be quite high, even though the word Francisco seldom occurs after on. Kneser-Ney smoothing uses a modified backoff distribution based on the number of contexts each word occurs in, rather than on the number of occurrences of the word:

P_{KN}(w_i | w_{i-1}) = \begin{cases} \frac{C(w_{i-1}w_i) - D}{C(w_{i-1})} & \text{if } C(w_{i-1}w_i) \gt 0 \\ \alpha(w_{i-1})\frac{|\{v : C(vw_i) \gt 0\}|}{\sum_w|\{v : C(vw) \gt 0\}|} & \text{otherwise}\end{cases}\tag{20}

where \alpha is a normalization constant chosen so that the probabilities sum to 1.

Goodman also proposes an interpolated Kneser-Ney smoothing:

P_{IKN}(w_i | w_{i-1}) = \frac{C(w_{i-1}w_i) - D}{C(w_{i-1})} + \lambda(w_{i-1})\frac{|\{v : C(vw_i) \gt 0\}|}{\sum_w |\{v : C(vw)\gt 0\}|}\tag{21}

where \lambda(w_{i-1}) is a normalization constant such that the probabilities sum to 1.
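
Continuing the toy example, a minimal sketch of equation (21) for bigrams. The lower-order (continuation) distribution counts the distinct left contexts of each word; the max(..., 0) clipping and the handling of unseen histories are assumptions of this sketch rather than part of the equation:

```python
def p_interpolated_kneser_ney(w, prev, D=0.75):
    """Interpolated Kneser-Ney for bigrams, eq. (21)."""
    # Continuation probability: distinct left contexts of w over the number of bigram types.
    left_contexts = len({h for (h, v) in bigram_counts if v == w})
    p_continuation = left_contexts / len(bigram_counts)

    total = history_counts[prev]
    if total == 0:
        return p_continuation  # unseen history: continuation distribution alone (an assumption)

    distinct_followers = len({v for (h, v) in bigram_counts if h == prev})
    discounted = max(bigram_counts[(prev, w)] - D, 0) / total
    lam = D * distinct_followers / total  # lambda(prev), normalizes the distribution
    return discounted + lam * p_continuation

print(p_interpolated_kneser_ney("sat", "cat"))  # seen bigram: discounted count plus continuation mass
print(p_interpolated_kneser_ney("the", "cat"))  # "the" is frequent but never follows another word in
                                                # the toy corpus, so its continuation probability is 0
```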

References:
[1] S. Chen and J. Goodman. An Empirical Study of Smoothing Techniques for Language Modeling. Technical Report, 1998.
[2] J. Goodman. A Bit of Progress in Language Modeling, Extended Version. Technical Report, 2001.