# Smoothing Techniques for Language Modeling

**Language Model** predicts the probability that a given string is a real sentence. There are two most widely-used methods: **-gram language models** and **neural-network based language models**. Here we focus on -gram language models, in particular, we focus on , that is, the bigram model.

Given a string , which is a sequence of words , its probability can be expressed as

In bigram models, we make the approximation that the probability of a word depends only on immediately preceding word:

The estimate , we can take

The above equation is called the maximum likelihood estimate of , because the assignment of probabilities yields the bigram model that assigns the highest probability to the training data of all possible bigram models.

# Smoothing

However, there are many bigrams that doesn't occur in the training data, then calculating the probability using will get 0, which will further result in the zero probability of the whole sentence. This certainly doesn't make sense. This is why we introduce the smoothing methods for -gram models.

## Additive(Laplace) Smoothing

The simplest smoothing method would be to pretend each bigram occurs once more that it actually does, that is:

where is the vocabulary.

The above method can be further generalized, instead of pretending each n-gram occurs once more that it does, we pretend it occurs times more:

## Good-Turing Estimate

Good-Turing estimate is used for many other smoothing methods, such as Katz Smoothing. The basic idea is that for any n-gram occurs times, we should pretent that it occurs times, where:

where is the number of n-grams that occur exactly times in the training data. The count can be converted to a probability by

where is an n-gram with counts, and

## Jelinek-Mercer Smoothing

Both the additive smoothing as well as Good-Turing estimate will give the same probability to an -gram that occurs 0 times in the training data. For example, if we have and , then we have

according to both Additive Smoothing and Good-Turing Estimate. However, by intuition, we should have

p(THOU|BURNISH)" />

because the word THE is much more common than the word THOU. To capture this behavior, we can interpolate the bigram model with a unigram modelL

In general, it's useful to interpolate higher-order -gram model with lower-order -gram models, because when there is insufficient data to estimate the probability of higher-order model, the lower-order model can often provide useful information. An elegant way of performing this interpolation is:

To end the recursion, we can take the smoothed 1st-order model to be the maximum likelihood distribution, or we can take the smoothed 0th-order model to be the uniform distrobution:

## Katz Smoothing

Katz Smoothing extends the Good-Turing estimate by interpolating higher-order models with lower-order models, the Katz Smoothing for bigram models is:

All bigrams with a nonzero count are discounted according to a *discount ratio* , which is approximately . **The counts subtracted from the nonzero counts are then distributed among the zero-count bigrams according to the next lower-order distribution**. The is chosen so that the total number of counts in the distribution is unchanged, i.e., . The appropriate value for is

.

The probability can be got by normalization:

The is calculated so that: 1) Large counts k" /> should be considered reliable, so that they should not be discounted, where Katz suggest . 2) Small counts are derived from the Good-Turing estimate. The for small counts is chosen such that the resulting discounts are proportional to the discounts predicted by Good-Turing Estimate, while at the same time the total number of counts discounted in the global bigram distribution is equal to the total number of counts that should be assigned to bigrams with zero counts.

The first constraint results in

for . The Good-Turing estimate predicts that the total number of counts that should be assigned to bigrams with zero counts is , so the second constraint is

The unique solution to these equations is

## Witten-Bell Smoothing

Witten-Bell can be considered as an instance of Jelinek-Mercer smoothing:

where

where

The can be regarded as the probability that a word not observed in the training data occurs after that history.

## Absolute Discounting

Absolute Discounting also involves interpolation of higher- and lower- order models. However, instead of multiplying the higher-order maximum-likelihood distribution by a factor , the discount is done by subtracting a fixed discount for each nonzero count:

where

The value of could be

## Kneser-Ney Smoothing

Consider a conventional bigram model of a phrase such as , since the phrase *San Francisco* is quite common, so the conventional unigram probability is quite high. Then for another phrase consists of *Francisco*, if we use Katz Smoothing, then the probability

will also be quite high, while the word *Francisco* seldom occurs behind *on*. The Kneser-Ney smoothing use a modified backoff distribution based on the number of context each word occurs in, rather than the number of occurrences of the word:

where is a normalization constant such that the probability sum to 1.

Goodman also propose an Interpolated Kneser-Ney smoothing:

where is a normalization constant such that the probabilities sum to 1.

Reference:

[1]. An Empirical Study of Smoothing Techniques for Language Modeling. S. Chen, J. Goodman. Technical Report. 1998.

[2]. A Bit of Progress in Language Modeling Extended Version. J. Goodman. Technical Report. 2001.