From Matrix Factorization to Factorization Machines

Matrix Factorization

In recommender systems, Matrix Factorization maps both the users and the items to a joint latent factor space of dimension k, such that user-item interactions can be modeled as inner products in this space. That is to say, we map each user u to a vector \mathbf{p}_u\in \mathbb{R}^k, and each item i to a vector \mathbf{q}_i \in \mathbb{R}^k. For movie recommendation, each dimension of the latent factor space can be interpreted as a topic, say comedy vs. drama, or another feature such as the amount of action or the orientation to children.
Given the latent vectors for user u and item i, we can predict the interaction between them as

\hat{r}_{ui}=\mathbf{q}_i^T\mathbf{p}_u\tag{1}

The major challenge is to compute the latent vector for each user and item. This is quite similar to Singular Value Decomposition (SVD), a well-established technique for identifying latent semantic factors in information retrieval. Applying SVD in the collaborative filtering domain requires factoring the user-item rating matrix, which often raises difficulties due to the high proportion of missing values caused by the sparseness of the rating matrix. One approach is to rely on imputation to fill in the missing ratings and make the matrix dense. However, imputation significantly increases the amount of data, and inaccurate imputation might distort it.

Matrix Factorization is a method which focuses only on the observed ratings, while avoiding overfitting through regularization. The cost function for matrix factorization is

\min_{\mathbf{q}^*, \mathbf{p}^*}\sum_{(u, i) \in \kappa}(r_{ui} - \mathbf{q}_i^T\mathbf{p}_u)^2 + \lambda (||\mathbf{q}_i||^2 + ||\mathbf{p}_u||^2)\tag{2}

Here \kappa is the set of (u, i) pairs for which r_{ui} is known in the training set, and \lambda (||\mathbf{q}_i||^2 + ||\mathbf{p}_u||^2) is the regularization term.

Several learners have been proposed to optimize Matrix Factorization, including SGD, ALS, Coordinate Descent and MCMC.

Optimization by SGD

SGD loops through all ratings in the training set. For each rating r_{ui} in the training set, let e_{ui} \stackrel{\text{def}}{=} r_{ui} - \mathbf{q}_i^T \mathbf{p}_u. Then it modifies the parameters in the opposite direction of the gradient, yielding:

\mathbf{q}_i \gets \mathbf{q}_i + \gamma(e_{ui} \mathbf{p}_u -\lambda \mathbf{q}_i)
\mathbf{p}_u \gets \mathbf{p}_u + \gamma (e_{ui}\mathbf{q}_i -\lambda \mathbf{p}_u)
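To make the update rules concrete, here is a minimal numpy sketch (the function name, the (u, i, r) triple format and the hyper-parameter defaults are illustrative assumptions, not from the referenced papers):

```python
import numpy as np

def mf_sgd(ratings, n_users, n_items, k=10, gamma=0.01, lam=0.1, n_epochs=20):
    """Factorize the observed ratings with SGD; `ratings` is a list of (u, i, r) triples."""
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(n_users, k))  # user latent vectors p_u
    Q = rng.normal(scale=0.1, size=(n_items, k))  # item latent vectors q_i
    for _ in range(n_epochs):
        for u, i, r in ratings:
            e = r - Q[i] @ P[u]                   # e_ui = r_ui - q_i^T p_u
            qi, pu = Q[i].copy(), P[u].copy()     # use old values in both updates
            Q[i] += gamma * (e * pu - lam * qi)
            P[u] += gamma * (e * qi - lam * pu)
    return P, Q
```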

Optimization by ALS

Alternating Least Squares (ALS) is a popular approach for optimizing regression models such as MF. It works by iteratively optimizing one block of parameters while leaving the others fixed. The prerequisite of ALS is that each optimization sub-problem can be solved analytically.

First, minimizing J with respect to the user latent vector \mathbf{p}_u is equivalent to minimizing:

J_u = ||\mathbf{r}_u-\mathbf{Q}\mathbf{p}_u||^2 + \lambda ||\mathbf{p}_u||^2

Here \mathbf{r}_u is the vector of ratings observed for user u, and the rows of \mathbf{Q} are the latent vectors of the corresponding items. The minimum is where the first-order derivative is 0:

\frac{\partial J_u}{\partial \mathbf{p}_u} = 2\mathbf{Q}^T\mathbf{Q}\mathbf{p}_u - 2\mathbf{Q}^T\mathbf{r}_u + 2\lambda \mathbf{p}_u = 0
\Rightarrow \mathbf{p}_u = (\mathbf{Q}^T\mathbf{Q} + \lambda \mathbf{I})^{-1}\mathbf{Q}^T\mathbf{r}_u

Following the same process, we can get the solution for \mathbf{q}_i.
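Here is a minimal numpy sketch of one ALS half-step, assuming the ratings come as a dense matrix R with a boolean mask of observed entries (a real implementation would use sparse structures):

```python
import numpy as np

def als_update_users(R, observed, Q, lam):
    """Solve p_u = (Q_u^T Q_u + lam*I)^{-1} Q_u^T r_u for every user, where
    Q_u keeps only the rows of Q for the items user u has rated."""
    n_users, k = R.shape[0], Q.shape[1]
    P = np.zeros((n_users, k))
    for u in range(n_users):
        rated = observed[u]            # boolean mask over items
        Qu, ru = Q[rated], R[u, rated]
        P[u] = np.linalg.solve(Qu.T @ Qu + lam * np.eye(k), Qu.T @ ru)
    return P
```

A full ALS iteration alternates this user update with the symmetric item update until the cost (2) stops decreasing.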

Adding Biases

Equation (2) interprets the rating r_{ui} purely as an interaction between the user u and the item i, but in fact the rating values can also be due to effects associated with either users or items. For example, some users tend to give higher ratings than others, and some items are in general better than others. To account for such effects, we can add bias terms to (1):

\hat{r}_{ui} = \mu + b_i + b_u + \mathbf{q}_i^T\mathbf{p}_u\tag{3}

and the corresponding cost function is

\min_{\mathbf{p},\mathbf{q},b}\sum_{(u, i)\in \kappa}(r_{ui}-\mu-b_u-b_i-\mathbf{p}_u^T\mathbf{q}_i)^2 + \lambda(||\mathbf{p}_u||^2 + ||\mathbf{q}_i||^2 + b_u^2 + b_i^2)\tag{4}
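The SGD updates extend directly to the biased model; here is a sketch of one step, assuming \mu has been precomputed as the global rating mean (the function name and hyper-parameters are illustrative):

```python
import numpy as np

def sgd_step_biased(u, i, r, mu, b_u, b_i, P, Q, gamma=0.01, lam=0.1):
    """One SGD step for model (3) under cost (4); arrays are updated in place."""
    e = r - (mu + b_u[u] + b_i[i] + Q[i] @ P[u])
    b_u[u] += gamma * (e - lam * b_u[u])
    b_i[i] += gamma * (e - lam * b_i[i])
    qi, pu = Q[i].copy(), P[u].copy()
    Q[i] += gamma * (e * pu - lam * qi)
    P[u] += gamma * (e * qi - lam * pu)
```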

Handling User and Item Features

Matrix Factorization can also handle user and item features besides the rating information, for example implicit feedback such as purchase or browsing history, as well as user attributes.
We denote N(u) as the set of items for which user u has expressed implicit feedback, where each item j is associated with a factor vector \mathbf{x}_j \in \mathbb{R}^f. A user who showed a preference for the items in N(u) is then characterized by

|N(u)|^{-0.5}\sum_{j\in N(u)}\mathbf{x}_j\tag{5}

Denote A(u) as the set of attributes of user u, and associate a factor vector \mathbf{y}^a \in \mathbb{R}^f with each attribute, so that a user is also described through the set of user-associated attributes:

\sum_{a\in A(u)}\mathbf{y}^a\tag{6}

Then the predicted rating can be modeled by

\hat{r}_{ui} = \mu + b_i + b_u + \mathbf{q}_i^T[\mathbf{p}_u + |N(u)|^{-0.5}\sum_{j \in N(u)}\mathbf{x}_j + \sum_{a\in A(u)}\mathbf{y}^a]\tag{7}
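A sketch of prediction (7) in numpy, assuming the rows of X and Y hold the factor vectors \mathbf{x}_j and \mathbf{y}^a, and N_u and A_u are index lists for user u (all names are illustrative):

```python
import numpy as np

def predict_with_side_info(u, i, mu, b_u, b_i, P, Q, X, Y, N_u, A_u):
    """Prediction (7): mu + b_i + b_u + q_i^T [p_u + |N(u)|^{-0.5} sum_j x_j + sum_a y^a]."""
    profile = P[u].copy()
    if len(N_u) > 0:
        profile += X[N_u].sum(axis=0) / np.sqrt(len(N_u))  # implicit feedback
    if len(A_u) > 0:
        profile += Y[A_u].sum(axis=0)                      # user attributes
    return mu + b_i[i] + b_u[u] + Q[i] @ profile
```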

Factorization Machines

Although Matrix Factorization can model such implicit feedback and user or item attribute features, the Factorization Machine handles such features more directly. For each rating sample with user features (e.g. age, gender, rating history), item features (e.g. genres, price) and context features (e.g. time), we can encode all of these features into one feature vector and then build a factorization machine of degree d=2 to model the interactions between them:

\hat{y}(x) = w_0 + \sum_{i=1}^n w_i x_i + \sum_{i=1}^n\sum_{j=i+1}^n\langle \mathbf{v}_i, \mathbf{v}_j\rangle x_i x_j\tag{8}

where w_0 \in \mathbb{R}, \mathbf{w}\in \mathbb{R}^n, \mathbf{V}\in \mathbb{R}^{n\times k} are the model parameters that have to be estimated. A 2-way FM captures all single and pairwise interactions between variables:

  • w_0 is the global bias.
  • w_i models the strength of the i-th variable.
  • \hat{w}_{i,j} = \langle \mathbf{v}_i, \mathbf{v}_j \rangle models the interaction between the i-th and j-th variable.

Instead of using one independent model parameter w_{ij}\in \mathbb{R} for each interaction, the FM models the interaction by factorizing it. This is the key point which allows high-quality parameter estimates of higher-order interactions under sparsity.

As we can see, instead of projecting only the user and the item into the latent vector space as Matrix Factorization does, the Factorization Machine projects each feature into the latent vector space, which makes it much better suited to the various features associated with the user, item and context.
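Although (8) looks quadratic in the number of features n, Rendle's paper shows the pairwise term can be rewritten so that prediction costs O(kn). A minimal numpy sketch of this reformulation:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """2-way FM prediction (8) in O(kn), using the identity
    sum_{i<j} <v_i, v_j> x_i x_j
      = 0.5 * sum_f [(sum_i v_{i,f} x_i)^2 - sum_i v_{i,f}^2 x_i^2]."""
    s = V.T @ x                 # s[f]  = sum_i V[i, f] * x[i]
    s2 = (V ** 2).T @ (x ** 2)  # s2[f] = sum_i V[i, f]^2 * x[i]^2
    return w0 + w @ x + 0.5 * np.sum(s * s - s2)
```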

Field-Aware Factorization Machine

The Factorization Machine can effectively model the interaction between user and item, as well as the user-side and item-side features. But what if there are more than two dimensions? For example, in CTR prediction for computational advertising we may have User, Advertisement and Publisher, and there are interactions not only between User and Advertisement or between User and Publisher, but among all three dimensions. As shown above, the 2-way Factorization Machine uses a single latent vector per feature for all of its interactions. The Field-aware Factorization Machine instead learns one latent vector per (feature, field) pair, so a feature can interact differently with each dimension.
The formulation of FFM is

\min_\mathbf{w} \sum_{i=1}^L \log(1 + \exp(-y_i \phi(\mathbf{w}, \mathbf{x}_i))) + \frac{\lambda}{2}||\mathbf{w}||^2

where

\phi(\mathbf{w}, \mathbf{x}) = \sum_{(j_1, j_2) \in C_2}\langle \mathbf{w}_{j_1, f_2}, \mathbf{w}_{j_2, f_1} \rangle x_{j_1}x_{j_2}

where f_1 and f_2 are the fields of features j_1 and j_2 respectively, \mathbf{w}_{j_1, f_2} and \mathbf{w}_{j_2, f_1} are two vectors of length k, and C_2 is the set of feature pairs. Here a field means a dimension: for a recommender system there are typically 2 dimensions, User and Item; for CTR prediction there are typically 3 dimensions, User, Advertisement and Publisher. In practice we can build one field for each ID feature, or even for each categorical feature.
Here is a concrete example: say there is a sample with 4 fields (User, Movie, Genre, Price) and 5 features (Us-YuChin, Mo-3Idiots, Ge-Comedy, Ge-Drama, Price).

User (Us)     Movie (Mo)     Genre (Ge)                Price (Pr)
YuChin (Yu)   3Idiots (3I)   Comedy, Drama (Co, Dr)    $9.99

FFM then learns one latent vector for each (feature, field) pair, and the \phi(\mathbf{w}, \mathbf{x}) for this sample expands into the following 10 pairwise terms:
\langle \mathbf{w}_{Us-Yu, Mo}, \mathbf{w}_{Mo-3I, Us}\rangle x_{Us-Yu} x_{Mo-3I}
+ \langle \mathbf{w}_{Us-Yu, Ge}, \mathbf{w}_{Ge-Co, Us}\rangle x_{Us-Yu} x_{Ge-Co}
+ \langle \mathbf{w}_{Us-Yu, Ge}, \mathbf{w}_{Ge-Dr, Us}\rangle x_{Us-Yu} x_{Ge-Dr}
+ \langle \mathbf{w}_{Us-Yu, Pr}, \mathbf{w}_{Pr, Us}\rangle x_{Us-Yu} x_{Pr}
+ \langle \mathbf{w}_{Mo-3I, Ge}, \mathbf{w}_{Ge-Co, Mo}\rangle x_{Mo-3I} x_{Ge-Co}
+ \langle \mathbf{w}_{Mo-3I, Ge}, \mathbf{w}_{Ge-Dr, Mo}\rangle x_{Mo-3I} x_{Ge-Dr}
+ \langle \mathbf{w}_{Mo-3I, Pr}, \mathbf{w}_{Pr, Mo}\rangle x_{Mo-3I} x_{Pr}
+ \langle \mathbf{w}_{Ge-Co, Ge}, \mathbf{w}_{Ge-Dr, Ge}\rangle x_{Ge-Co} x_{Ge-Dr}
+ \langle \mathbf{w}_{Ge-Co, Pr}, \mathbf{w}_{Pr, Ge}\rangle x_{Ge-Co} x_{Pr}
+ \langle \mathbf{w}_{Ge-Dr, Pr}, \mathbf{w}_{Pr, Ge}\rangle x_{Ge-Dr} x_{Pr}
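A direct numpy sketch of \phi(\mathbf{w}, \mathbf{x}), assuming each sample is a list of (field, feature, value) triples and W[j, f] is the latent vector of feature j used when interacting with field f (this data layout is an illustrative assumption):

```python
import numpy as np

def ffm_phi(sample, W):
    """phi(w, x): sum over feature pairs of <w_{j1,f2}, w_{j2,f1}> x_{j1} x_{j2}.
    `sample` is a list of (field, feature, value) triples;
    W has shape (n_features, n_fields, k)."""
    phi = 0.0
    for a in range(len(sample)):
        f1, j1, x1 = sample[a]
        for b in range(a + 1, len(sample)):
            f2, j2, x2 = sample[b]
            phi += (W[j1, f2] @ W[j2, f1]) * x1 * x2
    return phi
```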

FFM has shown very promising results in CTR prediction (1st place in KDD Cup 2012).

Conclusion

In conclusion, Matrix Factorization projects the user and the item into a latent vector space to model the interaction between user and item. The Factorization Machine projects each feature into a latent vector space to support interactions among the various features associated with user, item and context. Finally, the Field-aware Factorization Machine learns a latent vector for each (feature, field) pair, so that interactions among more than two dimensions can each be modeled with their own vectors.

References:

  1. Matrix Factorization Techniques for Recommender Systems. Y. Koren et al.
  2. Matrix Factorization and Factorization Machine for Recommender Systems. Chih-Jen Lin.
  3. Factorization Machines. Steffen Rendle.
  4. 3 Idiots' Approach for Display Advertising Challenge. Yu-Chin Juan, Yong Zhuang, and Wei-Sheng Chin.
  5. Field-aware Factorization Machines. Yu-Chin Juan, Yong Zhuang, and Wei-Sheng Chin.
  6. Pairwise Interaction Tensor Factorization for Personalized Tag Recommendation. Steffen Rendle and Lars Schmidt-Thieme.

Please comment here or send an email to pandevirus@gmail for any suggestions.