I am working on Bayesian matrix factorization with the MovieLens dataset. The data form an $n \times d$ matrix of $n = 943$ users and $d = 1682$ movies, where users assign a rating (1-5) to movies. This results in a very sparse matrix, since only 100,000 of the $943 \times 1682$ possible entries are observed.
I define the likelihood as $P(X \mid W, Z, \beta) = \prod_{(i,j) \in \text{Obs}} \mathcal{N}(X_{ij} \mid Z_i W_j^T, \beta^{-1})$, where $Z$ is an $n \times k$ matrix and $W$ is a $d \times k$ matrix with $k$ the number of latent features, and $\text{Obs}$ is the set of observed entries of the original matrix $X$.
I want to compute the posterior probability $P(Z, W, \beta \mid X)$.
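For concreteness, assuming independent priors on $Z$, $W$, and $\beta$, Bayes' rule gives this posterior up to a normalizing constant as

$$P(Z, W, \beta \mid X) \propto P(X \mid Z, W, \beta)\, P(Z)\, P(W)\, P(\beta),$$

which is what MCMC samples from.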
I implemented this in Stan (using simple priors just to check that everything works):
```stan
data {
  int<lower=0> n_users;     // number of users (943)
  int<lower=0> n_movies;    // number of movies (1682)
  int<lower=0> n_features;  // number of latent features
  int<lower=1> n_entries;   // number of observed entries
  matrix[n_users, n_movies] rating;
}
parameters {
  matrix[n_features, n_movies] W;
  matrix[n_users, n_features] Z;
}
model {
  // Priors
  to_vector(W) ~ normal(0, 1);
  to_vector(Z) ~ normal(0, 1);
  // Likelihood (note: this treats every cell of the matrix as observed)
  to_vector(rating) ~ normal(to_vector(Z * W), 1.0);
}
```
The problem is that it takes a huge amount of time to compute the posterior. Is there a way to speed up the computation? I know that I should somehow include only the observed entries, but I can't figure out how to do it.
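For reference, one common approach I have seen (a sketch, not a tested solution) is to pass the ratings in long (coordinate) format: three parallel arrays holding the user index, movie index, and value of each observed rating, so the model never touches the missing cells. The data-block names (`user`, `movie`) are my own choice, and I store $W$ as movies $\times$ features so its rows can be indexed directly; `rows_dot_product` and multiple indexing like `Z[user]` are standard Stan features:

```stan
data {
  int<lower=1> n_users;
  int<lower=1> n_movies;
  int<lower=1> n_features;
  int<lower=1> n_entries;                               // number of observed ratings
  array[n_entries] int<lower=1, upper=n_users> user;    // user index of each rating
  array[n_entries] int<lower=1, upper=n_movies> movie;  // movie index of each rating
  vector[n_entries] rating;                             // the observed ratings only
}
parameters {
  matrix[n_users, n_features] Z;
  matrix[n_movies, n_features] W;   // note: stored movies x features here
}
model {
  // Priors
  to_vector(Z) ~ normal(0, 1);
  to_vector(W) ~ normal(0, 1);
  // Likelihood over observed entries only:
  // Z[user] and W[movie] select one row per observation, and
  // rows_dot_product computes Z[user[e]] . W[movie[e]] for each e
  rating ~ normal(rows_dot_product(Z[user], W[movie]), 1.0);
}
```

This evaluates 100,000 likelihood terms instead of ~1.6 million, which should reduce the cost per gradient evaluation accordingly.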