Econometrics

Expectation Maximization in Collaborative Filtering
Jonathan Baker 2010

Abstract: Expectation maximization (EM) is a method of approximating maximum likelihood estimators (MLE) in models with missing data or latent variables. A straightforward application of EM is collaborative filtering (CF): using data from multiple agents to predict unreported values. In this paper, we show a simple method of applying EM to a large CF problem: predicting ratings in the Netflix Prize dataset.

1 Auxiliary Functions

EM belongs to a general class of optimization algorithms (footnote 1) that use successive locally approximating auxiliary functions. For an objective function $f : X \to \mathbb{R}$, call $g : X \times X \to \mathbb{R}$ an auxiliary function if

\[ \forall x, x_0 \in X: \quad g(x, x_0) \ge f(x), \qquad g(x, x) = f(x). \]

Then for any $x_0$, define the sequence $(x_n)_{n=0}^{\infty}$ by

\[ x_{n+1} = \arg\min_{x} \, g(x, x_n). \]

Footnote 1: For example, the Newton-Raphson method can also be stated in terms of auxiliary functions.


This sequence has a non-increasing image under $f$. This is easy to prove:

\[ f(x_{n+1}) \le g(x_{n+1}, x_n) \le g(x_n, x_n) = f(x_n). \]

The first inequality holds because $g(\cdot, x_n)$ dominates $f$, and the second because $x_{n+1}$ minimizes $g(\cdot, x_n)$. This idea may be clearer graphically in Figure 1. Each $g(\cdot, x_n)$ dominates $f(\cdot)$ but is equal to it at $x_n$. It is easy to see why we might hope that the minima of $g$ would approach the minimum of $f$.

Figure 1: An objective function f and the auxiliary function g on two iterations

Assuming $f$ is bounded below (and it really ought to be since we are trying to find its minimum value), the sequence $(f(x_n))_{n=0}^{\infty}$ is also bounded below. We have already shown that $(f(x_n))_{n=0}^{\infty}$ is monotonically non-increasing, and so it must converge. However, without more information we cannot guarantee that $(x_n)_{n=0}^{\infty}$ converges, let alone that it converges to a global minimizer. In specific applications, such as EM, we can say more about convergence.
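The auxiliary function iteration described in this section can be tried on a small example. The following is only a sketch under assumptions that are not part of the paper: the objective f(x) = ln(1 + e^x) - 0.3x is a hypothetical choice whose second derivative never exceeds 1/4, so the quadratic g below dominates f everywhere and touches it at x0.

```python
import math

def f(x):
    # Hypothetical objective (not from the paper): convex, with f'' <= 1/4.
    return math.log1p(math.exp(x)) - 0.3 * x

def f_prime(x):
    return 1.0 / (1.0 + math.exp(-x)) - 0.3

def g(x, x0):
    # Quadratic auxiliary function: g(x, x0) >= f(x) and g(x0, x0) = f(x0).
    return f(x0) + f_prime(x0) * (x - x0) + 0.125 * (x - x0) ** 2

def next_iterate(x0):
    # Closed-form minimizer of g(., x0).
    return x0 - 4.0 * f_prime(x0)

xs = [3.0]  # arbitrary starting point
for _ in range(25):
    xs.append(next_iterate(xs[-1]))
values = [f(x) for x in xs]

print(xs[-1], values[-1])
```

For this particular f the iterates also converge to the minimizer ln(0.3/0.7), but as noted above, that is a property of the example and not of auxiliary functions in general.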


2 Expectation Maximization

EM uses a form of this auxiliary function idea. Specifically, for observed data $x$, unobserved values $z$ (together forming the complete data), and probability density (or mass) function $f$, EM approximates the MLE of $\theta$ through the sequence of estimators

\[ \theta_{n+1} = \arg\max_{\theta} \, E\big[\ell(\theta \mid x, z) \,\big|\, x, \theta_n\big] \tag{1} \]

where

\[ \ell(\theta \mid x, z) := \ln f(x, z \mid \theta) \tag{2} \]

is the log-likelihood function. The function

\[ g(\theta, \theta_n) = -E\big[\ln f(x, z \mid \theta) \,\big|\, x, \theta_n\big] \tag{3} \]

is closely related (footnote 2) to an auxiliary function for the negative log-likelihood, such that the sequence given by (1) gives convergence of $\ell(\theta_n \mid x)$. A thorough discussion of convergence conditions of $\theta_n$ itself is available in [2] (in all our testing, the estimators appeared to converge without problems).

3 Collaborative Filtering

Collaborative filtering (CF) is the process of analyzing information collected from multiple agents in order to infer more information. CF techniques fall into two basic categories:
• User-based: Agents that usually agree would have agreed on the missing data
• Item-based: Items that multiple agents regard similarly have similar missing values
Footnote 2: Details of this relationship appear in the appendix.


For example, we see item-based filtering when Amazon tracks which items are similar to one another and, at check-out, suggests items similar to the customer's purchases. Criticker employs user-based filtering to recommend similar users' favorite films to each other. Other applications may call for a mixture of these strategies.

In all missing data problems, it is often convenient to suppose that the lack of response from an agent is not correlated with its response (no response bias). If this assumption is good, we call the unobserved data missing. If the fact that a data point is unobserved is significant, we will call it hidden. CF often deals with hidden rather than missing values (for example, customers are likely to use and rate primarily items they expect to like).
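The difference between missing and hidden values can be illustrated with a small simulation. This is a sketch on made-up data, not on the Netflix ratings: the uniform rating distribution and both observation rules are assumptions chosen only to make the effect visible.

```python
import random

random.seed(0)

# Hypothetical "true" ratings: integers 1-5, all equally likely.
true_ratings = [random.randint(1, 5) for _ in range(10000)]

# "Missing": every rating is observed with the same probability,
# regardless of its value.
missing_case = [r for r in true_ratings if random.random() < 0.6]

# "Hidden": the chance of observing a rating grows with the rating,
# mimicking users who mostly rate items they expected to like.
hidden_case = [r for r in true_ratings if random.random() < r / 5]

def mean(values):
    return sum(values) / len(values)

print(mean(true_ratings), mean(missing_case), mean(hidden_case))
```

Under these assumptions the mean of the observed ratings stays close to the true mean in the missing case, but is noticeably inflated in the hidden case.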

4 The Netflix Prize Dataset

In October 2006, Netflix Inc. announced a competition to design an algorithm giving better movie recommendations than Cinematch, Netflix's own algorithm. For this purpose, Netflix released the ratings (integers from 1 to 5 stars) given by about 480,000 users for 17,770 of the films Netflix rents. Included with the data were two lists of movie/user pairs:
• Probe set: A subset of the distributed dataset values. Netflix recommended training with this set: hiding the probe values from the algorithm, predicting the probe values, and comparing the predictions to the values actually provided
• Qualifying set: A set of movie/user pairs whose ratings were provided by the users but withheld by Netflix for testing

Competitors submitted predicted ratings for the movie/user pairs in the qualifying set. A submission would win if a randomly selected subset of these submitted ratings (when compared to the actual ratings) had a root-mean-squared-error (RMSE) lower than .8573 (a 10% improvement over Cinematch) (footnote 3).

The dataset consists of 100,480,507 ratings (integers from 1 to 5), each with an associated pair of ID numbers identifying the user giving the rating and the movie to which the rating was assigned. Netflix's suggested interpretation of the stars is
1. “Hated It”
2. “Didn't Like It”
3. “Liked It”
4. “Really Liked It”
5. “Loved It”

The distribution of these ratings is displayed in Figure 2. Also included were the date on which each rating was given and the title and release year of each movie.

Most users have not rated most of the movies, so about 98.82% of the possible 8,532,958,530 movie/user pairs have no reported values. Despite the sparsity of the data, there are still enough values to make computation difficult on a standard personal computer. For this reason, we will study a random subset of the users and movies. Figure 3 illustrates the sparsity of the subset we will focus on. However, we will also present the theory generally so anyone with the necessary computational power could analyze the entire set.
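The RMSE criterion and the sparsity of the studied subset can be checked with a few lines of arithmetic. In this sketch the function name and the three sample predictions are illustrative assumptions; the constants are the Cinematch and winning RMSE values from Table 1 and the counts from the caption of Figure 3.

```python
import math

def rmse(predicted, actual):
    # Root-mean-squared error between two equally long rating lists.
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    total = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(total / len(actual))

# Made-up predictions for three ratings: errors of -1, 0 and 2 stars.
example = rmse([4, 3, 5], [5, 3, 3])

# A 10% improvement over Cinematch's RMSE of 0.9525 (Table 1).
target = 0.9 * 0.9525
winner_improvement = 1 - 0.8567 / 0.9525

# Share of empty cells in the 60-user by 9000-movie subset (893 ratings).
subset_missing_fraction = 1 - 893 / (60 * 9000)

print(example, target, winner_improvement, subset_missing_fraction)
```

The computed target, 0.85725, agrees with the threshold of .8573 quoted above up to rounding.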

5 Model

We will suppose that for each user, ratings are distributed normally. That is, for each user $u$, $r_u$ is a vector of ratings for different movies drawn from a multivariate normal distribution

\[ r_u \sim N(\mu, \Sigma) \]

independent of the missing data. This is item-based filtering since we are essentially holding users constant and studying the properties of the movies. This is different from supposing that each movie's ratings have a multivariate normal distribution, which would be user-based filtering.

Footnote 3: The competition was won in July 2009 by a three-team conglomeration with an RMSE of .8567, just 20 minutes before another team submitted predictions with the same RMSE. Because of the tie, the earlier submission won.

Figure 2: Distribution of ratings

We choose to focus on item-based rather than user-based methods because
1. There are many more users than films: If correlation between users is calculated, we estimate more parameters than we have data (an ill-posed problem in general). Indeed, EM's approximated correlation matrix will be low-rank (singular) after a few iterations and EM cannot continue. We could use a subset of the data with more films than users, but this would be unrepresentative of the original data.
2. Results of item-based filtering will be easier to assess: We may be able to intuitively check the correlation matrix generated by EM if it represents correlations between movies: we may believe EM worked well if it predicts high correlation between movies of the same genre. We would not be able to examine a user-correlation matrix in this fashion since we have little information about the users.

Figure 3: Dots show the 893 existing ratings for 60 users on 9000 movies. Column 30 corresponds to Something's Gotta Give (2003): the most rated film in the subset.

We have chosen this model largely for its simplicity so that we may freely demonstrate results of using EM in CF problems. Some legitimate concerns with the model as a whole are
1. We should expect strong selection bias: It is unreasonable to suppose that whether a user has seen/rated a film is independent of how well the user liked the film. It seems very likely that users would tend to see and rate films that they will enjoy. We cannot justify this assumption except to lend simplicity to the model.
2. For each movie, ratings across users appear to be normal: The reverse (normality of a single user's ratings across movies) appears to be untrue. In fact, many users give the same rating to every movie. We have assumed the opposite for the reasons mentioned above.

6 Using the EM Algorithm

We now describe our prediction procedure in detail. Other than the EM update rule (which depends on our model), the process should be similar for any application. First, we removed some known ratings. We will predict these ratings based on the remaining data and compare the results to the actual values. We report results for the probe set specifically, but we considered analysis of this single set insufficient, so we repeated the process on many random subsets of the data.


From here on, when we refer to data, we will mean the data without the values we hid for testing purposes.

We next determine which movie/user pairs (and which model parameters) it will be possible to predict. If a certain component of a multivariate normal variable has never been observed, it is impossible to make predictions about that component. In our case, if a movie has never been seen, we can make no predictions about how well it will be received (let alone by a specific person). Similarly, if a certain component has been observed only once, it is not reasonable to estimate its variance. The extreme sparsity of the data (made worse by hiding some for tests) and using only a small portion of the data may mean we cannot predict many values. In fact, out of the 60 × 9000 section selected, only 9 of the 28 values of the probe set could be predicted (footnote 4).

For movies that have received only 1 rating (or all ratings of the same value) we may make the obvious predictions of its mean and rating from other users. We may even predict that its rating variance is 0, but because of the need to invert the covariance matrix, these movies should not be included in the actual algorithm.

We will suppose now that all pathological users and movies have been handled separately such that all movies have received at least 2 different ratings and that all users have provided at least 1 rating.

For describing the update process, it will be convenient to define notation for partitions of vectors and matrices. For lists of indices

\[ J = (j_1, j_2, \dots, j_p), \qquad K = (k_1, k_2, \dots, k_q) \]

define

\[
\xi_J := \begin{pmatrix} \xi_{j_1} \\ \xi_{j_2} \\ \vdots \\ \xi_{j_p} \end{pmatrix}, \qquad
A_{J,K} := \begin{pmatrix}
a_{j_1,k_1} & a_{j_1,k_2} & \cdots & a_{j_1,k_q} \\
a_{j_2,k_1} & a_{j_2,k_2} & \cdots & a_{j_2,k_q} \\
\vdots & \vdots & \ddots & \vdots \\
a_{j_p,k_1} & a_{j_p,k_2} & \cdots & a_{j_p,k_q}
\end{pmatrix}, \qquad
\forall i: \; A_{J,i} := \begin{pmatrix} a_{j_1,i} \\ a_{j_2,i} \\ \vdots \\ a_{j_p,i} \end{pmatrix}.
\]

Footnote 4: Most users and movies have more than 1 rating, so nearly all the values should be predictable from the full dataset.

For each user $u$, define $K_u$ as the list of indices $(k_1, k_2, \dots, k_p)$ such that each $r_{k_i,u}$ is known. Similarly define $K_u^c := (k_1^c, k_2^c, \dots, k_q^c)$ as the list of indices of $r_u$ corresponding to unknown data. Recall that we defined the ratings for each user $u$ as a column vector $r_u$. Let $r_u^{(t)}$ be the ratings for user $u$ with the $t$th estimates of the missing values in place. We will call the entire rating matrix

\[
R^{(t)} := \big[\, r_1^{(t)} \;\; r_2^{(t)} \;\; \cdots \;\; r_n^{(t)} \,\big] =
\begin{pmatrix}
r_{1,1}^{(t)} & r_{1,2}^{(t)} & \cdots & r_{1,n}^{(t)} \\
r_{2,1}^{(t)} & r_{2,2}^{(t)} & \cdots & r_{2,n}^{(t)} \\
\vdots & \vdots & \ddots & \vdots \\
r_{m,1}^{(t)} & r_{m,2}^{(t)} & \cdots & r_{m,n}^{(t)}
\end{pmatrix}.
\]

Notice that $R^{(t)}_{K_u,u}$ never depends on $t$ since the known values are never altered, so we simply call $R_{K_u,u}$ the vector of known values for user $u$.
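The partition notation maps directly onto array indexing. The sketch below uses NumPy with zero-based indices (the text's indices are one-based); the small arrays and the use of NaN to mark unknown ratings are illustrative assumptions, not the paper's data format.

```python
import numpy as np

xi = np.array([10.0, 11.0, 12.0, 13.0])
A = np.arange(16.0).reshape(4, 4)   # a[j, k] = 4*j + k

J = [0, 2]
K = [1, 3]

xi_J = xi[J]               # sub-vector of xi with entries J
A_JK = A[np.ix_(J, K)]     # sub-matrix with rows J and columns K
A_Ji = A[J, 1]             # column i = 1 restricted to rows J

# Known and unknown index lists for one user's rating column,
# assuming unknown ratings are stored as NaN.
r_u = np.array([4.0, np.nan, 2.0, np.nan])
K_u = np.flatnonzero(~np.isnan(r_u))    # indices of known ratings
K_u_c = np.flatnonzero(np.isnan(r_u))   # indices of unknown ratings

print(xi_J, A_JK, A_Ji, K_u, K_u_c, sep="\n")
```

`np.ix_` is used for the sub-matrix because indexing with `A[J, K]` would instead pair the two lists element by element.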


For our purposes, we took initial estimates of the parameters to be

\[
\mu^{(0)} = \begin{pmatrix} \mu_1^{(0)} \\ \mu_2^{(0)} \\ \vdots \\ \mu_m^{(0)} \end{pmatrix}, \qquad \Sigma^{(0)} = I_m
\]

where each $\mu_i^{(0)}$ is the mean of known ratings for movie $i$ and $I_m$ is the $m \times m$ identity matrix. No arbitrary initial estimates of the unknown ratings need be made ($R^{(0)}$ will be determined by $\mu^{(0)}$ and $\Sigma^{(0)}$).

We are now ready to describe the EM update process for our model:

1. Estimate unknown values as their expected values given known data and the current estimates $\mu^{(t)}$, $\Sigma^{(t)}$. Specifically, for each user $u$, update the estimates of the unknown values in $r_u$ by

\[
R^{(t)}_{K_u^c,u} = E\big[ R_{K_u^c,u} \,\big|\, R_{K_u,u}, \mu^{(t)}, \Sigma^{(t)} \big]
= \mu^{(t)}_{K_u^c} + \Sigma^{(t)}_{K_u^c,K_u} \big( \Sigma^{(t)}_{K_u,K_u} \big)^{-1} \big( R_{K_u,u} - \mu^{(t)}_{K_u} \big) \tag{4}
\]

2. Obtain new estimated parameters $\mu^{(t+1)}$, $\Sigma^{(t+1)}$ from known data and unknown data estimated in 1:

\[ \forall i: \quad \mu_i^{(t+1)} = \frac{1}{n} \sum_{u=1}^{n} R^{(t)}_{i,u} \tag{5} \]

\[ \Sigma^{(t+1)} = \frac{1}{n} R^{(t)} \big( R^{(t)} \big)^{T} - \mu^{(t+1)} \big( \mu^{(t+1)} \big)^{T} \tag{6} \]

3. Repeat 1-2 until the resulting changes in the estimated parameters are small.

The update (4) is simply the expected value of unknown components of a multivariate normal distribution given the known components (and their estimated means and covariances). The updates (5) and (6) are the MLE estimators of $\mu$, $\Sigma$ assuming the values just estimated in (4) were actual observations.
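The three steps above can be sketched in NumPy as follows. This is an illustration of updates (4), (5) and (6) as stated, not the code used for the reported results: the function name, the NaN encoding of unknown ratings, the stopping tolerance, and the synthetic 3-movie by 200-user data are all assumptions.

```python
import numpy as np

def em_impute(R, max_iter=200, tol=1e-6):
    """R: m x n array (movies x users); np.nan marks unknown ratings."""
    m, n = R.shape
    known = ~np.isnan(R)
    mu = np.nanmean(R, axis=1)   # mu^(0): mean of known ratings per movie
    sigma = np.eye(m)            # Sigma^(0) = I_m
    filled = R.copy()
    for _ in range(max_iter):
        # Step 1, update (4): conditional mean of each user's unknowns.
        for u in range(n):
            k = np.flatnonzero(known[:, u])
            c = np.flatnonzero(~known[:, u])
            if c.size == 0:
                continue
            weights = np.linalg.solve(sigma[np.ix_(k, k)], R[k, u] - mu[k])
            filled[c, u] = mu[c] + sigma[np.ix_(c, k)] @ weights
        # Step 2, updates (5) and (6): treat filled values as observations.
        new_mu = filled.mean(axis=1)
        new_sigma = filled @ filled.T / n - np.outer(new_mu, new_mu)
        change = max(np.abs(new_mu - mu).max(),
                     np.abs(new_sigma - sigma).max())
        mu, sigma = new_mu, new_sigma
        # Step 3: stop once the parameter estimates barely move.
        if change < tol:
            break
    return filled, mu, sigma

# Synthetic test data (hypothetical): three strongly correlated "movies".
rng = np.random.default_rng(0)
true_mu = np.array([3.0, 3.5, 4.0])
true_sigma = 0.2 * np.eye(3) + 0.8   # unit variances, correlations 0.8
complete = rng.multivariate_normal(true_mu, true_sigma, size=200).T
hide = rng.random(complete.shape) < 0.2
hide[0, hide.all(axis=0)] = False    # every user keeps at least 1 rating
R = np.where(hide, np.nan, complete)

filled, mu, sigma = em_impute(R)

# Compare against the naive prediction: each movie's mean known rating.
movie_means = np.broadcast_to(np.nanmean(R, axis=1, keepdims=True), R.shape)
em_rmse = np.sqrt(np.mean((filled[hide] - complete[hide]) ** 2))
baseline_rmse = np.sqrt(np.mean((movie_means[hide] - complete[hide]) ** 2))
print(em_rmse, baseline_rmse)
```

Because the initial covariance is the identity, the first pass simply fills each unknown rating with the movie's mean, which matches the remark that R(0) is determined by the initial parameter estimates. The sketch uses `np.linalg.solve` rather than an explicit inverse, and like the procedure in the text it will fail if a known-by-known block of the covariance estimate becomes singular.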

7 Results

Of the 28 probe ratings in the subset, only 9 could be predicted for reasons discussed in the previous section. The resulting RMSE of 0.8013 is surprisingly low (low enough to have won the Netflix Prize if it could have been achieved for the qualifying set). Unfortunately, this result is not typical. To illustrate this, consider the average RMSE when predicting the same number of randomly selected ratings: 100 trials resulted in an average RMSE of 1.7695, more than double that of these 9 in the probe set.

Another problem is the values that EM simply cannot predict. Failing to predict values is usually unacceptable. We should provide some means of predicting the difficult values and incorporate these additional errors into the RMSE. For example, we might simply predict the overall mean rating for all the difficult ratings. This increased the RMSE to 1.1241, worse than Cinematch's. The RMSEs, compared to those of significant algorithms, are listed in Table 1.

Table 1: EM's and Other Algorithms' RMSEs

Algorithm                                                       RMSE
EM: 9 Probe Values                                              0.8013
EM: 28 Probe Values (naïve predictions for difficult values)    1.1241
EM: 100 Random Trials                                           1.7695
Cinematch                                                       0.9525
BellKor's Pragmatic Chaos (contest winners)                     0.8567

Finding significance levels of EM estimators is relatively difficult, but we might expect to be able to judge the accuracy of the correlation matrix based on its predicted correlations between films' ratings. In the subset of films used in this study, the films with the highest predicted ratings correlation were Rudolph the Red-Nosed Reindeer (a 1964 stop-motion Christmas TV special) and Carandiru (a 2003 Brazilian film about a prison in São Paulo). It seems unlikely that these films' ratings should be so correlated since the films' contents seem very dissimilar. Such a dissatisfying result could be a product of our admittedly unreasonable assumptions, but a direct analysis is very difficult since no user rated both films.

This apparently nonsensical prediction is similar to a result found by principal component analysis (PCA). PCA finds latent effects in data by singular value decomposition (SVD), but cannot provide any interpretation of these effects. When each missing value is replaced with the average of the corresponding movie's and user's mean ratings, PCA predicts high similarity (footnote 5) between Elmo's World: The Street We Live On (a light-hearted, educational children's film) and Die Hard 2 (an intense action movie). The commonality of these movies is not apparent, but is indicated by the data.
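Finding the pair of films with the highest predicted ratings correlation amounts to rescaling the estimated covariance matrix and searching its off-diagonal entries. The following sketch shows one way to do this; the function name and the 3 × 3 covariance matrix are hypothetical and are not estimates from the dataset.

```python
import numpy as np

def most_correlated_pair(sigma):
    # Convert a covariance matrix to a correlation matrix.
    std = np.sqrt(np.diag(sigma))
    corr = sigma / np.outer(std, std)
    # Ignore the diagonal, where every correlation is trivially 1.
    off_diag = corr.copy()
    np.fill_diagonal(off_diag, -np.inf)
    i, j = np.unravel_index(np.argmax(off_diag), off_diag.shape)
    return (int(i), int(j)), float(corr[i, j])

# Hypothetical covariance estimate for three movies.
sigma = np.array([[1.0, 0.2, 0.9],
                  [0.2, 4.0, -0.4],
                  [0.9, -0.4, 1.0]])

pair, rho = most_correlated_pair(sigma)
print(pair, rho)
```

The search uses the signed correlation, so a strongly negative correlation would not be reported as the "highest".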

References
[1] Sean Borman. The expectation maximization algorithm: A short tutorial. 2004.
[2] Geoffrey McLachlan and Thriyambakam Krishnan. The EM Algorithm and Extensions. John Wiley and Sons, New York, 1996.

Appendix
We follow the demonstration given in [1] that minimizing (3) with respect to $\theta$ is equivalent to minimizing an auxiliary function for the negative of the observed-data log-likelihood, $-\ell(\theta) = -\ln f(x \mid \theta)$. The auxiliary function will be defined explicitly in (7). Note the necessity of the negative sign in (3) since auxiliary functions are defined for minimization.
Footnote 5: Based on similar scores in the most significant effects.


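Now, denoting the probability measure by $P$:

\begin{align*}
\ell(\theta) - \ell(\theta_n) &= \ln f(x \mid \theta) - \ln f(x \mid \theta_n) \\
&= \ln \int_z f(x \mid z, \theta) f(z \mid \theta) \, dP - \ln f(x \mid \theta_n) \\
&= \ln \int_z f(z \mid x, \theta_n) \frac{f(x \mid z, \theta) f(z \mid \theta)}{f(z \mid x, \theta_n)} \, dP - \ln f(x \mid \theta_n) \\
&\ge \int_z f(z \mid x, \theta_n) \ln \frac{f(x \mid z, \theta) f(z \mid \theta)}{f(z \mid x, \theta_n)} \, dP - \ln f(x \mid \theta_n) && \text{(Jensen's inequality)} \\
&= \int_z f(z \mid x, \theta_n) \ln \frac{f(x \mid z, \theta) f(z \mid \theta)}{f(z \mid x, \theta_n) f(x \mid \theta_n)} \, dP \\
&=: \Delta(\theta \mid \theta_n)
\end{align*}

We claim that the function

\[ G(\theta, \theta_n) := -\ell(\theta_n) - \Delta(\theta \mid \theta_n) \tag{7} \]

is an auxiliary function for $-\ell$ (and so helps maximize $\ell$). We have just demonstrated $-\ell(\theta) \le G(\theta, \theta_n)$. To finish proving the claim, we also need to show

\begin{align*}
G(\theta, \theta) &= -\ell(\theta) - \Delta(\theta \mid \theta) \\
&= -\ell(\theta) - \int_z f(z \mid x, \theta) \ln \frac{f(x \mid z, \theta) f(z \mid \theta)}{f(z \mid x, \theta) f(x \mid \theta)} \, dP \\
&= -\ell(\theta) - \int_z f(z \mid x, \theta) \ln \frac{f(x, z \mid \theta)}{f(x, z \mid \theta)} \, dP && \text{(Bayes' Rule)} \\
&= -\ell(\theta) - \int_z f(z \mid x, \theta) \ln(1) \, dP \\
&= -\ell(\theta)
\end{align*}

Minimizing (3) (as done in each iteration of EM) is equivalent to minimizing (7) (that is, the same sequence of estimates $\theta_n$ is generated) because

\begin{align*}
\arg\min_\theta G(\theta, \theta_n) &= \arg\min_\theta \left\{ -\ell(\theta_n) - \Delta(\theta \mid \theta_n) \right\} \\
&= \arg\min_\theta \left\{ -\ell(\theta_n) - \int_z f(z \mid x, \theta_n) \ln \frac{f(x \mid z, \theta) f(z \mid \theta)}{f(z \mid x, \theta_n) f(x \mid \theta_n)} \, dP \right\} \\
&= \arg\min_\theta \left\{ -\int_z f(z \mid x, \theta_n) \ln\big( f(x \mid z, \theta) f(z \mid \theta) \big) \, dP \right\} && \text{(dropping terms that are constant with respect to } \theta\text{)} \\
&= \arg\min_\theta \left\{ -\int_z f(z \mid x, \theta_n) \ln f(x, z \mid \theta) \, dP \right\} \\
&= \arg\min_\theta \left\{ -E\big[ \ln f(x, z \mid \theta) \,\big|\, x, \theta_n \big] \right\} \\
&= \arg\min_\theta g(\theta, \theta_n)
\end{align*}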
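The two auxiliary function properties shown above can be checked numerically on a toy model. The model here is an assumption made only for this check and does not appear in the paper: z is 0 or 1 with P(z = 1) = θ, x given z is normal with unit variance and mean 0 or 3, and a single value x = 1.2 is observed, so the integrals over z become two-term sums.

```python
import math

MEANS = (0.0, 3.0)   # hypothetical component means
X_OBS = 1.2          # hypothetical single observation

def joint(z, theta):
    # f(x, z | theta) = f(x | z) f(z | theta) at the observed x.
    prior = theta if z == 1 else 1.0 - theta
    density = math.exp(-0.5 * (X_OBS - MEANS[z]) ** 2) / math.sqrt(2.0 * math.pi)
    return density * prior

def log_lik(theta):
    # l(theta) = ln f(x | theta), summing the joint density over z.
    return math.log(joint(0, theta) + joint(1, theta))

def posterior(z, theta):
    # f(z | x, theta) by Bayes' rule.
    return joint(z, theta) / (joint(0, theta) + joint(1, theta))

def delta(theta, theta_n):
    # Delta(theta | theta_n); note f(z|x,theta_n) f(x|theta_n) = f(x,z|theta_n).
    return sum(posterior(z, theta_n)
               * math.log(joint(z, theta) / joint(z, theta_n))
               for z in (0, 1))

def G(theta, theta_n):
    # Auxiliary function (7) for the negative log-likelihood.
    return -log_lik(theta_n) - delta(theta, theta_n)

grid = [i / 20 for i in range(1, 20)]
max_violation = max(-log_lik(t) - G(t, tn) for t in grid for tn in grid)
max_gap_on_diagonal = max(abs(G(t, t) + log_lik(t)) for t in grid)
print(max_violation, max_gap_on_diagonal)
```

On this grid `max_violation` should not exceed zero beyond rounding error, reflecting that G dominates the negative log-likelihood, and `max_gap_on_diagonal` should be zero, reflecting equality when the two arguments coincide.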
