Netflix is offering one million dollars for a better recommendation engine. Better recommendations are clearly worth a lot.
But what are better recommendations? What do we mean by better?
In the Netflix Prize, the meaning of better is quite specific. It is the root mean squared error (RMSE) between the actual ratings Netflix customers gave the movies and the predictions of the algorithm.
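To make that metric concrete, here is a minimal sketch of an RMSE computation, using made-up ratings on a 1-5 star scale rather than any real Netflix data:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted ratings."""
    assert len(actual) == len(predicted)
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

# Hypothetical ratings: what four customers actually rated four movies,
# and what the algorithm predicted they would rate them.
actual = [4.5, 3.0, 2.0, 5.0]
predicted = [4.0, 3.5, 2.5, 4.0]
print(rmse(actual, predicted))  # about 0.66 stars of average error
```

Lower is better: a perfect predictor would score 0.0, and the contest asks entrants to push this number 10% below Netflix's own Cinematch score.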
Let's say we build a recommender that wins the contest. We reduce the error between our predictions and what people actually rate by 10% over what Netflix's existing system could do. Is that good?
Depending on what we want, it might be very good. If what we want is to show people how much they might like any given movie, being as accurate as possible across all movies is exactly the right goal.
However, this might not be what we want. Even in a feature that shows people how much they might like any particular movie, people care a lot more about misses at the extremes. For example, it could be much worse to say that you will be lukewarm (a prediction of 3 1/2 stars) on a movie you love (an actual of 4 1/2 stars) than to say you will be slightly less lukewarm (a prediction of 2 1/2 stars) on a movie you are lukewarm about (an actual of 3 1/2 stars).
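To make that point concrete: under plain squared error, those two misses are indistinguishable. Both predictions are off by exactly one star, so RMSE weights them identically, even though users would feel them very differently. A quick sketch (the ratings are the hypothetical ones from the example above):

```python
# (actual, predicted) pairs for the two scenarios described above.
loved_movie = (4.5, 3.5)     # a movie you love, predicted lukewarm
lukewarm_movie = (3.5, 2.5)  # a lukewarm movie, predicted slightly colder

# Squared error, the quantity RMSE aggregates, is the same for both.
err_loved = (loved_movie[0] - loved_movie[1]) ** 2
err_lukewarm = (lukewarm_movie[0] - lukewarm_movie[1]) ** 2
print(err_loved, err_lukewarm)  # 1.0 1.0 -- RMSE cannot tell them apart
```

A metric that matched user perception might weight the first miss more heavily than the second; RMSE, by construction, cannot.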
Moreover, what we often want is not to make a prediction for any movie, but to find the best movies. In TopN recommendations, a recommender is trying to pick the best 10 or so items for someone. It does not matter whether you can predict what people will hate, or distinguish shades of lukewarm. The only thing that matters is picking 10 items someone will love.
A recommender that does a good job predicting across all movies might not do the best job predicting the TopN movies. RMSE penalizes an error on a movie you have no interest in seeing just as heavily as an error on a great movie, but perhaps what we really care about is minimizing the error when predicting great movies.
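A metric closer to what TopN cares about is precision at N: of the N items shown, how many did the user actually love? A minimal sketch, with names and data of my own invention rather than any standard library:

```python
def precision_at_n(recommended, loved, n=10):
    """Fraction of the top-n recommended items the user actually loved."""
    top_n = recommended[:n]
    return sum(1 for item in top_n if item in loved) / len(top_n)

# Hypothetical ranked recommendations and the set of items the user loved.
recommended = ["A", "B", "C", "D", "E"]
loved = {"A", "C", "E", "F"}
print(precision_at_n(recommended, loved, n=5))  # 3 of 5 hits -> 0.6
```

Note what this metric ignores: predictions for everything outside the top N. An algorithm could be wildly wrong about movies the user would hate and still score perfectly here, which is exactly the behavior RMSE punishes and TopN does not.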
There are parallels here with web search. Web search engines primarily care about precision (relevant results in the top 10 or top 3). They care about recall only when someone would notice that something they need is missing from the results they are likely to see. Search engines do not care about errors scoring arbitrary documents, just their ability to find the top N documents.
Aggravating matters further, in both recommender systems and web search, people's perception of quality is easily influenced by factors other than the items shown. People hate slow websites and perceive slowly appearing results to be worse than fast appearing results. Differences in the information provided about each item (especially missing data or misspellings) can influence perceived quality. Presentation issues, even the color of the links, can change how people focus their attention and which recommendations they see. People trust recommendations more when the engine can explain why it made them. People like recommendations that update immediately when new information is available. Diversity is valued; near-duplicates are disliked. New items attract attention, but people tend to judge unfamiliar or unrecognized recommendations harshly.
In the end, what we want is happy, satisfied users. Will a recommendation engine that minimizes RMSE make people happy?