The Biggest Gains Come From Knowing Your Data

Machine learning is hard. It can be awfully tempting to try to skip the work. Can’t we just download a machine learning package? Do we really need to understand what we are doing?

It is true that off-the-shelf algorithms are a fast way to get going and experiment. Just plug in your data and go.

The trouble comes only if development stops there. By understanding the peculiarities of your data and what people want and need on your site, and by experimenting and learning, you can likely outperform a generic system.

A great example of how understanding the peculiarities of your data can help came out of the Netflix Prize. Progress on the $1M prize largely stalled until Gavin Potter discovered peculiarities in the data, including that people interpret the rating scale differently.
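
To make the rating-scale point concrete, here is a minimal sketch of one common way to compensate for it: express each rating as an offset from that user's own average. This is an illustration only, not Potter's actual method; the function name and the sample data are made up for the example.

```python
from collections import defaultdict


def center_ratings_by_user(ratings):
    """Re-express each rating as an offset from that user's mean rating.

    `ratings` is a list of (user, movie, rating) tuples. The result has the
    same order, with each rating replaced by (rating - user's mean).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for user, _movie, rating in ratings:
        totals[user] += rating
        counts[user] += 1

    means = {user: totals[user] / counts[user] for user in totals}
    return [(user, movie, rating - means[user]) for user, movie, rating in ratings]


# Hypothetical data: alice rates everything high, bob rates everything low.
ratings = [
    ("alice", "A", 5), ("alice", "B", 4), ("alice", "C", 5),
    ("bob", "A", 3), ("bob", "B", 1), ("bob", "C", 2),
]
centered = center_ratings_by_user(ratings)
```

After centering, bob's 3 for movie A becomes a positive offset and alice's 4 for movie B becomes a negative one, even though the raw numbers suggest the opposite ordering.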

More recently, Yehuda Koren found additional gains by supplementing the models to allow for temporal effects: people tend to rate older movies higher, movies rated together in a short time window tend to be more related, and people over time might start rating all the movies they see higher or lower.
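
As a rough illustration of the last of those effects, the sketch below fits a straight line to one user's ratings over time, so that the user's baseline can rise or fall instead of staying fixed. This is a deliberately simplified stand-in, not Koren's model; the function names and the sample history are assumptions made for the example.

```python
def fit_linear_drift(observations):
    """Least-squares line through one user's (day, rating) pairs.

    Returns (intercept, slope). A positive slope means the user's ratings
    have been creeping upward over time.
    """
    n = len(observations)
    if n == 0:
        raise ValueError("need at least one observation")

    mean_day = sum(day for day, _ in observations) / n
    mean_rating = sum(rating for _, rating in observations) / n
    var_day = sum((day - mean_day) ** 2 for day, _ in observations)
    if var_day == 0:
        # All ratings on the same day: no drift can be estimated.
        return mean_rating, 0.0

    cov = sum((day - mean_day) * (rating - mean_rating)
              for day, rating in observations)
    slope = cov / var_day
    return mean_rating - slope * mean_day, slope


def baseline_at(intercept, slope, day):
    """The user's expected rating level on a given day."""
    return intercept + slope * day


# Hypothetical user whose ratings rise by about one star every 100 days.
history = [(0, 2.0), (100, 3.0), (200, 4.0)]
intercept, slope = fit_linear_drift(history)
later_baseline = baseline_at(intercept, slope, 300)
```

A time-unaware model would use this user's overall average everywhere, which would be too high for the early ratings and too low for the recent ones.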

In both cases, looking closely at the data, better understanding how people behave, and then adapting the models yielded substantial gains. Combined with other work, that was enough to win the million-dollar prize.

The Netflix Prize followed a pattern you often see when people try to implement a feature that requires machine learning. Most of the early attempts throw off-the-shelf algorithms at the data, yielding something that works, but not with particularly impressive results.

Without a clear metric for success and a way to test against it, development often stops there. But the Netflix Prize, like Google and Amazon with their ubiquitous A/B testing, had both.
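
The Netflix Prize scored entries by root mean squared error (RMSE) on held-out ratings. A minimal sketch of testing two candidate models against that kind of metric might look like the following; the model names and the numbers are invented for illustration.

```python
import math


def rmse(predictions, actuals):
    """Root mean squared error between predicted and actual ratings."""
    if len(predictions) != len(actuals) or not actuals:
        raise ValueError("need two non-empty lists of equal length")
    squared_error = sum((p - a) ** 2 for p, a in zip(predictions, actuals))
    return math.sqrt(squared_error / len(actuals))


# Hypothetical held-out ratings and two competing sets of predictions.
held_out = [4, 2, 5, 1]
generic_model = [3, 3, 3, 3]   # always predicts the same rating
tuned_model = [4, 3, 4, 2]     # adapted to the data

scores = {
    "generic": rmse(generic_model, held_out),
    "tuned": rmse(tuned_model, held_out),
}
best = min(scores, key=scores.get)  # lower RMSE is better
```

With one number to compare, every change to a model can be kept or thrown away on the evidence.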

There are a lot of lessons that can be taken from the Netflix contest, but a big one should be the importance of constant experimentation and learning. By competing algorithms against each other, by looking carefully at the data, by thinking about what people want and why they do what they do, and by continuous testing and experimentation, you can reap big gains.