Research Directions For Machine Learning and Algorithms

John Langford, Microsoft Research New York

Muthu invited me to the workshop on algorithms in the field, with the goal of providing a sense of where near-term research should go. When the time came, though, I bargained for a post instead, which provides a chance for many other people to comment.

There are several things I didn’t fully understand when I went to Yahoo! about five years ago. I’d like to repeat them as people in academia may not yet understand them intuitively.

Almost all the big impact algorithms operate in pseudo-linear or better time. Think about caching, hashing, sorting, filtering, etc… and you have a sense of what some of the most heavily used algorithms are. This matters quite a bit to Machine Learning research, because people often work with superlinear time algorithms and languages. Two very common examples of this are graphical models, where inference is often a superlinear operation—think about the n² dependence on the number of states in a Hidden Markov Model and Kernelized Support Vector Machines where optimization is typically quadratic or worse. There are two basic reasons for this. The most obvious is that linear time allows you to deal with large datasets. A less obvious but critical point is that a superlinear time algorithm is inherently buggy—it has an unreliable running time that can easily explode if you accidentally give it too much input.
Almost anything worth doing requires many people working together. This happens for many reasons. One is the time-critical aspect of development—in many places it really is worthwhile to pay more to have something developed faster. Another is that real projects are simply much bigger than you might otherwise expect. A third is that real organizations have people coming and going, and any project that is just by one person whithers when they leave. This observation is what means that the development of systems with clean abstractions can be extraordinarily helpful, as it allows people to work independently. This observation also means that simple widely applicable tricks (for example the hashing trick) can be broadly helpful.

A good way to phrase research directions is with questions. Here are a few of my natural questions.

How do we efficiently learn in settings where exploration is required? These are settings where the choice of action you take influences the observed reward—ad display and medical testing are two good scenarios. This is deeply critical to many applications, because the learning with exploration setting is inherently more natural than the standard supervised learning setting. The tutorial we did details much of the state of the art here, but very significant questions remain. How can we do effective offline evaluation of algorithms (see here for a first attempt)? How can we be both efficient in sample complexity and computational complexity? Several groups are interested in sampling from a Bayesian posterior to solve these sorts of problems. Where and when can this be proved to work? (There is essentially no analysis.) What is a maximally distributed and incentive compatible algorithm that remains efficient? The last question is very natural for market place design. How can we best construct reward functions operating on different time scales? What is the relationship between the realizable and agnostic versions of this setting, and how can we construct an algorithm which smoothly interpolates between the two?
How can we learn from lots of data? We’ll be presenting a KDD Survey/Tutorial about what’s been done. Some of the larger scale learning problems have been addressed effectively using MapReduce. The best example here I know is known as Ozgur Cetin’s algorithm at Y!—It’s preconditioned conjugate gradient with a Newton stepsize using two passes over examples per step. (A nonHadoop version is implemented in VW for reference.) But linear predictors are not enough—we would like learning algorithms that can for example learn from all the images in the world. Doing this well plausibly requires a new approach and new learning algorithms. A key observation here is that the bandwidth required by the learning algorithm can not be too great.
How can we learn to index efficiently? The standard solution in information retrieval is to evaluate (or approximately evaluate) all objects in a database returning the elements with the largest score according to some learned or constructed scoring function. This is an inherently O(n) operation, which is frustrating when it’s plausible that an exponentially faster O(log(n)) solution exists. A good solution involves both theory and empirical work here, as we need to think about how to think about how to solve the problem, and of course we need to solve it.
What is a flexible inherently efficient language for architecting representations for learning algorithms? Right now, Graphical models often get (mis)used for this purpose. It’s easy and natural to pose a computationally intractable graphical model, implying many real applications involve approximations. A better solution would be to use a different representation language which was always computationally tractable yet flexible enough to solve real-world problems. One starting point for this is Searn. Another general approach was the topic of the Coarse to fine workshop. These are inherently related as Coarse to fine is a pruned breadth first search. Restated, it is not enough to have a language for specifying your prior structural beliefs—instead we must have a language for such which results in computationally tractable solutions.
The Deep Learning Problem remains interesting. How do you effectively learn complex nonlinearities capable of better performance than a basic linear predictor? An effective solution avoids feature engineering. Right now, this is almost entirely dealt with empirically, but theory could easily have a role to play in phrasing appropriate optimization algorithms, for example.

Good solutions to each of the research directions above would result in revolutions in their area, and everyone of them would plausibly see wide applicability.

What’s missing?

Cross-posted on hunch.net.