If your email address has ever landed on a spammer's list, you know what it's like for your inbox to be flooded with junk email day after day after day. To prevent this, spam filters were created, relying on a mixture of brute force computing with passive learning and refined processing with active learning. And while spam filters have become more sophisticated and your inbox is increasingly free of junk email, the theory behind active learning has lagged. In the last few years, however, the field has taken off.
"There's been surprisingly rapid progress," says Steve Hanneke, a Ph.D. student at Carnegie Mellon University. "If you look back five years, there was really very little known about what makes something an informative example, how important are they, and how much improvement we can expect. But we now have in the published literature a pretty clear picture of just how much improvement we can expect in active learning and what we mean by an informative example."
The difference between passive learning and active learning is about the teacher, and how much time the teacher wants to spend teaching. Passive learning requires large data sets, and the teacher has to label countless examples for the learner. Once every example is labeled, the data set is given to the learner, which then finds patterns that will allow it to sort future data correctly.
The obvious drawback is that passive learning requires a lot of time, and that's where active learning enters the picture.
In active learning, all of the examples are provided to the learner unlabeled. An algorithm analyses the unlabeled data set, and asks the teacher for labels. After the algorithm determines the basic shape of each label set, it asks the teacher to define the ambiguous examples in-between the various labels. By labeling only the most informative examples, the hope is that fewer labels will be needed to tell the difference between, for instance, a spam email message offering you a free cruise and an email message from your parents telling you about their Alaskan cruise.
"By allowing the algorithm to interactively choose the 'most informative' samples to be labeled, the algorithm will be able to produce a much more accurate model using fewer labeled samples, meaning that we have to do a lot less tedious labeling by hand to get accurate prediction models," says Jenn Wortman, a Ph.D. student at the University of Pennsylvania. "It is possible to prove that in some special cases, active learning algorithms can produce models with the same accuracy as models built by passive learning algorithms while requiring exponentially fewer labeled examples."
Of course, deciding what makes an example informative or not is a huge challenge. Little progress had been made until 2005 when Sanjoy Dasgupta, a professor of computer science at the University of California, San Diego, published a paper, "Coarse Sample Complexity Bounds for Active Learning," which quantified how many labeled examples are needed to find a pattern with active learning over passive learning. The result was significantly less with active learning, but some basic assumptions were needed to see such dramatic outcomes.
One of the most problematic assumptions was that there could be no mislabeled examples causing noise, which would be unreasonable in real-world applications. The following year, however, Maria-Florina Balcan, Alina Beygelzimer, and John Langford published a paper, "Agnostic Active Learning," which described the first active learning algorithm that could work with noise but still provided improvement over passive learning. The algorithm, A2, labeled examples from a region of disagreement at random, eliminating certain hypotheses until a pattern emerged that was as close as possible to a true understanding.
One of the major improvements of A2 was finding a threshold function. With passive learning, every piece of data must be analyzed between two bounds before the threshold can be reliably established. With active learning, however, the algorithm can narrow down the threshold value in an almost binary search, resulting in exponentially fewer labels.
Additional research by Hanneke demonstrated that the algorithm performed well as long as the samples in the region of disagreement weren't too diverse and didn't differ in too many ways. Also, further refinements addressed the sampling bias created by labeling only the most informative examples. In essence, the labeled examples were not being chosen at random, and future data distributions might differ sharply from the labeled examples selected by the algorithm.
To correct the sampling bias, or generalization error, Daniel Hsu, a Ph.D. student at University of California, San Diego, and Dasgupta recommended a scheme where each chosen example has an importance weighting determined by its selection probability. More likely points have higher weightings, and less likely points have lower weightings. As a result, the algorithm could realize when it had selected an unlikely point that was very informative and correct its expectations for any future data.
With several solid years of theoretical work behind them, the active learning researchers are beginning to tackle an even larger challenge than developing the theoretical underpinnings of active learning. Now, they are trying to link theoretical algorithms with large-scale applications.
"There's a striking difference between what's known in practice and theory with active learning," says Hsu. "The theory is trying to catch up with what is known about what works well in practice, and to be able to say something mathematical about what works and what doesn't."
The University of Pennsylvania's Wortman agrees. "[Active learning] is used quite frequently in practice with impressive results, and it's a subject of increasing interest in the theory community, but there's still a huge gap between theory and practice. At a fundamental level, we don't really know 'why' the particular active learning algorithms that are used in practice work."
Toward that goal, some theorists are looking to find general methods of building algorithms that can be used in any situation. General purpose-built algorithms have existed for decades in passive learning. However, with active learning, an algorithm must almost be built from scratch for each new application.
At Carnegie Mellon, Hanneke is working on a type of "meta-algorithm" that can take established passive learning algorithms and churn out an active learning algorithm with passive learning algorithms as subroutines. His hope is that this new work, which he calls "activized learning," will have the benefit of using established passive learning, alongside active learning with its guaranteed improvements, on the number of labels needed.
Maria-Florina Balcan, a post-doctoral student with Microsoft Research New England, believes that active learning algorithms are too delicate for such a general method to exist. Instead, Balcan is broadening the type of interaction that active learning can handle beyond just labeling unlabeled examples. She points out that in other situations another interaction method might be more natural.
Take the example of trying to classify images of people by gender, says Balcan. If the interaction is changed, the teacher might be able complain to the learner that some classifiers are too general and need to be split, or that some are too similar and need to be merged. The teacher might suggest a new class to be added, or notice that the learning is making certain mistakes and suggest a new zone that will eliminate those classification errors.
"Much of the theoretical and practical work on active learning was developed on a very specific model and function of interaction," says Balcan. "But the interaction could be different. The main direction that I see for improvement in the future is extending the type of interaction between the learning algorithm and the user."
Given the progress made in the past several years, with more horizons opening up, the future of active learning appears to be very promising. "We now have active learning algorithms providing substantial guarantees on performance while relying on relatively minimal assumptions about the world," notes John Langford, doctor of learning at Yahoo! Research. "We have a reasonable understanding of where and when they provide performance improvements over passive learning. These algorithms are also practical, often yielding substantial savings in label complexity over passive learning approaches."
©2009 ACM 0001-0782/09/0400 $5.00
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2009 ACM, Inc.
No entries found