
Communications of the ACM

Research highlights

Technical Perspective: A Better Way to Learn Features


A typical machine learning program uses weighted combinations of features to discriminate between classes or to predict real-valued outcomes. The art of machine learning is in constructing the features, and a radically new method of creating features constitutes a major advance.

In the 1980s, the new method was backpropagation, which uses the chain rule to backpropagate error derivatives through a multilayer, feed-forward neural network and adjusts the weights between layers by following the gradient of the backpropagated error. This worked well for recognizing simple shapes, such as handwritten digits, especially in convolutional neural networks that use local feature detectors replicated across the image.5 For many tasks, however, it proved extremely difficult to optimize deep neural nets with many layers of non-linear features, and a huge number of labeled training cases was required for large neural networks to generalize well to test data.
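The backpropagation procedure described above can be sketched in a few lines. This is a minimal illustration, not code from the article: the layer sizes, learning rate, and toy regression target are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny dense network: input -> tanh hidden layer -> linear output.
X = rng.normal(size=(64, 3))
y = (X[:, 0] * X[:, 1]).reshape(-1, 1)       # a simple nonlinear target

W1 = rng.normal(scale=0.5, size=(3, 8))      # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(8, 1))      # hidden -> output weights
lr, losses = 0.1, []

for _ in range(500):
    h = np.tanh(X @ W1)                      # forward pass
    pred = h @ W2
    err = pred - y
    losses.append(float(np.mean(err ** 2)))

    # Backward pass: the chain rule carries the error derivative
    # back through each layer in turn.
    dpred = 2 * err / len(X)
    dW2 = h.T @ dpred
    dh = dpred @ W2.T
    dW1 = X.T @ (dh * (1 - h ** 2))          # tanh'(a) = 1 - tanh(a)^2

    W2 -= lr * dW2                           # follow the gradient
    W1 -= lr * dW1

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The same two-step pattern (forward pass to compute activations, backward pass to compute weight gradients) scales to the deep convolutional networks the article discusses, which is where the optimization difficulties arise.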

In the 1990s, Support Vector Machines (SVMs)8 introduced a very different way of creating features: the user defines a kernel function that computes the similarity between two input vectors, then a judiciously chosen subset of the training examples is used to create "landmark" features that measure how similar a test case is to each training case. SVMs have a clever way of choosing which training cases to use as landmarks and deciding how to weight them. They work remarkably well on many machine learning tasks even though the selected features are non-adaptive.
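The "landmark" idea can be made concrete with a small sketch. This is not a full SVM solver: instead of selecting a sparse set of support vectors, it uses every training point as a landmark and fits the weights by regularized least squares, so only the kernel-feature construction matches the text. The data, kernel width, and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two Gaussian blobs with labels -1 and +1.
X = np.vstack([rng.normal(-1, 0.5, size=(30, 2)),
               rng.normal(+1, 0.5, size=(30, 2))])
y = np.array([-1.0] * 30 + [+1.0] * 30)

def rbf(a, b, gamma=1.0):
    """Kernel function: similarity of each row of a to each row of b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Each training point acts as a landmark: a case is represented by its
# similarity to every landmark. (An SVM would keep only a weighted
# subset of these landmarks, the support vectors.)
F = rbf(X, X)                                  # samples x landmarks
w = np.linalg.solve(F + 1e-3 * np.eye(len(X)), y)

preds = np.sign(F @ w)
acc = float((preds == y).mean())
print(f"training accuracy: {acc:.2f}")
```

Note that the landmark features themselves never change during training, which is the sense in which the text calls them non-adaptive; only the weights on them are learned.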

The success of SVMs dampened the earlier enthusiasm for neural networks. More recently, however, it has been shown that multiple layers of feature detectors can be learned greedily, one layer at a time, by using unsupervised learning that does not require labeled data. The features in each layer are designed to model the statistical structure of the patterns of feature activations in the previous layer. After learning several layers of features this way without paying any attention to the final goal, many of the high-level features will be irrelevant for any particular task, but others will be highly relevant, because high-order correlations are the signature of the data's true underlying causes and the labels are more directly related to these causes than to the raw inputs. A subsequent stage of fine-tuning using backpropagation then yields neural networks that work much better than those trained by backpropagation alone and better than SVMs for important tasks such as object or speech recognition.1,2,4 The neural networks outperform SVMs because the limited amount of information in the labels is not being used to create multiple layers of features from scratch; it is only being used to adjust the class boundaries by slightly modifying the features.
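The greedy layer-by-layer scheme can be sketched as follows. The article's deep belief nets use restricted Boltzmann machines for each layer; as a simpler stand-in, this sketch trains a tied-weight autoencoder per layer, which captures the same recipe: each layer learns to model its input, and its activations then become the training data for the next layer. All sizes and rates here are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 16))               # unlabeled training data

def train_autoencoder(data, n_hidden, lr=0.05, steps=300):
    """One-hidden-layer autoencoder with tied weights: a simple
    stand-in for the RBM-style unsupervised learner in the text."""
    W = rng.normal(scale=0.1, size=(data.shape[1], n_hidden))
    for _ in range(steps):
        h = np.tanh(data @ W)                # encode
        recon = h @ W.T                      # decode with the same weights
        err = recon - data
        # Gradient of the reconstruction error w.r.t. the tied weights:
        dW = data.T @ ((err @ W) * (1 - h ** 2)) + err.T @ h
        W -= lr * dW / len(data)
    return W

# Greedy stacking: no labels are used at any stage.
W1 = train_autoencoder(X, 8)                 # layer 1 models the raw data
H1 = np.tanh(X @ W1)
W2 = train_autoencoder(H1, 4)                # layer 2 models layer 1's codes
H2 = np.tanh(H1 @ W2)
print(H2.shape)
```

After stacking, the weights W1 and W2 would initialize a feed-forward network that is fine-tuned with backpropagation, which is the combination the text credits for beating backpropagation alone.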

The following paper by Lee et al. is the first impressive demonstration that greedy layer-by-layer feature creation can be applied to large images. To make this work, they had to use replicated local feature detectors, and they had to solve a tricky technical problem in probabilistic modeling of convolutional neural networks. These networks summarize the outputs of nearby copies of the same feature detector by simply reporting their maximum value. For unsupervised learning to work properly, this operation must be given a sensible probabilistic interpretation. The authors solve this problem by using a soft, probabilistic version of the maximum function, and they show that this allows them to learn an impressive feature hierarchy in which the first layer represents oriented edge filters, the second layer represents object parts, and the third represents larger parts or whole objects. They also show their model can combine bottom-up and top-down inference, using more global context to select appropriately between local features.
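The idea of a soft, probabilistic maximum can be illustrated in miniature: the detector units in a pooling block, together with an extra "all off" state, compete in a softmax, so the block reports a distribution that concentrates on the strongest detector rather than a hard maximum. This is a sketch of the idea only, not the authors' exact formulation.

```python
import numpy as np

def prob_max_pool(activations):
    """Soft max-pooling over one block: units plus an 'off' state
    (logit 0) compete in a numerically stable softmax."""
    z = np.concatenate([activations, [0.0]])
    e = np.exp(z - z.max())
    p = e / e.sum()
    return p[:-1], p[-1]        # per-unit "on" probabilities, "off" prob

on, off = prob_max_pool(np.array([0.5, 3.0, -1.0]))
print(on.round(3), round(float(off), 3))
```

Because the result is a proper probability distribution, it supports the bottom-up and top-down probabilistic inference the authors need, while still behaving like a maximum when one detector dominates its block.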

The learning algorithm used by the authors is designed to produce a composite generative model called a "deep belief net,"3 but they perform top-down inference as if it were a different generative model called a "deep Boltzmann machine." They achieve quite good results at image completion, and even better results might be obtained if they fine-tuned their generative model as a deep Boltzmann machine using a recent algorithm developed by Salakhutdinov.7

Machine learning still has some way to go before it can efficiently create the complicated features like SIFT6 used in many leading systems for computer vision. However, this paper should seriously worry those computer vision researchers who still believe that hand-engineered features have a long-term future. Further improvements from unsupervised learning also seem likely: biology tells us that applying high-resolution filters across an entire image is not the best way to use a neural net, even if it has billions of neurons.

References


1. Bengio, Y., Lamblin, P., Popovici, D. and Larochelle, H. Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems 19. B. Schoelkopf, J. Platt, and T. Hoffman, Eds. MIT Press, Cambridge, MA, 2007.

2. Dahl, G., Mohamed, A. and Hinton, G.E. Acoustic modeling using deep belief networks. IEEE Trans. on Audio, Speech, and Language Processing 19, 8 (2011).

3. Hinton, G.E., Osindero, S. and Teh, Y.W. A fast learning algorithm for deep belief nets. Neural Computation 18 (2006).

4. Hinton, G.E. and Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 313 (2006), 504–507.

5. LeCun, Y., Bottou, L., Bengio, Y. and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 11 (1998), 2278–2324.

6. Lowe, D.G. Object recognition from local scale-invariant features. In Proc. International Conference on Computer Vision, 1999.

7. Salakhutdinov, R. Learning Deep Generative Models. PhD thesis, University of Toronto, 2009.

8. Vapnik, V.N. The Nature of Statistical Learning Theory. Springer, New York, NY, 2000.



Geoffrey E. Hinton is a professor of computer science at the University of Toronto, Canada.

©2011 ACM  0001-0782/11/1000  $10.00

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from or fax (212) 869-0481.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2011 ACM, Inc.

