Technical Perspective: Progress in Visual Categorization

Posted Sep 1 2013

Our visual system helps us carry out our daily business: walking, driving, reading, playing sports, or socializing. It is difficult to think of an activity that does not depend on vision. Our eyes and brain help us by measuring shapes, trajectories, and distances in the world around us, and by recognizing materials, objects, and scenes. How is this done? Can we reproduce these abilities in a machine?

The following paper by Felzenszwalb et al. describes what is currently the best system for detecting object categories (a pedestrian, a bottle, a cat) in images. Like much work in computer vision, their system builds on insights from a diverse set of areas of science and engineering: biological vision, geometry, signal processing, machine learning, and computer algorithms.

Three ingredients make their system successful. First, objects are described as collections of visually distinctive parts (for example, eyes, nose, and mouth in a face) that appear in a consistent, although not rigid, mutual position, or shape. This idea may be traced back to Fischler and Elschlager,⁶ although considerable effort was needed to make it work in practice; for example, making representations invariant to scale, representing the fact that parts are sometimes occluded and thus invisible, and giving shape and occlusion a probabilistic interpretation.²
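To make the "consistent, although not rigid" arrangement concrete, here is a minimal sketch of how a parts-and-shape model might score one candidate face. It is not the authors' implementation: the function names, the quadratic deformation penalty, the `stiffness` weight, and all the numbers are hypothetical choices made for illustration. Each part contributes its appearance score minus a penalty that grows as the part drifts from its ideal offset relative to the object's root position.

```python
def best_placement(candidates, ideal, stiffness):
    """Pick the candidate location with the best appearance-minus-deformation value.

    candidates maps (x, y) -> appearance score (hypothetical detector output).
    """
    def net(item):
        (x, y), appearance = item
        dx, dy = x - ideal[0], y - ideal[1]
        # Quadratic "spring" penalty: an assumed, illustrative deformation cost.
        return appearance - stiffness * (dx * dx + dy * dy)

    location, appearance = max(candidates.items(), key=net)
    return location, net((location, appearance))


def score_object(root, anchors, appearance, stiffness=0.5):
    """Sum the best net score of every part for an object rooted at `root`."""
    total = 0.0
    placements = {}
    for part, (ax, ay) in anchors.items():
        ideal = (root[0] + ax, root[1] + ay)
        location, value = best_placement(appearance[part], ideal, stiffness)
        placements[part] = location
        total += value
    return total, placements


# Hypothetical toy data: ideal part offsets and a few candidate detections.
ANCHORS = {"left_eye": (-2, -2), "right_eye": (2, -2), "mouth": (0, 3)}
APPEARANCE = {
    "left_eye": {(8, 8): 2.0, (5, 8): 2.5},
    "right_eye": {(12, 8): 1.5, (13, 8): 2.2},
    "mouth": {(10, 13): 1.0},
}

score, placements = score_object((10, 10), ANCHORS, APPEARANCE)
print(placements, round(score, 3))
```

In this toy example the left eye stays at its ideal position even though a stronger response exists three pixels away, while the right eye settles one pixel off its ideal position because the extra evidence there outweighs the small penalty. Scale invariance and occlusion, both mentioned above, are left out of the sketch.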

The second ingredient is representing parts (eyes, among others) using patterns of local orientations in the image. This simple idea makes a big difference. It turns out that orientation is less sensitive to changes in lighting conditions and viewpoint than pixel values. This observation comes from studying biological vision systems⁴ and is the foundation of the most successful descriptors for image patches: shape contexts, SIFT, and HOG.¹,³,⁷ The authors here add one twist to the idea: rather than building detectors based on what the part looks like, it is better to build detectors as discriminative classifiers; that is, optimizing their ability to tell the difference between a given part (for example, the head of a pedestrian) and the environment that typically surrounds it (bookshelves, or the pedestrian's own shoulders and arms).
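The claim that orientation is less sensitive to lighting than pixel values can be checked on a toy patch. The sketch below is a simplified orientation histogram, not SIFT or HOG as published: the function name, the bin count, the central-difference gradients, and the lighting model (a gain and an offset applied to every pixel) are all assumptions made for illustration.

```python
import math


def orientation_histogram(patch, bins=8):
    """Magnitude-weighted histogram of unsigned gradient orientations, L2-normalised."""
    hist = [0.0] * bins
    rows, cols = len(patch), len(patch[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = patch[r][c + 1] - patch[r][c - 1]  # horizontal central difference
            gy = patch[r + 1][c] - patch[r - 1][c]  # vertical central difference
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            angle = math.atan2(gy, gx) % math.pi    # fold direction into [0, pi)
            index = min(int(angle / math.pi * bins), bins - 1)
            hist[index] += magnitude
    norm = math.sqrt(sum(h * h for h in hist))
    return [h / norm for h in hist] if norm else hist


# Hypothetical 5x5 patch containing a vertical edge (dark left, bright right).
PATCH = [[10, 10, 10, 50, 50] for _ in range(5)]
# Same scene under an assumed lighting change: gain of 2 and offset of 30.
RELIT = [[2 * v + 30 for v in row] for row in PATCH]

h_original = orientation_histogram(PATCH)
h_relit = orientation_histogram(RELIT)
print(h_original)
print(h_relit)
```

Every pixel value changes between `PATCH` and `RELIT`, yet the two histograms agree: the offset cancels in the differences and the gain cancels in the normalisation. Real lighting changes are not exactly a gain and an offset, so this shows the idea rather than a guarantee.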

The third ingredient is an efficient search algorithm, originating with Felzenszwalb’s thesis,⁵ which detects an object in a handful of seconds, focusing computation only on the most promising areas of the image.
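The paragraph above does not spell out the search algorithm. One technique commonly used for efficient matching of part-based models is the generalized distance transform, which finds the best part placement for every possible anchor position in linear rather than quadratic time. The one-dimensional sketch below illustrates that technique under stated assumptions (a quadratic deformation cost and hypothetical function names and scores); it should not be read as a description of the authors' code.

```python
import math


def best_part_scores(scores, stiffness=1.0):
    """For each anchor p, return max over q of scores[q] - stiffness * (p - q)**2.

    Uses the lower envelope of parabolas, so the work is linear in len(scores).
    """
    n = len(scores)
    cost = [-s for s in scores]           # maximising a score = minimising a cost
    v = [0] * n                           # positions of parabolas in the envelope
    z = [0.0] * (n + 1)                   # boundaries between envelope segments
    z[0], z[1] = -math.inf, math.inf
    k = 0
    for q in range(1, n):
        while True:
            p = v[k]
            # Intersection of the parabola rooted at q with the one rooted at p.
            s = ((cost[q] + stiffness * q * q) - (cost[p] + stiffness * p * p)) / (
                2 * stiffness * (q - p)
            )
            if s > z[k]:
                break
            k -= 1                        # parabola p is hidden; discard it
        k += 1
        v[k] = q
        z[k] = s
        z[k + 1] = math.inf
    best = []
    k = 0
    for p in range(n):
        while z[k + 1] < p:
            k += 1
        q = v[k]
        best.append(scores[q] - stiffness * (p - q) ** 2)
    return best


def brute_force(scores, stiffness=1.0):
    """Reference answer that tries every (anchor, placement) pair."""
    n = len(scores)
    return [max(scores[q] - stiffness * (p - q) ** 2 for q in range(n)) for p in range(n)]


# Hypothetical appearance scores of one part along a row of pixels.
SCORES = [0.0, 3.0, 0.5, 0.0, 4.0, 1.0, 0.0, 0.0]
fast = best_part_scores(SCORES, stiffness=0.5)
slow = brute_force(SCORES, stiffness=0.5)
print(fast)
```

The brute-force version is included only as a check that both routes give the same answer. On an image, the same one-dimensional pass would be applied along rows and then columns.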

Is detecting visual categories a solved problem? The reader will be amused by how poorly our best algorithms work. A quick perusal of Table 1 in Felzenszwalb et al. will reveal that, on a good day, fewer than half of the people are detected in the PASCAL VOC dataset. Boats and birds are even more difficult to find. This is precisely what makes computer vision an exciting field of research today: there is much progress to be made; we are still a few big ideas away from the ultimate design. Twenty years ago we only had nebulous ideas about how to approach visual categorization, and 10 years ago the performance numbers would probably have been a few percent.

What is missing? Quite a few things; I will mention a few. First of all, our models are purely phenomenological, based on statistics of how objects look in 2D images. We do not take into account 3D geometry, nor the properties and materials of surfaces. Second, today’s goal is to recognize widely different categories: bottle vs. cat vs. person. There is a whole world of fine distinctions, for example, Anopheles vs. Culex mosquito, Siamese vs. Burmese cat. We do not yet know how to handle such fine-grained classifications. Third, people can learn to recognize new categories with just a few training examples; how many femurs does a medical student need to see to learn the category? Our algorithms must see thousands of training examples to become halfway decent. The mother of all challenges is scaling: there are millions of meaningful visual categories to recognize (10⁵ vertebrate species, 10⁷ insect species, not to mention shoes, wristwatches, and handbags). We need to develop systems that can train themselves using information available on the Web, and that can tap into the expertise of knowledgeable humans by asking them intelligent questions.

A growing number of talented researchers are hard at work tackling these questions. It is an exciting moment for computer vision. Stay tuned.

DOI: 10.1145/2500468.2500480

Published: September 1, 2013, in the September 2013 Issue, Vol. 56 No. 9, Page 96
