Seeing Is Not Enough

The Mind's Eye project has the ambitious goal of teaching cameras to recognize and name the "nouns" and "verbs" of actions, such as one person giving an object to another person.

A man waits by the public square of a war-torn city. A woman walks by and hands him something, then quickly walks away. He waits there another 15 minutes before someone else gives him another package and sprints off. The subject hurriedly puts together the two pieces, walks to the center of the square, puts down the new assemblage, and leaves.

What are the pieces? What did they compose? Why did the man leave after putting them together? The answers to these questions could presage something horrible—like a bomb explosion—or something ordinary, like a city worker installing signage. A human observer would know to follow up, perhaps by examining the placed object or detaining the man who put it there. But for a surveillance camera, these actions are no more suspicious than those of an ice cream vendor or of kids playing soccer. The camera sees, but it does not understand.

A program from the U.S. governmental agency Defense Advanced Research Projects Agency (DARPA) aims to change that with its Mind’s Eye program, which first sought participants in March 2010, launched that September, and announced its 15 contractors in January 2011. The five-year program provides funding for 12 research teams to develop “fundamental machine-based visual intelligence” (VI), as well as three implementation teams that will integrate VI technologies with portable, camera-bearing, combat-ready unmanned ground vehicles. Funding for the first year totals about $5 million, or an average of about $333,000 per contractor. It will be followed by $10 million the next year and $16 million the next, with further funding to be determined as the program progresses. To ensure the final products have military usefulness, DARPA has engaged the Army Research Laboratory as its “customer” throughout the process.

According to DARPA Program Manager James Donlon, action recognition research has traditionally focused on narrowly defined problems, solved with incrementally higher degrees of performance. By comparison, the Mind’s Eye project aims to markedly advance the field by converting video streams to simple descriptions of the actions they depict. “DARPA’s in the perfect role to identify problems that are almost ridiculously difficult, compared to the current state of the art,” Donlon says. “We know that performance will be lower than you’d have on a set of tightly controlled data. But then the questions are, What did we learn as a result? What needs to be developed next to get better performance?”

Seeing Things as They Are

The Mind’s Eye project requires researchers to attack four tasks: Recognition of actions in a scene; description of the actions being performed; gap-filling to make accurate assumptions of what’s left out of a scene, including predictions of what came before and what will follow after it; and anomaly detection to identify actions that are unusual in the context of the entire video. It builds on past achievements in object recognition—the “nouns” of VI—to establish methods to recognize the “verbs” of action. To focus efforts, DARPA has identified 48 specific verbs of interest such as “approach,” “fly,” and “walk.”

Action recognition is a surprisingly difficult task for a VI system, although humans do it without thinking. First, the system must separate active objects from the background—a task the world makes difficult with such distractions as tree branches blowing in the breeze. Even after a VI system discounts irrelevant movement, the scene could contain multiple active objects that require examination, and the crucial action could depend on one, some, or all of them. As Lauren Barghout, founder of the vision technology firm Eyegorithm, describes it, “You can refer to something as a group of objects or nouns—’two cupcakes’. If you employ the ‘spotlight theory of attention,’ you pay attention to the area they occupy, but you might find that the center of that area is just an empty space. Or you could follow the ‘object-based’ theory, in which case you have to determine whether the ‘object’ is one cupcake or both cupcakes.”

The visual system must also recognize and deconstruct those objects. Takeo Kanade, a professor at Carnegie Mellon University’s Robotics Institute, points out that some objects are easier to recognize than others. “We can train for limited sets of objects pretty well, once we know their expected appearances—the human face, the human body, cars, things like that. But there are lots of objects whose appearance is completely unknown or unspecifiable. Take ‘package’. People who use that word only know that it’s three-dimensional, probably box-like, and made of paper. It tends to be from hand-size to body-size, because we call it something else if it’s bigger.” The human face, on the other hand, is patterned after a well-defined template and is therefore easier for VI systems to identify.

From Object to Actor

Once the VI system recognizes objects with some accuracy, it must understand how their movements and interactions comprise actions. Bruce Draper, an associate professor of computer science at Colorado State University, believes a successful system will need to learn that without explicit training. “You cannot go in and predefine every object, every action, every event,” he says. “You don’t know where the system’s going to be deployed, and the world changes all the time. We need to build systems that learn from watching the video stream without ever being told what’s in it.” The key, he says, is for the system to recognize repeated actions. “If we had only one video of someone throwing a football, there’d be no repeated pattern to learn from,” says Draper. “But with several, it can learn that pattern as unique.”

After the VI system can recognize actions, it still has to name them. NASA Jet Propulsion Laboratory Principal Investigator Michael Burl believes that a VI system should do more than just flash “run” and “approach” on the screen when those actions occur. “We want to go beyond text descriptions by extracting and transmitting a ‘script’ that can be used to regenerate what happened in the video,” he says. To that end, Burl and co-investigator Russell Knight are using planning-executing agent (PEA) graphical models to provide an abstract representation of various behaviors. “Take ‘throw,'” says Burl. “It takes two arguments: The agent doing the throwing, and the object being thrown. For the action itself you’d expect to see transitions between several states, such as a windup, forward motion of the arm, then the concepts of ‘separate’ and ‘fly’ being applied to the object. PEA models are hierarchical so complex actions can be composed from simpler ones. By identifying the PEA models being used by the agents, we obtain a compact, generative script that provides a summary of the full video.”

Beyond the Battlefield

DARPA’s Donlon says the target verbs were chosen to be both relevant and wide-ranging. “Some verbs are what soldiers would need to know on the battlefield, but the list has a lot of diversity,” he notes. In fact, all of the listed verbs have some applicability in non-military situations, raising the question: What will VI technology be like when it reaches the civilian sphere?

“We’re getting a lot of interest in non-military applications” says Draper. “For example, the National Institutes of Health wants to figure out what kids are doing on the play-ground. Not what any individual child is doing, but which pieces of equipment are encouraging them to be active. And there’s an existing surveillance market that detects motion and determines whether it’s caused by a person. That’s wonderful to protect a perimeter, but what if someone grabs their chest and falls down in the middle of a public square? That’s an activity that’s wrong, not just a matter of ‘someone who shouldn’t be there.'”

However, the presence of intelligent cameras in the public sector could have a chilling effect, says Jay Stanley, senior policy analyst in the Speech, Privacy and Technology Program of the American Civil Liberties Union. “It’s fine to use this technology in military applications on overseas battlefields or in certain law enforcement situations where there’s probable cause and a warrant,” he says. “And perhaps it could be used by individuals if it becomes integrated into consumer products, the way face recognition has in a limited way. But there are two classes of concerns. The first is that it works really well, and so intensifies existing concerns about surveillance. The second is that it works very poorly. False alarms can be just as bad for people as accurate ones, and there are all kinds of gradations in what a false positive is. Perhaps the computer correctly interprets your behavior but there’s a perfectly innocent explanation. Or perhaps the computer totally misunderstands your behavior. And there are a lot of points in between.”

Donlon believes the benefits of the Mind’s Eye project will outweigh such risks. “One of the things inspiring about visual intelligence is that, if we can solve this range of tasks on this range of verbs—even partially—there will be a wide range of both military and commercial applications,” he says. “We use one potential military outcome as the ultimate goal, but it’s just one exemplar, really, of the benefit we’ll get. It just seems intuitively obvious that there’s a very rich potential commercial market for smart cameras—for commercial security applications, or for loss prevention in retail. For all of them, attending to alerts and dismissing false alarms is a lot less eyeball-intensive than staring at a video feed 24/7.”

Further Reading

Barghout, L. Empirical data on the configural architecture of human scene perception, Journal of Vision 9, 8, August 5, 2009.

Draper, B. Early results in micro-action detection, http://www.cs.colostate.edu/~draper/newsite/index.php/research/visualintelligence-through-latent-geometryand-selective-guidan/early-micro-actiondetection/, Colorado State University video, January 2011.

Lui, Y., Beveridge, J., and Kirby, M. Action classification on product manifolds, 2010 IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, CA, June 1318, 2010.

O’Hara, S., Lui, Y., and Draper, B.A. Unsupervised learning of human expressions, gestures, and actions, 2011 IEEE Conference on Automatic Face and Gesture Recognition, Santa Barbara, CA, March 2125, 2011.

Poppe, R. A survey on vision-based human action recognition, Image and Vision Computing 28, 6, June 2010.

Figures

Figure. The Mind’s Eye project has the ambitious goal of teaching cameras to recognize and name the “nouns” and “verbs” of actions, such as one person giving an object to another person.

Seeing Things as They Are

From Object to Actor

Beyond the Battlefield

Figures

Seeing Is Not Enough

DOI

October 2011 Issue

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

The Latest from CACM

Shape the Future of Computing

Communications of the ACM (CACM) is now a fully Open Access publication.

Seeing Things as They Are

From Object to Actor

Beyond the Battlefield

Figures

Seeing Is Not Enough

DOI

October 2011 Issue

Related Reading

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

Shape the Future of Computing

Communications of the ACM (CACM) is now a fully Open Access publication.