Your portable phone can beat you at chess, but can it recognize a horse? Bristling with cameras, microphones, and other sensors, today’s machines are nevertheless essentially deaf and blind; they cannot make sense of what their sensors capture. In the meantime, vast amounts of valuable sensory data are captured, transmitted, and inexpensively stored every day. TV programs and movies, fMRI scans, planetary surveys, footage from security cameras, and digital photographs pile up and lie fallow on hard drives around the globe. It is all too much for humans to organize and access by hand. This situation has aptly been called the "data deluge." Automating the process of analyzing sensory data and transforming it into actionable information is one of the most useful and difficult challenges of modern engineering.
How shall we go about building machines that can see, hear, smell, touch? Sensory tasks come in all shapes and forms: reading books, recognizing people, or hitting tennis balls. It is expeditious to approach each one as a separate problem. However, one remarkable fact about our own senses is that they adapt easily to new environments and tasks. Our senses evolved to help us navigate and forage among trees, rocks, and grass, as well as enable us to socialize with people. Despite this history, we can train ourselves to read text, to recognize galaxies in telescope images, and to drive fast-moving vehicles. Discovering general laws and principles that underlie sensory processing might one day allow us to design and build flexible and adaptable sensory systems for our machines.
In the following paper, Torralba, Murphy, and Freeman are concerned with visual recognition. They explore one principle that has general validity: the use of context. The authors propose an elegant and compelling demonstration showing that context is crucial for recognizing an object when the image has poor resolution and, as a result, the object’s picture is ambiguous. That context may be useful in visual recognition is rather intuitive. However, to design a machine that makes use of context we must first define what context is, how exactly one should measure it, and how these measurements may be used to recognize objects.
The context of an object is a rich and complex phenomenon, and it is not easily defined. The identity of the scene (suburban street, kitchen) where the object is found could be thought of as its context. The identity of the surfaces and objects present in the scene (two automobiles, a pedestrian, a fire hydrant, a building’s facade), as well as the mutual position of such surfaces and objects, are also considered context. So, too, are the weather, lighting conditions, time of day, historical period, and other circumstances. Where should one begin? What should one measure? One could worry that the entire problem of vision must be solved before one is able to define and compute context. It is not surprising that most researchers to date have sidestepped this baffling chicken-and-egg issue.
The authors avoid computing explicit scene semantic information. They start instead by considering easy-to-compute, image-like quantities that correlate with context. Inspired by what we know about the human visual system, they compute statistics of the output of wavelet-like linear filters applied to the image. These statistics capture aspects of the scene’s overall structure that, in turn, are indicative of its nature: for example, long and vertical structure in a forest, sparse horizontal structure in open grassland. Filter statistics are thus correlated with scene type. Torralba, Murphy, and Freeman call the ensemble of their measurements "gist," a term used in psychology to denote the overall visual meaning of a scene, which has been shown to be perceived quickly by human observers.1,2
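To make the idea concrete, here is a minimal sketch, in Python, of a gist-like descriptor: a bank of oriented Gabor-style filters is applied to a grayscale image, and the mean response magnitude is recorded over a coarse spatial grid. The filter sizes, the number of orientations, and the 4x4 grid are illustrative assumptions, not the authors’ exact configuration.

```python
# A minimal sketch of a gist-like descriptor, assuming a grayscale image given
# as a 2-D NumPy array. All parameter values are illustrative assumptions.
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(size, wavelength, theta, sigma):
    """Real-valued Gabor-style kernel: an oriented sinusoid under a Gaussian window."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotate coordinates to the filter orientation.
    x_theta = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    carrier = np.cos(2.0 * np.pi * x_theta / wavelength)
    return envelope * carrier

def gist_descriptor(image, n_orientations=6, wavelengths=(4, 8, 16, 32), grid=4):
    """Concatenate mean absolute filter responses over a coarse spatial grid."""
    features = []
    for wavelength in wavelengths:
        for k in range(n_orientations):
            theta = np.pi * k / n_orientations
            kernel = gabor_kernel(size=31, wavelength=wavelength,
                                  theta=theta, sigma=0.5 * wavelength)
            response = np.abs(fftconvolve(image, kernel, mode="same"))
            # Average the response magnitude within each cell of a grid x grid partition.
            h, w = response.shape
            for i in range(grid):
                for j in range(grid):
                    cell = response[i * h // grid:(i + 1) * h // grid,
                                    j * w // grid:(j + 1) * w // grid]
                    features.append(cell.mean())
    return np.array(features)  # length = len(wavelengths) * n_orientations * grid**2

# Example: a random array stands in for a real photograph.
image = np.random.rand(128, 128)
print(gist_descriptor(image).shape)  # (384,)
```

The resulting feature vector says nothing about individual objects; it only summarizes the distribution of oriented structure across the image, which is exactly why it is cheap to compute and yet informative about the type of scene.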
The authors find that, surprisingly, their filter-based gist is rather good at predicting the number of instances of a given object category that might be present in the scene, as well as their likely position along the y-axis. Combining this with information coming from object detectors operating independently at each location produces an overall score for the presence of an object of a given class at location (x, y). This is more reliable than using the detectors alone. It looks like it is finally open season on visual context.
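The combination step can be sketched in the same spirit. The snippet below reweights a per-location detector score map by a Gaussian prior over vertical position, standing in for the gist-based prediction; both the Gaussian form and the simple product combination are assumptions chosen for illustration, not the authors’ actual model.

```python
# A hypothetical sketch of combining local detector scores with a gist-based
# context prior over vertical position. Parameters are illustrative.
import numpy as np

def combine_scores(detector_map, expected_y, sigma_y):
    """Reweight per-location detector scores by a prior over vertical position.

    detector_map : 2-D array of detector scores, indexed by (y, x).
    expected_y   : most likely row for this object class, as predicted from gist.
    sigma_y      : uncertainty of that prediction, in pixels.
    """
    h, w = detector_map.shape
    rows = np.arange(h)
    # Context prior: objects of this class tend to appear near expected_y.
    prior_y = np.exp(-0.5 * ((rows - expected_y) / sigma_y) ** 2)
    prior_y /= prior_y.sum()
    # Broadcast the 1-D prior across columns and combine with the detector scores.
    return detector_map * prior_y[:, None]

# Example: a noisy detector map, with context suggesting the object sits near row 40.
detector_map = np.random.rand(128, 128)
combined = combine_scores(detector_map, expected_y=40, sigma_y=10)
print(np.unravel_index(combined.argmax(), combined.shape))
```

Even this crude combination illustrates the point of the paper: a weak, global cue can suppress confident but implausible detections, such as a "car" found in the sky.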