Vision systems give animals two, somewhat different, kinds of information. The first is a model of the world they see. Our vision systems tell us where free space is (and so, where we could move); what is big and what is small; and what is smooth and what is scratchy.
Research in computer vision has now produced very powerful reconstruction methods. These methods can recover rich models of complex worlds from images and from video, and have had tremendous impact on everyday life. If you have seen a CGI film, you have likely seen representations recovered by one of these methods.
The second is a description of the world in terms of objects at a variety of levels of abstraction. Our vision systems can tell us that something is an animal; that it is a cat; and that it is the neighbor's cat. Computer vision has difficulty mimicking all these skills. We have really powerful methods for classifying images based on two technologies. First, given good feature vectors, modern classifiers—functions that report a class, given a feature vector, and that are learned from data—are very accurate. Second, with appropriate structural choices, one can learn to construct good features—this is the importance of convolutional neural networks. These methods apply to detection, too. One detects an object by constructing a set of possible locations for that object, then passing them to a classifier. Improvements in image classification and detection are so frequent that one can keep precise track of the current state of the art only by haunting arXiv.
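To make the two-step detection recipe concrete, here is a minimal sketch in Python: generate candidate windows, then score each with a classifier. The sliding-window proposal scheme, the window size, and the confidence threshold are deliberately naive stand-ins for the far better proposal methods used in practice, and torchvision's pretrained ResNet-18 is just one possible choice of classifier; none of this is the specific method of any paper discussed here.

```python
# Detection by classification: propose candidate boxes, then classify each.
# Assumes torch, torchvision, and Pillow are installed.
import torch
from torchvision import models, transforms
from PIL import Image

# A pretrained ImageNet classifier stands in for "a good classifier."
classifier = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def detect(image: Image.Image, window=128, stride=64, threshold=0.5):
    """Slide a window over the image; return (box, class_id, score) triples."""
    detections = []
    for top in range(0, image.height - window + 1, stride):
        for left in range(0, image.width - window + 1, stride):
            crop = image.crop((left, top, left + window, top + window))
            with torch.no_grad():
                probs = classifier(preprocess(crop).unsqueeze(0)).softmax(dim=1)
            score, class_id = probs.max(dim=1)
            if score.item() > threshold:
                detections.append(((left, top, window, window),
                                   class_id.item(), score.item()))
    return detections
```

Modern detectors replace the exhaustive window sweep with learned region proposals and share computation across windows, but the underlying structure—locations in, class scores out—is the same.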
There remains a crucial difficulty: What should a system report about an image? It is likely a bad idea to identify each object in the image because there are so many and most do not matter (say, the bolt that holds the left front leg to your chair). So a system should report mainly objects that are important. The system now needs to choose a name for each object that it reports. Many things can have the same name, because a world that consists entirely of wholly distinct things is too difficult to deal with. But the same thing could have many names, and choosing the best name becomes an issue. For example, when you see a swan, being told it is a "bird" is not particularly helpful (chickens are birds too, and quite widely eaten), and you would probably expect better than "Cygnus olor," too, because it might be Cygnus columbianus. But when they see a serval, people who have not encountered one before would feel their vision system was doing its job if it reported "fairly large cat."
Psychologists argue there are basic-level categories that identify the best name for a thing. The choice of a basic-level category for a thing seems to be driven by its shape and appearance. For example, a sparrow and a wren might be in one basic-level bird category, and an ostrich and a rhea would be together in a different one. From the practical point of view, this idea is difficult to use because there is not much data recording what the basic-level categories of particular objects are.
In the following paper, the authors offer a method to determine a basic-level category name for an object in an image. The term one uses should be natural—something people tend to say. For example, one could describe a "King penguin" as such, or as a "seabird," a "bird," or an "animal"; but "penguin" gives a nice balance between precision and generality, and is what most people use. The authors show how to use existing linguistic datasets to score the naturalness of a term. The term one uses should also tend to be correct for the image being described. More general terms are more likely to be correct (one can label pretty much anything "entity"). The authors show how to balance the likely correctness of a term, using a confidence score, with its naturalness.
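A toy sketch of that balance appears below: each candidate term gets a score combining naturalness and classifier confidence, and the highest-scoring term is reported. The candidate list, the numbers, and the simple linear scoring rule are all made up for illustration; the paper's actual scoring function and data sources differ.

```python
# Toy trade-off between naturalness and visual confidence for naming
# a King penguin. All values below are hypothetical.
LAMBDA = 1.0  # weight on classifier confidence relative to naturalness

candidates = {
    # term: (naturalness, classifier confidence) -- made-up numbers
    "king penguin": (0.2, 0.90),
    "penguin":      (0.8, 0.80),
    "seabird":      (0.4, 0.85),
    "bird":         (0.6, 0.95),
    "animal":       (0.5, 0.99),
    "entity":       (0.1, 1.00),  # always correct, never natural
}

def score(term):
    naturalness, confidence = candidates[term]
    return naturalness + LAMBDA * confidence

print(max(candidates, key=score))  # with these numbers: "penguin"
```

Note how the structure captures the trade-off: "entity" maximizes confidence but scores poorly overall, while "penguin" wins by being both natural and likely correct.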
Another strategy to identify basic-level categories is to look at the terms people actually use when describing images. The authors look at captioned datasets to find nouns that occur often. They represent images using a set of visual terms produced by one classifier, then build a second classifier to predict the commonly occurring nouns from those terms. By enforcing sparsity in the second classifier, they require that most terms make no contribution to the predicted noun; as a result, they can see which visual terms tend to produce which nouns (as Figure 6 illustrates, the noun "tree" is produced by a variety of specialized terms to do with vegetation, shrubbery, and so on). The result is an exciting link between terms of art commonly used in computer vision and the basic categories of perceptual psychology.
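The sparsity idea can be sketched with an off-the-shelf L1-penalized logistic regression, in the same spirit as the second classifier: the penalty drives most visual-term weights to zero, and the surviving nonzero weights reveal which terms predict the noun. The synthetic data and the particular penalized model below are assumptions for illustration, not the authors' exact pipeline.

```python
# Sparse noun prediction from visual-term scores, sketched with scikit-learn.
# X holds one row per image and one column per visual term; y marks whether
# a caption for that image uses a target noun (e.g., "tree").
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_images, n_terms = 1000, 200
X = rng.random((n_images, n_terms))          # stand-in visual-term scores
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)    # stand-in "tree" labels

# The L1 penalty zeroes out uninformative terms.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)

# The nonzero weights show which visual terms the noun depends on.
active = np.flatnonzero(clf.coef_[0])
print(f"{len(active)} of {n_terms} terms used:", active[:10])
```

On this synthetic data, the fitted model should keep weight mostly on the two terms that actually determine the label, which is exactly the readability the paper exploits when it traces a noun like "tree" back to vegetation-related visual terms.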