
Technical Perspective: Understanding Pictures of Rooms


The rich world is getting older, so we will see many efforts to build robots that can provide some in-home care for frail people. These robots will need computer programs that can see and understand rooms, because rooms are where people live much of their lives. A robot could measure the shape of a room using, for example, motion or range data. But these cues are not always available: when a robot first opens a door, it should probably determine whether it is about to fall before it moves, so it cannot wait for motion data, and in some applications range data is unavailable too.

But how can we understand a room from a single picture? A picture shows us a two-dimensional projection of a three-dimensional world. This seldom confuses people. We can usually report where objects are in a scene without difficulty, and we can usually get the answer about right, too. We can reason about the empty space we see in pictures, and answer questions like: Would a bed fit here? Is the table far away from the chair? And our answers are usable, if not perfect.

There are now very accurate computer vision methods for reconstructing geometry from videos or from collections of pictures. These methods operate at prodigious scales (substantial fractions of cities have been reconstructed) and with high accuracy. Recovering geometry from a single image still presents important puzzles. A precise representation of all geometric information is likely too much to ask for except in very special cases.

Even rough estimates of geometric information are surprisingly useful. Some years ago, Derek Hoiem and colleagues showed that estimating the horizon in an outdoor scene could improve the performance of a pedestrian detector. This works because real pedestrians have feet below the horizon (they can't levitate). Perspective effects mean the closer a pedestrian's feet are to the horizon, the smaller the image region the pedestrian should occupy. Detector outputs that do not follow this rule are likely wrong, and discarding them significantly improves overall performance.
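To make the geometry concrete, here is a minimal sketch of such a consistency check. It is not Hoiem and colleagues' actual method, and the camera height and the range of plausible pedestrian heights are illustrative assumptions; it only shows why a detection's image height and the position of its feet relative to the horizon must agree.

```python
# Minimal sketch: check a pedestrian detection against an estimated horizon.
# Assumes a pinhole camera with a horizontal optical axis; the camera height
# and the plausible-height range are illustrative, not values from the paper.

def plausible_pedestrian(top_row, bottom_row, horizon_row,
                         camera_height_m=1.5,
                         min_height_m=1.4, max_height_m=2.1):
    """Image rows increase downward, so a pedestrian's feet must lie
    below the horizon row."""
    feet_below_horizon = bottom_row - horizon_row  # pixels below the horizon
    if feet_below_horizon <= 0:
        return False  # feet at or above the horizon: the detection "levitates"
    # For a person standing on the ground plane, image height equals
    # (true height / camera height) * (pixels between feet and horizon),
    # independent of the focal length.
    image_height = bottom_row - top_row
    implied_height_m = camera_height_m * image_height / feet_below_horizon
    return min_height_m <= implied_height_m <= max_height_m

# A 120-pixel-tall detection whose feet are 120 pixels below the horizon
# implies a 1.5 m person: plausible. The same detection with feet only
# 40 pixels below the horizon would imply a 4.5 m giant: discard it.
print(plausible_pedestrian(200, 320, horizon_row=200))  # True
print(plausible_pedestrian(160, 280, horizon_row=240))  # False
```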

What is not yet known is (a) what is useful, (b) what is available, and (c) how to balance errors in the representation recovered. The primary sources of error are bias—where the method cannot represent the right answer, and so must give an answer that is wrong—and variance—where the method could represent the right answer, but becomes overwhelmed by the need to estimate too much information. A representation that tries to recover the depth and the surface normal at every image pixel will likely get most pixels wrong, and so have variance problems, because the image is savagely ambiguous. Representing a room as a box incurs bias; the representation is usually wrong because rooms very often are not exactly boxes, and because there is usually other geometry (beds, chairs, tables) lying around.
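The terms are used informally here, but they track the standard decomposition of expected squared error in statistics. Writing $g$ for the true geometry and $\hat{g}$ for an estimate of it:

$$\mathbb{E}\!\left[(\hat{g} - g)^2\right] = \underbrace{\left(\mathbb{E}[\hat{g}] - g\right)^2}_{\text{bias}^2} \;+\; \underbrace{\mathbb{E}\!\left[\left(\hat{g} - \mathbb{E}[\hat{g}]\right)^2\right]}_{\text{variance}}$$

A per-pixel depth-and-normal map has an enormous number of parameters to estimate, which inflates the variance term; a box has very few, which inflates the bias term.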

For many applications, it is enough to find a box that is a good approximation to the room. Varsha Hedau and colleagues have demonstrated that knowing a fair box estimate makes it easier to detect beds and large items of furniture. Kevin Karsch and colleagues show how to use an approximate box to infer the lighting in a room, and so insert correctly shaded virtual objects into real pictures. Getting a good approximate box is difficult, because the edges and corners of the room are typically hidden by furniture, which is unceremoniously called "clutter" in the literature. So we must fit a box to the reliable features in the image, and discount the furniture when we do so.


Wang, Gould, and Koller's work, detailed in the following paper, is the best current method for doing this. Their method rests on two important points. First, they show how to learn a function that scores room hypotheses, trained so that the best-scoring room tends to be close to the right answer; to fit a room to a new picture, one then searches for the best-scoring room. Second, clutter tends to be consistent in appearance and to sit in consistent places (beds, for example, tend to be on the floor next to a wall). So one can tell which parts of the image to discount when computing the score: those that look like clutter and lie where clutter tends to be.
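A minimal sketch of this fit-by-search pattern appears below. To keep it self-contained, the room hypothesis is reduced to a 2-D rectangle, the clutter cue to a hand-set spatial prior, and the score to a two-weight linear function; the actual method parameterizes a 3-D box via vanishing points and learns its scoring function with latent clutter variables, so everything named here is an illustrative stand-in.

```python
# Illustrative sketch of fit-by-search: enumerate room hypotheses, score each
# while down-weighting pixels likely to be clutter, and keep the best one.
# The 2-D rectangle, spatial clutter prior, and hand-set weights are toy
# stand-ins for the paper's 3-D box and learned latent-clutter model.

import numpy as np

def edge_strength(image):
    """Crude edge map: magnitude of the image gradient."""
    gy, gx = np.gradient(image.astype(float))
    return np.hypot(gx, gy)

def clutter_discount(image):
    """Toy stand-in for 'looks like clutter, sits where clutter tends to be':
    down-weight the lower-central region, where furniture usually appears,
    so room edges hidden there count for less."""
    h, w = image.shape
    rows = np.linspace(0.0, 1.0, h)[:, None]            # 0 at top, 1 at bottom
    cols = np.abs(np.linspace(-1.0, 1.0, w))[None, :]   # 0 at image center
    likely_clutter = rows * (1.0 - cols)                # peaks at bottom center
    return 1.0 - 0.8 * likely_clutter

def score_box(image, box, w=(1.0, -1.0)):
    """Score one hypothesis (x0, y0, x1, y1): reward clutter-discounted edge
    energy on the hypothesized boundaries, penalize it in the interior.
    In the real method these weights are learned, not hand-set."""
    x0, y0, x1, y1 = box
    e = edge_strength(image) * clutter_discount(image)
    boundary = (e[y0, x0:x1].sum() + e[y1 - 1, x0:x1].sum() +
                e[y0:y1, x0].sum() + e[y0:y1, x1 - 1].sum())
    interior = e[y0:y1, x0:x1].mean()
    return w[0] * boundary + w[1] * interior

def fit_box(image):
    """Exhaustive search over a coarse grid of hypotheses; with a learned
    scoring function, the best-scoring box tends to be near the truth."""
    h, wd = image.shape
    candidates = [(x0, y0, x1, y1)
                  for x0 in range(0, wd // 2, 8) for x1 in range(wd // 2, wd, 8)
                  for y0 in range(0, h // 2, 8) for y1 in range(h // 2, h, 8)]
    return max(candidates, key=lambda box: score_box(image, box))
```

With hand-set weights this is only a caricature; the paper's contribution is precisely that the scoring function, including which pixels to treat as clutter, is learned from data.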

Their method scores better than any other on the current standard test of accuracy for estimating rooms. Moreover, as Figures 3 and 4 in the paper illustrate, there is more to come. Knowing which parts of the image are clutter gives us very strong cues to where the furniture is. There will soon be methods that can produce very detailed maps of room interiors, fit for robots to use.
