# Communications of the ACM

Research highlights

# Technical Perspective: Understanding Pictures of Rooms


The rich world is getting older, so we will see many efforts to build robots that can provide some in-home care for frail people. These robots will need computer programs that can see and understand rooms, because rooms are where people live much of their lives. A robot could, for example, use motion or range data to measure the shape of a room. But such data is not always available: when a robot first opens a door, it should determine whether it might fall before it moves, so there is no motion data to use, and in some applications range data is unavailable too.

But how can we understand a room from a single picture? A picture shows us a two-dimensional projection of a three-dimensional world. This seldom confuses people. We can usually report where objects are in a scene without difficulty, and we can usually get the answer about right, too. We can reason about the empty space we see in pictures, and answer questions like: Would a bed fit here? Is the table far away from the chair? And our answers are usable, if not perfect.

There are now very accurate computer vision methods for reconstructing geometry from videos or from collections of pictures. These methods operate at prodigious scales (substantial fractions of cities have been reconstructed) and with high accuracy. Recovering geometry from a single image still presents important puzzles. A precise representation of all geometric information is likely too much to ask for except in very special cases.

Even rough estimates of geometric information are surprisingly useful. Some years ago, Derek Hoiem and colleagues showed that estimating the horizon in an outdoor scene could improve the performance of a pedestrian detector. This works because real pedestrians have feet below the horizon (they can't levitate). Perspective effects mean the closer a pedestrian's feet are to the horizon, the smaller the image region the pedestrian should occupy. Detector outputs that do not follow this rule are likely wrong, and discarding them significantly improves overall performance.
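The horizon constraint can be sketched in a few lines. The sketch below assumes a simple upright pinhole model in which a pedestrian's image height is roughly proportional to how far below the horizon their feet appear; the proportionality constant `k` and the tolerance `tol` are illustrative assumptions, not values from Hoiem et al.'s paper.

```python
def horizon_consistent(foot_row, height_px, horizon_row, k=0.5, tol=0.3):
    """Check a pedestrian detection against the horizon constraint.

    Under an upright pinhole camera, a pedestrian's image height is
    roughly proportional to how far below the horizon their feet
    appear: height ~= k * (foot_row - horizon_row). Image rows grow
    downward, so feet must lie below the horizon (foot_row > horizon_row).
    k and tol are illustrative values, not taken from the paper.
    """
    if foot_row <= horizon_row:          # feet above the horizon: impossible
        return False
    expected = k * (foot_row - horizon_row)
    return abs(height_px - expected) <= tol * expected


def prune_detections(detections, horizon_row, k=0.5, tol=0.3):
    """Discard detector outputs whose size disagrees with their position."""
    return [d for d in detections
            if horizon_consistent(d["foot_row"], d["height"], horizon_row, k, tol)]


# Horizon at row 200; one plausible detection and two impossible ones.
dets = [
    {"foot_row": 400, "height": 100},  # expected 0.5 * 200 = 100: consistent
    {"foot_row": 150, "height": 100},  # feet above the horizon: rejected
    {"foot_row": 210, "height": 150},  # far too large for its position
]
print(prune_detections(dets, horizon_row=200))  # keeps only the first
```

Only detections whose size and position agree survive; in practice the constant relating height to horizon distance would be estimated jointly with the horizon itself.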

What is not yet known is (a) what is useful, (b) what is available, and (c) how to balance errors in the representation recovered. The primary sources of error are bias (where the method cannot represent the right answer, and so must give an answer that is wrong) and variance (where the method could represent the right answer, but becomes overwhelmed by the need to estimate too much information). A representation that tries to recover the depth and the surface normal at every image pixel will likely get most pixels wrong, and so have variance problems, because the image is savagely ambiguous. Representing a room as a box incurs bias; the representation is usually wrong because rooms very often are not exactly boxes, and because there is usually other geometry (beds, chairs, tables) lying around.

For many applications, it is enough to find a box that is a good approximation to the room. Varsha Hedau and colleagues have demonstrated that knowing a fair box estimate makes it easier to detect beds and large items of furniture. Kevin Karsch and colleagues show how to use an approximate box to infer the lighting in a room, and so insert correctly shaded virtual objects into real pictures. Getting a good approximate box is difficult, because the edges and corners of the room are typically hidden by furniture, which is unceremoniously called "clutter" in the literature. So we must fit a box to the reliable features in the image, and discount the furniture when we do so.


The work of Wang, Gould, and Koller, detailed in the following paper, is the best current method for doing this. Their method rests on two important points. First, they show how to learn a function that scores room hypotheses. This function is trained so that the best scoring room will tend to be close to the right answer; to fit a room to a new picture, one then searches for the best scoring room. Second, clutter tends to be consistent in appearance and to occur in consistent places. For example, beds tend to be on the floor next to a wall. So one can tell which parts of the image to discount when computing the scoring function: regions that look like clutter, in places where clutter tends to be.
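The score-and-search idea can be illustrated with a toy sketch. This is emphatically not Wang, Gould, and Koller's learned scoring function; it is a hypothetical stand-in in which room hypotheses are pairs of wall columns on a small grid, and the score simply counts observed edge pixels on the hypothesized walls while discounting pixels flagged as clutter.

```python
import itertools

WIDTH, HEIGHT = 12, 8  # toy image grid

def score_room(left, right, edge_pixels, clutter_pixels):
    """Score a room hypothesis (left/right wall columns) by counting
    observed edge pixels on the hypothesized wall lines, skipping any
    pixel flagged as likely clutter. A stand-in for a learned scorer."""
    score = 0
    for row in range(HEIGHT):
        for col in (left, right):
            if (row, col) in clutter_pixels:
                continue                 # discount likely furniture
            if (row, col) in edge_pixels:
                score += 1
    return score

def fit_room(edge_pixels, clutter_pixels):
    """Exhaustively search wall placements for the best-scoring room."""
    return max(
        itertools.combinations(range(WIDTH), 2),
        key=lambda lr: score_room(lr[0], lr[1], edge_pixels, clutter_pixels),
    )

# True walls at columns 2 and 9. A "bed" occludes the lower half of the
# left wall and contributes spurious edges at column 5.
edges = ({(r, 2) for r in range(4)}
         | {(r, 9) for r in range(HEIGHT)}
         | {(r, 5) for r in range(4, HEIGHT)})
clutter = {(r, c) for r in range(4, HEIGHT) for c in range(2, 7)}

print(fit_room(edges, clutter))  # -> (2, 9): the true walls
```

Without the clutter discount, the spurious edges at column 5 would tie with the partially occluded true wall at column 2; flagging the bed region as clutter breaks the tie in favor of the true walls. The real method replaces the edge count with a learned function and the exhaustive search with a far more efficient one.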

Their method scores better than any other on the current standard test of accuracy for estimating rooms. Moreover, as figures 3 and 4 in the paper illustrate, there is more to come. Knowing what parts of the image are clutter gives us very strong cues to where the furniture is. There will soon be methods that can produce very detailed maps of room interiors, fit for robots to use.

### Author

David Forsyth (daf@illinois.edu) is a professor in the Thomas M. Siebel Center for Computer Science at the University of Illinois, Urbana, IL.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from permissions@acm.org or fax (212) 869-0481.
