Communications of the ACM, April 2013 (Vol. 56, No. 4)

Research highlights
# Discriminative Learning with Latent Variables for Cluttered Indoor Scene Understanding

We address the problem of understanding an indoor scene from a single image in terms of recovering the room geometry (floor, ceiling, and walls) and furniture layout. A major challenge of this task arises from the fact that most indoor scenes are cluttered by furniture and decorations, whose appearances vary drastically across scenes and thus can hardly be modeled (or even hand-labeled) consistently. In this paper we tackle this problem by introducing latent variables to account for clutter, so that the observed image is jointly explained by the room and clutter layout. Model parameters are learned from a training set of images that are labeled only with the layout of the room geometry. Our approach makes it possible to account for and infer indoor clutter *without* hand-labeling of the clutter in the training set, which is often inaccurate. Yet it outperforms the state-of-the-art method of Hedau et al.^{7} that requires clutter labels. As a latent variable based method, our approach has the interesting feature that its latent variables correspond directly to a concrete visual concept (clutter in the room) and are thus interpretable.

We address holistic understanding of indoor scenes from a single image. Our model takes an image of an indoor scene as input, and produces the boundaries between the floor, the walls, and the ceiling (we call them the *box layout*), as well as a segmentation of the clutter such as furniture and decorations (Figure 1). Learning the model parameters requires training images with a hand-labeled box layout but not the clutter, significantly reducing the labeling effort.

Holistic scene understanding has attracted much attention in computer vision.^{5,9,10,11} One of the goals is to make use of *image context* to help improve traditional vision tasks, such as object detection. The image context can be represented in many ways. For example, one could use a category label (e.g., street and kitchen), as it imposes a strong prior on the likely and unlikely objects to be detected (e.g., cars on the street). In this paper we focus on indoor scenes, for which we represent the image context by a box layout of the room and clutter. On one hand, such knowledge could be useful as a geometric constraint in a variety of traditional computer vision tasks such as object detection^{14} and motion planning.^{6} On the other hand, it could also be used as input to infer more fine-grained contextual information such as the 3D layout.

Given an image of an indoor scene, we perform two preprocessing steps. First, we detect the *vanishing points*, which play an essential role in characterizing the structure of an indoor scene. Due to the effect of perspective projection, a set of lines that are parallel in 3D either remain parallel in the 2D image or intersect at a common point. The points where sets of 3D parallel lines meet in 2D are called vanishing points. Hedau et al.^{7} observed that most indoor scenes are characterized by three vanishing points. Given these points, we can generate a family of box layouts. Specifically, as shown in Figure 2, we detect long lines in the image and cluster them into three dominant groups corresponding to three vanishing points *vp*_{0}, *vp*_{1}, and *vp*_{2}. Candidate box layouts can be generated by extending two rays from *vp*_{0}, two rays from *vp*_{1}, and connecting the four intersections with *vp*_{2}. Infinitely many candidate box layouts can be generated in this manner, and the task becomes to identify the one that best fits the observed image.
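To make the construction concrete, the following sketch (our own illustrative helpers, not the authors' code) generates candidate box corners by intersecting rays from *vp*_{0} with rays from *vp*_{1} in homogeneous coordinates; connecting these corners toward *vp*_{2} completes a candidate box.

```python
import numpy as np

def line_through(p, angle):
    """Homogeneous line through point p at the given direction angle."""
    p2 = np.asarray(p, float) + np.array([np.cos(angle), np.sin(angle)])
    # a homogeneous line is the cross product of two homogeneous points
    return np.cross([p[0], p[1], 1.0], [p2[0], p2[1], 1.0])

def intersect(l1, l2):
    """Intersection point of two homogeneous lines."""
    pt = np.cross(l1, l2)
    return pt[:2] / pt[2]

def candidate_box(vp0, vp1, angles0, angles1):
    """Corners of a candidate box layout: each corner is the
    intersection of one ray from vp0 with one ray from vp1."""
    rays0 = [line_through(vp0, a) for a in angles0]
    rays1 = [line_through(vp1, a) for a in angles1]
    return [intersect(r0, r1) for r0 in rays0 for r1 in rays1]
```

Sampling the two angle pairs yields the infinite family of candidate boxes mentioned above; scoring each candidate against the image is the subject of the model described next.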

Our method also infers clutter within the scene, as shown in the second row of Figure 1. Reasoning about every pixel as to whether or not it belongs to clutter is computationally expensive. We therefore perform an over-segmentation of the image using the mean-shift algorithm^{2} and reason about the so-called superpixels, that is, small contiguous segments that have been predetermined according to their spatial and appearance properties. We typically have fewer than a hundred segments (superpixels) per image. In reasoning about the clutter we treat each segment as a single entity.
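As a toy illustration (a bare-bones stand-in for the Comaniciu-Meer segmenter,^{2} with made-up feature normalization and bandwidth), pixels can be grouped into superpixel-like segments by running mean shift on (row, column, intensity) features:

```python
import numpy as np

def mean_shift_segments(image, bandwidth=0.3, n_iter=30):
    """Toy mean-shift over-segmentation: each pixel's normalized
    (row, col, intensity) feature is shifted toward its Gaussian-weighted
    local mean; pixels whose modes coincide form one segment."""
    h, w = image.shape
    rr, cc = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # normalize so one bandwidth applies to all feature dimensions
    feats = np.stack([rr / h, cc / w, image], axis=-1).reshape(-1, 3).astype(float)
    modes = feats.copy()
    for _ in range(n_iter):
        d2 = ((modes[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
        wts = np.exp(-d2 / (2 * bandwidth ** 2))
        modes = (wts @ feats) / wts.sum(1, keepdims=True)
    # merge points whose modes converged close together
    labels = np.full(len(modes), -1)
    next_label = 0
    for i in range(len(modes)):
        if labels[i] < 0:
            close = ((modes - modes[i]) ** 2).sum(1) < bandwidth ** 2
            labels[close] = next_label
            next_label += 1
    return labels.reshape(h, w)
```

A real system would use the full appearance features and an efficient mean-shift implementation; the point is only that segments, not individual pixels, become the units of clutter reasoning.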

Building on these preliminaries, we introduce our model, inference, and learning approach in the next section. Experimental results are discussed in Section 3.

**2.1. Model**

We use **x** to denote the input, that is, the image together with its over-segmentation and vanishing points. To capture the appearance of pixels, we use a 21-dimensional vector to represent each pixel, including various color and texture cues. These appearance features are also included in **x**.

As we have mentioned, the box layout is determined by four parameters (two rays sent from *vp*_{0} and two from *vp*_{1}). We therefore parameterize **y** as a four-dimensional vector, one coordinate per ray.

The latent variable ** h** is a binary vector with one entry for each segment in the over-segmented image. The entries indicate whether or not each corresponding segment belongs to the clutter. The variables are called "latent" because they are never observed, that is, we do not require the clutter layout to be labeled in the training images. In fact, drawing the boundaries of the furniture and decorations by hand is not only time-consuming but also ambiguous in many cases. For example, should windows and floor rugs be labeled as clutter? In spite of these difficulties, accounting for clutter appropriately is essential to the success of modeling the scene geometry. This is because the clutter often obscures the geometric structure and occludes boundaries between faces. Moreover, appearance and layout of clutter can vary drastically across different scenes, so it is extremely difficult (if not impossible) to model it consistently. Our latent variable approach effectively addresses this issue.

Our energy-based model measures the consistency among **x**, **y**, and **h** through an energy function of the form **E**(**x**, **y**, **h**) = **w**^{T}Ψ(**x**, **y**, **h**) + **E**^{0}(**x**, **y**, **h**); inference jointly minimizes this energy over the box layout **y** and the clutter labeling **h**.

The joint feature mapping Ψ(**x**, **y**, **h**) ∈ R^{n} is a vector that contains many features taking into account image cues from various aspects, including color, texture, perspective consistency, and overall layout. The parameter vector **w** specifies the weights of these features and is learned from the labeled training set.

The prior energy term **E**^{0} consists of two parts, an appearance consistency term and a clutter penalty term: **E**^{0} = λ^{a}**E**^{a} + λ^{c}**E**^{c}.

Here **E**^{a} summarizes the appearance variance of each major face excluding all clutter segments. This encodes the prior belief that the major faces should have a relatively consistent appearance after the clutter is taken out. Specifically,

**E**^{a} = Σ_{f∈F} Σ_{a∈A} Σ_{p∈f} **1**(*p*_{h} = 0)(*p*_{a} − μ_{f,a})^{2},

where F is the set of five major faces: *floor, ceiling, left wall, front wall, right wall*; A is the set of 21 appearance features; and **1**(·) is the indicator function returning one if its argument is true and zero otherwise. Here *p* is a pixel; *p*_{a} is its appearance feature value indexed by *a*; μ_{f,a} is the average of that appearance feature value within face *f*; and *p*_{h} is the value of the latent variable at that pixel. We define *p*_{h} = 0 to mean that pixel *p* is *not* clutter. Note, however, that this term could be minimized by assigning almost everything as clutter and leaving only a tiny uniform piece with very consistent appearance. To avoid such degenerate solutions we introduce the second term:

**E**^{c} = Σ_{f∈F} exp(γ · *r*_{f}),

where *r*_{f} is the fraction of pixels in face *f* assigned to clutter. This term penalizes the "clutterness" of each face. We adopt the exponential form because it imposes a superlinear penalty as the percentage of clutter increases. We will address how to determine the parameters γ, λ^{a}, and λ^{c}, as well as **w**, in Section 2.3.
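Assuming per-pixel face and clutter assignments, the two prior terms can be sketched as follows; the weights and exponential rate here are illustrative stand-ins for the cross-validated parameters, not the paper's actual values:

```python
import numpy as np

def prior_energy(face_of_pixel, feats, clutter, lam_a=1.0, lam_c=1.0, rate=4.0):
    """Sketch of the prior term E^0 = lam_a * E^a + lam_c * E^c.
    face_of_pixel: (n,) face index per pixel; feats: (n, A) appearance
    features; clutter: (n,) bool latent labels (True = clutter)."""
    e_a, e_c = 0.0, 0.0
    for f in np.unique(face_of_pixel):
        in_face = face_of_pixel == f
        keep = in_face & ~clutter
        if keep.any():
            mu = feats[keep].mean(axis=0)            # per-face non-clutter mean
            e_a += ((feats[keep] - mu) ** 2).sum()   # appearance variance term
        frac = clutter[in_face].mean()               # clutter fraction of face
        e_c += np.exp(rate * frac) - 1.0             # superlinear clutter penalty
    return lam_a * e_a + lam_c * e_c
```

Marking an outlier pixel as clutter removes its contribution to E^a but raises E^c, which is exactly the trade-off that rules out the degenerate "everything is clutter" solution.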

The features in Ψ are defined in a similar spirit: capturing the synergy of **y** and **h** in explaining the observed image **x**.

We have features to account for face boundaries. Ideally the boundaries between the five major faces should either be explained by a long line or occluded by some furniture. Therefore we introduce two features for each boundary: the percentage of its length in clutter regions (i.e., occlusion), and the percentage of its length that approximately overlaps with a detected line segment.

We also have features for perspective consistency, which we adopt from Hedau et al.^{7} Note that the lines in the image fall into three groups corresponding to the three vanishing points (as in Figure 2). For each face, we are more likely to observe lines from two of the three groups. For example, on the front wall we are more likely to observe lines belonging to *vp*_{0} and *vp*_{1}, but not *vp*_{2}. We capture this with features that quantify the total length of line segments from each of the three groups in each of the five faces, keeping separate feature values for clutter and non-clutter regions.
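Schematically, assuming line detections have already been assigned to a face, a vanishing-point group, and a clutter or non-clutter region (a simplified stand-in for the real detection pipeline), the accumulation might look like:

```python
import numpy as np

def perspective_features(segments, n_faces=5, n_groups=3):
    """Total detected line length per (face, vanishing-point group),
    kept separately for clutter and non-clutter regions.
    `segments` is a list of (face, group, length, in_clutter) tuples."""
    feat = np.zeros((n_faces, n_groups, 2))
    for face, group, length, in_clutter in segments:
        feat[face, group, int(in_clutter)] += length
    return feat.ravel()  # 5 * 3 * 2 = 30 feature values
```

Each of the 30 resulting values then gets its own learned weight in **w**, so the model can, for instance, reward *vp*_{0}/*vp*_{1} lines on the front wall while discounting lines inside clutter regions.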

In addition we have features that measure cross-face differences. For each of the 21 appearance features, we compute the difference between each pair of adjacent faces, excluding clutter. Finally, we have features for some global properties of the box layout. For each of the five major faces, we use a binary feature indicating whether or not it is absent, and its percentage area in the image. For each of the four parameters of **y**, we compute its likelihood under a prior distribution *p*_{0}(**y**) as bias features.

**2.2. Approximate inference**

For now we assume that all model parameters are fixed and we want to solve the minimization problem (2). Because the joint feature mapping and the prior energy **E**^{0} are defined in a rather complex way, the problem cannot be solved analytically. Our inference procedure is based on ICM (Iterated Conditional Modes^{1}), as shown in Algorithm 1. The basic idea is to iteratively perturb **h** and **y**, accepting each local change that decreases the energy, until no further improvement is found.

The complexity of the algorithm depends on a number of design choices. For example, a larger number of segments (the dimensionality of **h**) may model the clutter at a finer scale but could make inference slower to converge, as introducing more latent variables generally increases the number of iterations required by the ICM algorithm. In our experimental setting, the inference running time is typically one or two minutes for each image.
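The flavor of the ICM procedure can be sketched with a generic coordinate-descent skeleton (illustrative names and moves, not the paper's actual Algorithm 1):

```python
import numpy as np

def icm(energy, y0, h0, y_moves, max_iter=100):
    """ICM-style coordinate descent: alternately perturb the box
    parameters y (by small candidate moves) and flip single clutter
    bits in h, keeping any change that lowers the energy.
    `energy(y, h)` is a user-supplied energy function."""
    y, h = np.array(y0, float), np.array(h0, bool)
    best = energy(y, h)
    for _ in range(max_iter):
        improved = False
        for d in y_moves:                 # perturb box parameters
            cand = y + d
            e = energy(cand, h)
            if e < best:
                y, best, improved = cand, e, True
        for i in range(len(h)):           # flip one latent bit at a time
            h[i] = ~h[i]
            e = energy(y, h)
            if e < best:
                best, improved = e, True
            else:
                h[i] = ~h[i]              # revert the flip
        if not improved:                  # local minimum reached
            break
    return y, h, best
```

Since ICM only finds local minima, multiple random restarts over the initial (y, h) are used in practice, as in the experiments below.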

In our experiments, we also compare to a baseline inference method that does not make use of the continuous parametrization of **y**. Specifically, we independently generate a large number of candidate boxes from the vanishing points, evaluate each candidate independently, and keep the one with the lowest energy.

**2.3. Parameter learning**

First, consider the parameters in the prior term **E**^{0}. When we introduced the latent variables, we intended them to account for *clutter* such as chairs, desks, sofas, *etc*. However, the algorithm has no access to any supervised information on the latent variables. Given the limited training data, it is hopeless to expect the learning process to discover the concept of *clutter* by itself. To tackle this, we introduce the prior term **E**^{0} to capture the concept of clutter and constrain the learning process. Specifically, the parameters in **E**^{0}, namely λ^{a}, λ^{c}, and γ, are determined by cross-validation on the training set and fixed throughout the learning process.

Given the training set of images with hand-labeled box layouts {(**x**_{i}, **y**_{i})}, we learn the parameters **w** discriminatively using struct-SVM:

min_{**w**} ½||**w**||^{2} + *C* Σ_{i} ξ_{i}, s.t. min_{**h**} **E**(**x**_{i}, **y**, **h**) − min_{**h**} **E**(**x**_{i}, **y**_{i}, **h**) ≥ Δ(**y**, **y**_{i}) − ξ_{i}, ∀*i*, ∀**y**, (7)

where Δ(**y**, **y**_{i}) is the loss function that measures the difference between a candidate output **y** and the ground truth **y**_{i}.

Roughly speaking, the learning objective states that the true box layout **y**_{i}, when accompanied by a proper clutter estimation **h**_{i}, should have lower energy than any other box-clutter configuration, by a margin that scales with the loss.

Optimizing the learning objective is difficult because the number of constraints in equation (7) is infinite. Even if we discretize the parameter space of **y** in some way, the total number of constraints is still intractably large, and each constraint involves an embedded inference problem for the latent variables. Generally this is tackled by the cutting-plane method, which repeatedly finds the most violated constraint via loss-augmented inference,

(ŷ_{i}, ĥ_{i}) = argmin_{**y**, **h**} [**E**(**x**_{i}, **y**, **h**) − Δ(**y**, **y**_{i})], (8)

where the latent variables **h**_{i} on the ground-truth side of the constraint should take the value that best explains the ground truth box layout under the current model parameters:

**h**_{i} = argmin_{**h**} **E**(**x**_{i}, **y**_{i}, **h**). (9)

Equations (8) and (9) are solved by the same inference method as the one introduced in Section 2.2. However, we use a looser convergence criterion to speed up loss-augmented inference, as it has to be performed a large number of times during learning. The overall learning algorithm (following Tsochantaridis et al.^{17}) is shown in Algorithm 2.
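The alternation between imputing latent variables and updating **w** can be caricatured as follows. This is a much-simplified subgradient variant over a finite candidate set, standing in for the cutting-plane solver of Tsochantaridis et al.;^{17} all names are illustrative, and energies are modeled as w · psi (lower is better):

```python
import numpy as np

def latent_ssvm(psi, candidates, loss, data, dim, lr=0.1, rounds=20):
    """Toy latent struct-SVM loop: alternately (a) impute each latent
    h_i that best explains the ground truth under the current w, and
    (b) take a step on w whenever the loss-augmented margin is violated.
    psi(x, y, h) -> feature vector; candidates(x) -> list of (y, h)."""
    w = np.zeros(dim)
    for _ in range(rounds):
        for x, y_true in data:
            # (a) impute latent variables for the ground-truth layout, eq. (9)
            h_true = min((h for y, h in candidates(x) if y == y_true),
                         key=lambda h: w @ psi(x, y_true, h))
            # (b) loss-augmented inference: most violating (y, h), eq. (8)
            y_bad, h_bad = min(candidates(x),
                               key=lambda yh: w @ psi(x, *yh) - loss(yh[0], y_true))
            margin = (w @ psi(x, y_true, h_true)
                      - w @ psi(x, y_bad, h_bad) + loss(y_bad, y_true))
            if margin > 0:                # constraint violated: update w
                w -= lr * (psi(x, y_true, h_true) - psi(x, y_bad, h_bad))
    return w
```

The real learner replaces the finite candidate enumeration with the ICM-based inference of Section 2.2 and the subgradient step with cutting-plane updates, but the impute-then-update structure is the same.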

We experimentally verified our method on the dataset introduced by Hedau et al.^{7} The dataset consists of 314 images, each with hand-labeled box layout and clutter. It also specifies the training-test split (209 for training, 105 for testing), which we use for reporting results. Performance is measured by the pixel error rate: the percentage of pixels misclassified when each pixel is assigned to one of the five face classes. Our approach achieves an error rate of 20.1% *without* clutter labels, compared to 26.5% in Hedau et al.^{7} without clutter labels and 21.2% with clutter labels. Details are shown in Table 1.

Each row in Table 1 shows a different performance metric and each column represents a different algorithm. Briefly: *Row 1*: pixel error rate. *Rows 2 and 3*: the number of test images (out of 105) with pixel error rate under 20% and 10%, respectively. *Column 1*: Hoiem et al.'s algorithm. *Column 2*: Hedau et al.'s method without clutter labels. *Column 3*: Hedau et al.'s method with clutter labels. The first three columns are copied directly from Hedau et al.^{7} *Column 4*: our method (without clutter labels). The remaining columns are explained below.
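The pixel error rate reported throughout Table 1 is straightforward to compute; a minimal implementation:

```python
import numpy as np

def pixel_error_rate(pred, truth):
    """Fraction of pixels whose predicted face label (floor, ceiling,
    left/front/right wall) disagrees with the ground-truth label map."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    return float((pred != truth).mean())
```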

To validate the effect of prior knowledge in constraining the learning process, we remove it by treating the two terms **E**^{a} and **E**^{c} as ordinary features whose weights are learned. The performance of recovering box layouts in this case is shown in Table 1, column 5 (labeled "without prior"). Although the difference between columns 4 and 5 is small, there are many cases where recovering more reasonable clutter does help in recovering the correct box layout.

One typical example is shown in Figure 3. In Figure 3(a), the boundary between the *floor* and the *front wall* (the wall on the right) is correctly recovered even though it is largely occluded by the bed, which is correctly inferred as clutter; the boundary is probably found from the appearance difference between the floor and the wall. In the model learned without prior constraints, however, the bed is regarded as non-clutter, whereas major parts of the floor and walls are inferred as clutter (probably because the term **E**^{c} does not act effectively with the learned weights), and the boundary between the *floor* and the *front wall* is placed incorrectly due to the strong contrast between the white pillow and the blue sheet.

More inference results are shown in Figure 4.

In a second experiment, we fixed the latent variables **h** to be all zeros (i.e., assuming no clutter). The results are shown in column 6 of Table 1.

We also evaluated a supervised learning approach, in which we fix the latent variables **h** to be the hand-labeled clutter layout. The results are shown in column 7 of Table 1.

We compare our inference method (Algorithm 1) to the baseline method (evaluating hypotheses independently) described in Section 2.2. Figure 6 shows the average pixel error rate over the test set versus the number of calls (in log scale) to the joint feature mapping, which is an indication of running time.

The actual running time of the inference algorithm depends on the number of random restarts and the convergence criteria. In our current experiments we use 50 random restarts and run a maximum of 100 inner loop iterations. It takes on average 16 seconds to run inference for one image, about 25% of which is spent performing ICM on the latent variables.

In Figure 7 we show the performance of the learned model on the test set versus the number of learning iterations. Empirically the learning procedure attains a small error rate after a few iterations, and then fluctuates around it due to the approximate loss-augmented inference step of learning.

Our method is closely related to the recent work of Hedau et al.^{7} We adopt their idea of generating box layouts from the vanishing points. However, they use supervised classification of surface labels^{10} to identify clutter (furniture), and use the trained surface label classifier to iteratively refine the box layout estimation. Specifically, in each iteration they use the estimated box layout to add features to supervised surface label classification, and then use the classification result to lower the contribution of "clutter" image regions in estimating the box layout. Thus, their method requires the user to label clutter regions in the training set.

Latent variables have been exploited in the computer vision literature in various tasks such as object detection, recognition and segmentation. They can be used to represent visual concepts such as occlusion,^{18} object parts,^{3} and image-specific color models.^{15} Introducing latent variables into structured prediction was shown to be effective in several applications.^{20} An interesting aspect of our work is that latent variables are used in direct correspondence with a concrete visual concept (clutter in the room).

Since the publication of our initial work,^{19} reasoning about the 3D geometry and semantics of scenes has been further addressed in many recent works such as Geiger et al.,^{4} Gupta et al.,^{6} Hedau et al.,^{8} Lee et al.,^{13} Pepik et al.,^{14} and Tsai et al.^{16} to name a few, which have demonstrated obtaining more fine-grained 3D context information or using it to help other vision tasks such as object detection.

In this paper we addressed the problem of recovering the geometric structure as well as clutter layout from a single image. We used latent variables to account for indoor clutter, and introduced prior terms to define the role of latent variables and to constrain the learning process. Our approach, without using clutter labels in training, outperforms a baseline method that does use them.

This improvement can be attributed to three main technical contributions: (i) we introduce latent variables and prior terms to account for the clutter in a principled manner; (ii) we design a rich set of joint features to capture the compatibility between the image and the box-clutter layouts; and (iii) we perform more efficient and accurate inference by exploiting the parametrization of the "box" space.

Learning latent variable based models is known to be susceptible to local optima of the learning objective. A lesson from this paper that we believe could be useful is that imposing prior knowledge on how the model should function (using fixed prior energy terms, in our case) can help guide the learning process to a desirable region of the parameter space, and thereby give rise to a better model.

This work was supported by the National Science Foundation under Grant No. RI-0917151, the Office of Naval Research under the MURI program (N000140710747) and the Boeing Corporation.

1. Besag, J. On the statistical analysis of dirty pictures. *J. Roy. Stat. Soc. B 48*, 3 (1986), 259–302.

2. Comaniciu, D., Meer, P. Mean shift: A robust approach toward feature space analysis. *IEEE Trans. Pattern Anal. Mach. Intell. 24*, 5 (2002), 603–619.

3. Felzenszwalb, P.F., Girshick, R.B., McAllester, D.A., Ramanan, D. Object detection with discriminatively trained part-based models. *IEEE Trans. Pattern Anal. Mach. Intell. 32*, 9 (2010), 1627–1645.

4. Geiger, A., Wojek, C., Urtasun, R. Joint 3D estimation of objects and scene layout. In *NIPS* (2011), 1467–1475.

5. Gould, S., Fulton, R., Koller, D. Decomposing a scene into geometric and semantically consistent regions. In *Proceedings of the International Conference on Computer Vision (ICCV)* (2009).

6. Gupta, A., Satkin, S., Efros, A.A., Hebert, M. From 3D scene geometry to human workspace. In *Computer Vision and Pattern Recognition (CVPR)* (2011).

7. Hedau, V., Hoiem, D., Forsyth, D. Recovering the spatial layout of cluttered rooms. In *ICCV* (2009).

8. Hedau, V., Hoiem, D., Forsyth, D. Thinking inside the box: Using appearance models and context based on room geometry. In *Computer Vision - ECCV 2010*, K. Daniilidis, P. Maragos, and N. Paragios, eds. Volume 6316 of *Lecture Notes in Computer Science* (2010), Springer, 224–237.

9. Heitz, G., Koller, D. Learning spatial context: Using stuff to find things. In *ECCV* (2008), I: 30–43.

10. Hoiem, D., Efros, A.A., Hebert, M. Recovering surface layout from an image. *Int. J. Comput. Vis. 75*, 1 (Oct. 2007), 151–172.

11. Hoiem, D., Efros, A.A., Hebert, M. Closing the loop in scene interpretation. In *CVPR* (2008), 1–8.

12. Joachims, T., Finley, T., Yu, C.N.J. Cutting-plane training of structural SVMs. *Mach. Learn. 77*, 1 (2009), 27–59.

13. Lee, D., Gupta, A., Hebert, M., Kanade, T. Estimating spatial layout of rooms using volumetric reasoning about objects and surfaces. In *Advances in Neural Information Processing Systems* (2010), J. Lafferty, C.K.I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, eds. Volume 23, 1288–1296.

14. Pepik, B., Stark, M., Gehler, P., Schiele, B. Teaching 3D geometry to deformable part models. In *IEEE Conference on Computer Vision and Pattern Recognition (CVPR)* (2012).

15. Shotton, J.D.J., Winn, J., Rother, C., Criminisi, A. TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. *Int. J. Comput. Vis. 81*, 1 (Jan. 2009), 2–23.

16. Tsai, G., Xu, C., Liu, J., Kuipers, B. Real-time indoor scene understanding using Bayesian filtering with motion cues. In *ICCV* (2011), IEEE, 121–128.

17. Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y. Large margin methods for structured and interdependent output variables. *JMLR 6* (2005), 1453–1484.

18. Vedaldi, A., Zisserman, A. Structured output regression for detection with partial truncation. In *Advances in Neural Information Processing Systems* (2009).

19. Wang, H., Gould, S., Koller, D. Discriminative learning with latent variables for cluttered indoor scene understanding. In *Computer Vision - ECCV 2010*, K. Daniilidis, P. Maragos, and N. Paragios, eds. Volume 6314 of *Lecture Notes in Computer Science* (2010), Springer, 497–510.

20. Yu, C.N.J., Joachims, T. Learning structural SVMs with latent variables. In *ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning* (2009), ACM, New York, NY, USA, 1169–1176.

* This work was done when S. Gould was a Ph.D. candidate at Stanford University.

The original version of this paper was published in *the 11th European Conference on Computer Vision, Part II, LNCS 6312 Proceedings*, Springer.

Figure 1. Output of our method. First row: the inferred box layout illustrated by red lines representing face boundaries. Second row: the inferred clutter layout.

Figure 2. Lower-left: Three groups of lines (shown in R, G, B) corresponding to the three vanishing points. There are also "outlier" lines (shown in yellow). Upper-left: A candidate box layout is generated. Right: Different candidate box layouts (in yellow frames) are generated in the same way, and the hand-labeled true box layout (in green frame).

Figure 3. Example result of recovering box and clutter layout. The clutter layouts are shown by removing all non-clutter segments. (a) Inferred box layout using model learned *with* prior knowledge. (b) Inferred clutter layout using model learned *with* prior knowledge. (c) Inferred box layout using model learned *without* prior knowledge. (d) Inferred clutter layout using model learned *without* prior knowledge.

Figure 4. More results for comparing learning with and without prior constraints. The 1st and 2nd columns are the result of learning with prior constraints. The 3rd and 4th columns are the result of learning without prior constraints. In many cases, recovering more reasonable clutter does help in recovering the correct box layout.

Figure 5. Sample results for comparing the recovered clutter by our method and the hand-labeled clutter in the dataset. The 1st and 2nd columns are recovered box and clutter layouts by our method. The 3rd column (right) is the hand-labeled clutter layouts. Our method usually recovers more objects as "clutter" than people would bother to delineate by hand. For example, the rug with a different appearance from the floor in the 2nd image, paintings on the wall in the 1st, 4th, 5th, and 6th images, and the tree in the 5th image. There are also major pieces of furniture that are missing in the hand-labels but recovered by our method, such as the cabinet and TV in the 1st image, everything in the 3rd image, and the small sofa in the 5th image.

Figure 6. Comparison between the inference method of Algorithm 1 and the baseline inference method that evaluates hypotheses independently.

**©2013 ACM 0001-0782/13/04**

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from permissions@acm.org or fax (212) 869-0481.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2013 ACM, Inc.
