Communications of the ACM

Contributed articles

Unveiling Unexpected Training Data in Internet Video

multiple video screens, illustration

Credit: Getty Images

One of the most important components of training machine-learning models is data. The amount of training data, how clean it is, its diversity, how well it reflects the real world—all can have a dramatic effect on the performance of a trained model. Hence, collecting new datasets and finding reliable sources of supervision have become imperative for advancing the state of the art in many computer vision and graphics tasks, which have become highly dependent on machine learning. However, collecting such data at scale remains a fundamental challenge.

In this article, we focus on an intriguing source of training data—online videos. The Web hosts a large volume of video content, spanning an enormous space of real-world visual and auditory signals. The existence of such abundant data suggests the following question: How can we imbue machines with visual knowledge by directly observing the world through raw video? There are a number of challenges faced in exploring this question.

First, while the scale of data is useful for building diverse training datasets, the problem becomes one of curation—characterizing the type of videos suited for the task at hand and automating the process of filtering irrelevant and noisy videos at scale. Second, raw videos come with no annotations or labels, except possibly noisy tags and descriptions, and so deriving reliable supervision signals for a given machine-learning task is a fundamental challenge.

For many problems, human labels are the most accurate source of supervision. Indeed, there have been ongoing efforts to generate an ImageNet-equivalent dataset for videos—large-scale, real-world video datasets with ground truth annotations. Such datasets have been mostly generated by collecting videos and manually or semi-manually gathering accurate, human-labeled annotations for various tasks. Examples include activity/action-recognition datasets25,40 and video classification datasets.1,24 Several datasets also contain more complex manual annotations, such as hand contact between objects in a video,18 spatially localized labels such as bounding boxes,22,36 dense object segmentation maps,6,35 and some feature audio-based labels.20 Such human-annotated video datasets are continually growing in number, size, and richness of labels and have been widely used in the research community. However, collecting and annotating natural videos is extremely challenging, requiring great effort to devise dedicated, interactive annotation tools and to perform careful analysis of the precision and quality of the labels.

Moreover, for many tasks, manual annotation may not be feasible. For example, annotating accurate depth maps for a video, or sub-pixel optical flow between two frames of a video, is difficult if not impossible for humans. Synthetic data can address some of these limitations by giving full control over data generation. But this approach assumes access to realistic 3D models of a wide range of scenes, and also requires learned models to generalize from synthetic data to real scenes, which is an uncertain proposition.

In this article, we focus on a different route—learning from videos in a self-supervised manner, that is, without any human labels. In particular, we show that in some cases, and sometimes unexpectedly, certain types of such raw videos can unveil powerful training signals that fit directly with a specific task. Nevertheless, determining the type of videos needed and automatically deriving such supervision signals is often non-obvious and challenging.

Our article highlights three of our recent papers that tackle such challenges for three distinct computer vision and graphics tasks: (1) taking a pair of images of a scene and predicting 3D geometry for synthesizing new views of that scene; (2) predicting dense depth maps from video in challenging scenarios, where both the camera and the people in the scene are freely moving; and (3) an audiovisual speech separation model that takes an ordinary video as input and isolates and enhances the speech of a particular speaker while suppressing all other sounds. Each of these works involves discovery of powerful supervision signals in raw videos and shows insightful creation of new datasets via clever automatic video curation and processing algorithms. The models trained on these new datasets achieve state of the art results and have been successfully applied to real-world test scenarios.

Related work. The work highlighted in this article falls under the umbrella of self-supervised methods that learn from unlabeled video. Such methods use training signals that are either readily available in the videos or can be fully and automatically computed from the video data. For example, video frames have been used as supervision for learning correspondences, for tracking objects,45,46 and for various synthesis tasks, such as generating future frames in a video26,43 or frame interpolation.29

However, the bulk of self-supervised methods do not obtain direct supervision for the task at hand; rather, they supervise an auxiliary task, which in turn allows the model to learn useful video representations.2,11,16,30,38,44,46,47 For example, learning the temporal ordering of frames,16,30 or learning the arrow of time of a video47 (that is, whether a given video is playing forward or backward), allows systems to learn useful representations for action recognition. Despite the rapid progress of such methods, their performance is still inferior to that of supervised methods.

In this article, we review work that can mine Internet video for direct supervision for the task at hand but in a fully automatic manner; that is, the model is trained by directly regressing to the desired unknowns. This approach—deriving "labels" in a self-supervised manner yet training the model in a supervised manner—allows us to achieve state of the art results for various complex tasks.

Learning to Estimate 3D Geometry from Real-Estate Footage

When our 3D world is projected onto 2D images, geometric information is lost. The 3D position of objects in the scene, their 3D structure, or even their depth ordering are unknown. We show how Internet video can be of unexpected use for predicting the lost 3D geometric information from 2D image data.

For instance, consider the problem of computing a depth map from two images of a scene, as illustrated in Figure 1. Normally, if we wanted to apply supervised learning to this task, we would need to collect a dataset of images with their corresponding ground-truth depth maps, for instance, by taking a Microsoft Kinect sensor and scanning a large number of scenes.12 Such data collection is cumbersome and limited: for instance, Kinect sensors produce noisy, incomplete depth maps and do not work outdoors. However, if we change our perspective and make creative use of existing data, we can find surprisingly useful sources of geometric supervision in real-world online video.

Figure 1. Computing a depth map for a static scene from two images. Left: a stereo pair of a still life. Right: a depth map computed for this scene (warmer colors represent nearer points, and cooler colors farther points).

In particular, one application of geometry estimation from images is view synthesis—taking a set of known images of a scene and synthesizing new, unobserved views of the same scene with quality suitable for computer graphics applications such as virtual reality.

We can formulate this view synthesis problem as a machine-learning problem as follows: given three images of a static scene, each from a different and known viewpoint, we select two of the images and use them as input to a deep neural network-based model. We then ask the model to predict the 3D geometry of that scene (for example, in the form of a depth map) from the input image pair, and then use that estimated geometry to render that scene from the perspective of the third camera viewpoint.

The machine-learning model is judged by how well the rendered image matches the actual image. If the predicted 3D-scene model is the output of a convolutional neural network, then we can train that network using the signal arising from that comparison with the ground-truth third image. If we have many such triplets of images across many different scenes, then we can train a network that can generalize to predict good 3D representations from images of any number of new scenes.

Hence, the problem of training such a network reduces to the problem of finding a large and diverse collection of image triplets of static scenes captured from known viewpoints. Previous work has also observed that 3D representations can be learned from imagery alone, but such work has used relatively small amounts of data from, for example, lightfield cameras,23 or has involved proprietary data, such as Google Street View.17 Can we gather suitable data for this task from Internet videos?

At first, this task seems difficult: most online videos feature dynamic scenes (for example, with moving people), not static ones. Dynamic objects violate geometric constraints used to estimate the 3D structure of the scene, thus leading to errors and noise in the predicted geometry. However, we found that we can gather image triplets of static scenes from an unexpected type of video: real-estate footage. Typical real-estate videos feature a series of shots of indoor and outdoor scenes (the interior of a room or stairway, exterior views of a house, footage of the surrounding area, and so on). Shots typically feature smooth camera movement and little or no scene movement. Hence, we built a dataset from thousands of real-estate videos shared on the Web as a large and diverse source of multi-view training imagery.

To build this dataset, which we call RealEstate10K,51 we devised a pipeline for mining suitable clips from YouTube. This pipeline consists of four main steps:

  1. Identifying a set of candidate videos to download.
  2. Running a camera tracker on each video to both estimate an initial camera pose for each frame and to subdivide the video into distinct shots/clips.
  3. Performing a full optimization, known as bundle adjustment, to derive high-quality poses for each clip.
  4. Filtering to remove any remaining unsuitable clips.

The key component is the camera tracker—we use an algorithm called ORB-SLAM2, originally designed for robot localization from video, to estimate the pose of the moving camera for each video frame.31 However, we had to modify this camera tracker to be able to handle effects such as cuts and cross-fades that occur in YouTube videos in the wild. The output of our data-mining pipeline—3D camera poses and a sparse point cloud of the scene—is illustrated in Figure 2.

Figure 2. Illustration of the output of our camera tracking pipeline for a single video clip.

While this figure shows a single tracked camera sequence, we collected thousands of such sequences from real-estate videos at scale. For each tracked sequence, we can sample triplets of frames to train a machine-learning model to perform view synthesis, as described earlier.
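As a rough illustration of how frame triplets might be drawn from one tracked clip, the sketch below picks three distinct frame indices that lie close together in time, then holds one out as the rendering target. The function name `sample_triplet` and the `max_gap` window are hypothetical choices made for this example, not details of the RealEstate10K pipeline.

```python
import random


def sample_triplet(num_frames, max_gap=10, rng=random):
    """Pick three distinct frame indices from a single tracked clip.

    Two indices are returned as model inputs and the third as the
    held-out target view. `max_gap` (an assumed parameter) bounds how
    far apart the three frames may be.
    """
    if num_frames < 3:
        raise ValueError("a clip needs at least three frames")
    start = rng.randrange(0, num_frames - 2)
    # Window of candidate frames, always at least three frames wide.
    end = max(min(num_frames - 1, start + max_gap), start + 2)
    frames = sorted(rng.sample(range(start, end + 1), 3))
    target = rng.choice(frames)
    inputs = [f for f in frames if f != target]
    return inputs, target
```

Because the target may be the middle frame or one of the outer frames, the same sampler covers both interpolation and extrapolation cases.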

Beyond the idea of collecting training data in this way, a key design decision is how we represent the 3D scene for view synthesis. One common approach is to represent the 3D scene as a depth map, or an image representing the distance between the camera and each scene point, as illustrated in Figure 1. Given an image and a depth map, one can use the 3D information in the depth map to reproject the image to new viewpoints. However, a limitation of depth maps is that they only represent foreground scene content visible in the reference view of the depth map, not hidden surfaces that appear when the camera is moved to a new view, for example, the part of the countertop behind the fruit platter in Figure 1.
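To make the reprojection step concrete, here is a minimal sketch for a single pixel under a simple pinhole camera model. It assumes a known focal length and principal point (in pixels) and a new camera that is only translated, not rotated, relative to the reference view; the function name and parameters are illustrative.

```python
def reproject_pixel(u, v, depth, focal, cx, cy, cam_shift):
    """Map pixel (u, v) with known depth into a translated camera.

    Assumes a pinhole camera with focal length `focal` and principal
    point (cx, cy), both in pixels. `cam_shift` = (tx, ty, tz) is the
    new camera center expressed in the reference camera's coordinates;
    rotation is ignored to keep the sketch short.
    """
    # Back-project the pixel to a 3D point in the reference camera frame.
    x = (u - cx) * depth / focal
    y = (v - cy) * depth / focal
    z = depth
    # Express the point relative to the new camera center.
    tx, ty, tz = cam_shift
    xn, yn, zn = x - tx, y - ty, z - tz
    if zn <= 0:
        raise ValueError("point is behind the new camera")
    # Project back onto the new image plane.
    return focal * xn / zn + cx, focal * yn / zn + cy
```

With a focal length of 500 pixels and a sideways shift of 0.1 units, a point at depth 2 moves by 25 pixels while a point at depth 10 moves by only 5, which is the parallax that a depth map encodes.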

Instead, in our work, we use a layered representation called a multiplane image (MPI), so-called because it brings to mind the multiplane camera invented at Walt Disney Studios and used in traditional 2D animation.48 The Disney version of a multiplane camera consists of a stack of planar transparencies arranged at different depths from a camera, each painted with content that should appear at a different depth (for example, a house at a nearby layer, and the moon in a further layer). By moving the transparencies at different speeds relative to the fixed camera, one can give the illusion of a 3D scene, similar to parallax scrolling in video games.

Our MPI scene format is a computational version of this idea, wherein we represent a scene as a set of RGB images with transparency arranged at fixed distances from a reference camera, as illustrated in Figure 3. To render the scene from a new viewpoint, we simply move each image in the MPI a corresponding amount and composite the transformed images in back-to-front order, also shown in Figure 3. MPIs are a very simple and convenient scene representation that has the advantage of being able to represent content hidden from the reference view, due to the use of multiple layers. MPIs can even handle reflective and transparent objects, at least up to a certain amount of camera motion. MPIs are related to other layered representations used in vision and graphics, in particular the "stack of acetates" model introduced by Szeliski and Golland.41
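The rendering step can be sketched for a single pixel ray. The code below assumes the layers have already been moved into the new view and applies the standard "over" operator from the farthest plane to the nearest; it also shows, for a purely sideways camera translation, how many pixels a plane at a given depth would shift. Both helper names are hypothetical.

```python
def plane_shift_px(focal, baseline, plane_depth):
    """Horizontal shift (in pixels) of a fronto-parallel plane when the
    camera translates sideways by `baseline`. Nearer planes shift more."""
    return focal * baseline / plane_depth


def composite_over(layers):
    """Alpha-composite one pixel ray through an MPI.

    `layers` is a list of ((r, g, b), alpha) pairs ordered from the
    farthest plane to the nearest, with color and alpha in [0, 1].
    """
    out = (0.0, 0.0, 0.0)
    for color, alpha in layers:
        # Standard "over" operator: the nearer layer covers what is behind it.
        out = tuple(alpha * c + (1.0 - alpha) * o for c, o in zip(color, out))
    return out
```

For example, an opaque red plane behind a half-transparent blue plane composites to an even mix of red and blue, and a fully transparent front plane leaves the content behind it visible.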

Figure 3. The multiplane image (MPI) scene format.

Figure 4 illustrates our complete pipeline, wherein we train a deep-learning model to predict an MPI from a pair of input images using triplets of video frames as training data. We demonstrate this in an application that we call stereo magnification. The idea is that many modern cellphone cameras have two (or more) cameras that are very close together, for example, 1cm apart. From such closely spaced images, we might want to extrapolate views that are much further apart, for example, to enable a larger head motion in a virtual reality (VR) setting, or to create a stereo pair with the correct eye distance for viewing in 3D glasses. We successfully train a machine-learning model for this task and, even though we train from real-estate footage, we find that our model generalizes well to many other kinds of scenes. Please see our project Web pagea for videos showing continuous view interpolation and extrapolation from two input frames.

Figure 4. Our stereo magnification framework.

While our model generalizes beyond real-estate scenes, one key assumption is that scenes are static. A corresponding crucial challenge is scenes with moving objects and, in particular, people. Estimating 3D information from multiple views of a dynamic scene poses additional challenges, which we address next using another surprising source of data.

Learning the Depth of Moving People by Watching Frozen People

We have shown how online videos of static scenes captured by a moving camera can be processed and leveraged to model the geometry of static scenes via a dedicated, learning-based, view-synthesis framework. We now show that by using a specific type of similar video (static scenes, moving cameras), we can tackle a particularly challenging task: estimating the geometry of dynamic scenes from ordinary videos, that is, when both the camera and the objects in the scene are freely moving. Most existing 3D-reconstruction algorithms assume the same object can be observed from at least two different viewpoints at the same time, which allows the 3D position of points to be computed using triangulation (see Figure 5). This assumption is violated by dynamic objects when captured by a moving camera. As a result, most existing methods either filter out moving objects (assigning them "zero" depth values) or ignore them (resulting in incorrect depth values). Our approach is to avoid imposing such geometric constraints by instead learning geometric priors about the shape and motion of dynamic objects from data.

Figure 5. Left: The traditional stereo setup assumes that at least two viewpoints capture the scene at the same time, and hence the 3D position of points can be computed using triangulation. Right: We consider the setup where both the camera and subject are moving, in which case triangulation is no longer possible since the so-called epipolar constraint does not apply.

While there has been a recent surge in the development of learning-based models for predicting geometry (for example, depth maps) from imagery, most existing methods consider only a single image as input (RGB-to-Depth) or are restricted to static scenes (as with the method we presented earlier). We extend this line of research to predicting geometry of dynamic objects from ordinary videos. More specifically, we consider the problem of predicting dense depth maps from ordinary videos when both the camera and the people in the scene are naturally moving28 (see Figure 6). We focus on humans because they are an interesting subject for augmented reality applications and 3D video effects. Furthermore, human motion is articulated and difficult to model, making it an important challenge to address.

Figure 6. Learning the depth of moving people by watching frozen people.

MannequinChallenge Dataset. Where do we get the data needed to train a depth prediction model that can handle moving people in the scene captured by a single moving camera? Generating high-quality synthetic data in which both the camera and the people in the scene are naturally moving is very challenging. Depth sensors (for example, Kinect) can provide useful data but are typically limited to indoor environments and require significant manual work in capture.

Instead, we derive training data from a surprising source: a category of video in which people freeze in place—often in interesting poses—while the camera operator moves around the scene filming them—attempting the so-called "Mannequin Challenge."49 Many such videos have been created and uploaded since late 2016, and these videos span a wide range of scenes with people of different ages, naturally posing in different group configurations. These videos comprise our new MannequinChallenge (MC) Dataset, which we recently released to the research community.27

To the extent that people succeed in staying still during the videos, these videos are no different from the real-estate videos discussed earlier: we can assume the scenes are static while the camera is moving, in which case multi-view geometric constraints and triangulation-based methods apply. We can therefore obtain accurate camera poses using the same camera-tracking pipeline we described previously, and then obtain accurate depth information through further processing with vision methods known as Multi-View-Stereo (MVS). We illustrate such automatically derived depth data in Figure 7.

Figure 7. MannequinChallenge Dataset: Each example is a frame from a MannequinChallenge video sequence in which the camera is moving but all humans are static. These videos span a variety of natural scenes, poses, and configurations of people.

However, recovering accurate geometry from such raw video is challenging. First, there are videos that are not suitable for training. For example, people may "unfreeze" (start moving) at some point in the video, or the video may contain synthetic graphical elements in the background. Second, such in-the-wild videos often involve camera motion blur, shadows, or reflections. Thus, the raw depth maps estimated by MVS are often too noisy for use in training. To address these challenges, we developed an automatic framework for carefully filtering noisy video clips and individual depth values within frames in each clip (full details are described in Li et al.28). This filtering is a crucial step in generating accurate, reliable supervision signals from raw video data.

Inferring the depth of moving people. Our data provides depth supervision for a moving camera and "frozen" people, but our goal is to handle videos with a moving camera and moving people. We need a machine-learning model that can bridge this gap.

One approach would be to infer depth separately for each frame of the video (such as RGB-to-Depth). We tried this, and while such a model already improves over state of the art single-image methods for depth prediction, this approach disregards depth information about the rigid (static) parts of scenes that can be inferred when considering more than a single frame. To benefit from such information, we design a two-frame model that uses depth information computed from motion parallax, that is, the relative apparent motion of static objects between two different viewpoints. In particular, we first compute 2D-optical flow between each input frame and another frame in the video. This flow field depends on both the scene's depth and the relative position of the camera. However, because the camera positions are known, we can remove their dependency from the flow field, which results in an initial depth map.
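The relationship between flow, camera motion, and depth is easiest to see in a deliberately simplified case. The sketch below assumes the camera moves purely sideways by a known baseline with no rotation, so that the horizontal flow of a static point is inversely proportional to its depth. The general case requires the full relative camera pose; this helper is only an illustration, not the procedure used in the paper.

```python
def depth_from_parallax(flow_x, focal, baseline, min_flow=1e-6):
    """Depth of a static point from its horizontal optical flow.

    Simplifying assumptions (not from the article): the camera
    translates sideways by `baseline` with no rotation, and `focal`
    is the focal length in pixels. Under these assumptions,
    depth = focal * baseline / |flow_x|.
    """
    magnitude = abs(flow_x)
    if magnitude < min_flow:
        # No measurable parallax: the point is effectively at infinity.
        return float("inf")
    return focal * baseline / magnitude
```

For instance, with a focal length of 500 pixels and a baseline of 0.1 units, a flow of 25 pixels corresponds to a depth of 2 units and a flow of 5 pixels to a depth of 10 units. The same formula gives wrong answers for points that move on their own, which is why moving people need separate treatment.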

At test time, since people are moving, the computed depth map would be incorrect in the human regions. We therefore segment and mask out those regions and only supply to the network depth information for the static environment, as illustrated in Figure 8. The network's job is to "inpaint" the depth values for the regions with people and refine the depth elsewhere. Our two-frame model leads to a significant improvement over the RGB-only model for both human and non-human regions.
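The masking step can be sketched on small nested lists standing in for image arrays. The helper name `mask_human_depth` and the fill value of zero are assumptions made for this example; the article does not specify how masked pixels are encoded.

```python
def mask_human_depth(depth, human_mask, fill=0.0):
    """Blank out parallax-based depth wherever a person was segmented.

    `depth` and `human_mask` are same-sized 2D lists; mask entries are
    truthy on human pixels. Masked pixels receive `fill` (an assumed
    placeholder) so that only static-environment depth reaches the
    network, which must then predict depth for the masked regions.
    """
    return [
        [fill if is_human else d for d, is_human in zip(depth_row, mask_row)]
        for depth_row, mask_row in zip(depth, human_mask)
    ]
```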

Figure 8. Our model takes as input an RGB frame, a human segmentation mask, masked depth computed from motion parallax (via optical flow and SfM pose), and an associated confidence map. We ask the network to use these inputs to predict depths that match the ground-truth MVS depth.

Depth prediction results. In Figure 9, we show some examples of our depth-prediction model results on real videos, with comparison to recent state-of-the-art learning-based methods. Our depth maps are significantly more accurate and more consistent over time. Armed with the estimated depth maps, we can produce a range of 3D-aware video effects, including synthetic depth defocusing, generating a stereo video from a monocular one, and inserting synthetic computer-generated (CG) objects into the scene. Our depth maps also provide the ability to fill in holes and disoccluded regions with the content exposed in other frames of the video. Please see our web-page for a full set of results.b

Figure 9. Depth-prediction results on video clips with moving cameras and people.

Looking-to-Listen: Audio-Visual Source-Separation Model

In the previous two sections, we showed how raw online video can provide powerful visual signals that can be used as training data for complex visual tasks. Here, we go beyond visual signals by also leveraging auditory signals found in ordinary video. More specifically, our goal is to tackle the cocktail party problem—isolating and enhancing a single voice of a desired speaker from a mixture of sounds, such as background noise and other speakers.9 Humans can do this very well—we have a remarkable ability to focus our auditory attention on a particular speaker while filtering out all other voices and sounds.9 We want to teach machines this same ability by observing auditory and visual signals in online video.

The key idea of our work, called looking-to-listen, is to use visual signals to help process the audio signal. Intuitively, facial features—such as mouth movements or even facial expressions—should correlate with the sounds produced when that person speaks, which in turn can help to identify and isolate that person's speech signal from a mixture of sounds. To do so, we design and train a joint audiovisual model, where the input to the model is an ordinary video (frames + audio track), and the output is clean speech tracks, one for each person detected in the video. This is the first audio-visual, speaker-independent separation model; that is, the model is trained only once and then can be applied to any speaker at test time.

For many cross-modal tasks, the natural co-occurrence of audio and visual signals in video can readily provide supervision. Examples include learning audio-video representations,3,5,7 cross-modal retrieval,32,33,39 or sound source localization.3,32,37,50 However, in our case, in order to train our model in a supervised manner, we need regular videos with mixed speech and background noise as input—and also ground-truth separated audio tracks for each of the speakers as supervision. Existing video does not provide such supervision and directly recording it at scale would have been difficult. However, we can think of ways to generate the exact training data we need from existing raw online video.

Specifically, we use online videos of talks, lectures, and how-to videos. Many of these videos contain a single, visible speaker with a clean recording of their speech and no interfering sounds. With these clean videos in hand, we can then generate training examples of "synthetic cocktail parties" by mixing clean audio tracks of different speakers and background noise, as illustrated in Figure 10. This allows us to train an audio-visual speech-separation model in a supervised manner by directly regressing to the clean audio tracks for each of the speakers.
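A minimal sketch of building one such synthetic mixture is shown below, with waveforms represented as plain lists of samples. The function name and the `noise_gain` parameter are hypothetical; the point is only that the clean tracks used to build the mixture double as ground-truth targets.

```python
def make_synthetic_mixture(clean_a, clean_b, noise, noise_gain=0.3):
    """Mix two clean speech tracks with scaled background noise.

    All inputs are sequences of audio samples at the same sample rate.
    Returns the mixture (model input) and the two clean tracks,
    trimmed to a common length (regression targets). `noise_gain` is
    an assumed scaling factor for the background noise.
    """
    n = min(len(clean_a), len(clean_b), len(noise))
    mixture = [
        clean_a[i] + clean_b[i] + noise_gain * noise[i] for i in range(n)
    ]
    targets = (list(clean_a[:n]), list(clean_b[:n]))
    return mixture, targets
```

Because each mixture is assembled from known components, no manual labeling is needed to obtain the separated tracks.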

Figure 10. AVSpeech Dataset: We first gathered a large collection of 290,000 high-quality, online public videos of talks and lectures (a). Using audio and video processing, we extracted video segments with clean speech (no mixed music, audience sounds, or other speakers), and with the speaker visible in the frame (b). This resulted in 4,700 hours of video clips, spanning a wide variety of people, languages, and face poses. Each segment contains a single person talking with no background interference. From these clean segments, we then generate training examples of "synthetic cocktail parties" by mixing the audio tracks of different speakers (c).

AVSpeech dataset. We collected a large number of high-quality video clips, each containing a single, visible speaker with a clean recording of his or her speech and no other interfering background sounds. To do so, we started by crawling around 290,000 candidate videos from YouTube channels of lectures, TED Talks, and how-to videos. However, as discussed previously, raw video data never provides clean, perfectly accurate training data. In this case, significant parts of such videos may not be suitable for training, either because of the visual content (for example, shots of the audience, slides, or other visuals that do not include the speaker), or because of the audio content (for example, noisy speech). Avoiding such cases and assembling a large reliable corpus of data is a crucial step that calls for automatic processing that does not rely on human feedback. We achieve that by designing a dedicated filtering mechanism based on both video and audio processing, as described in detail in Ephrat et al.15

After filtering, we obtain roughly 4,700 hours of video segments with approximately 150,000 distinct speakers, spanning a wide variety of people, languages, and face poses. The dataset is available for academic use.14 From these videos, training examples of synthetic cocktail parties are generated by mixing clean audio tracks of different speakers and background noise from the AudioSet dataset,21 as illustrated in Figure 10 (b-c).

Audio-Visual Speech Separation model. With the AVSpeech dataset in hand, we design and train a model to decompose the synthetic cocktail mixture into clean audio streams for each speaker in the video, as illustrated in Figure 11.

Figure 11. Audio-visual speech-separation model: We start by detecting and tracking talking people in an input video and compute a feature (face embedding) for each of the face thumbnails detected in each frame.

The model takes both visual and auditory features as input. For the visual features, we only consider the face region by first detecting all the faces in each frame using an off-the-shelf face detector (for example, the Google Cloud Vision API). For each of the detected face thumbnails, we compute a visual feature. This is done by feeding each face thumbnail to a pre-trained face recognition model and extracting the features from the lowest layer in the network, similar to the one used by Cole et al.10 The rationale is that these features retain information necessary for recognizing millions of faces, while discarding irrelevant variation between images, such as illumination. For audio features, we use complex spectrograms computed by the short-time Fourier transform (STFT) of three-second audio segments.

Our model first processes the visual and auditory signals separately and then fuses them together to form a joint audio-visual representation. With that joint representation, the network learns to output a time-frequency mask for each speaker. The output masks are multiplied by the noisy input spectrogram and converted back to a time-domain waveform to obtain an isolated, clean speech signal for each speaker. The model also outputs one mask for the background interference.
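The masking step can be illustrated on a tiny spectrogram stored as nested lists of complex numbers. This sketch assumes a real-valued mask in the range [0, 1] for simplicity and omits the inverse STFT that would turn the result back into a waveform; the helper name is hypothetical.

```python
def apply_mask(mixture_spec, mask):
    """Multiply a mixture spectrogram by a per-speaker mask.

    `mixture_spec` is a 2D list (time x frequency) of complex STFT
    coefficients and `mask` is a same-sized 2D list of weights. A
    real-valued mask is assumed here; each weight says how much of
    that time-frequency bin belongs to the speaker.
    """
    return [
        [weight * coeff for coeff, weight in zip(spec_row, mask_row)]
        for spec_row, mask_row in zip(mixture_spec, mask)
    ]
```

A weight of 1 keeps a bin unchanged, 0 removes it entirely, and intermediate values attenuate it while preserving its phase.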

During training, the squared L2 error between the clean spectrogram and the predicted spectrogram is used as a loss function to train the network. At inference time, our separation model can be applied to arbitrarily long segments of video and varying numbers of speakers. The latter is achieved by either directly training the model with multiple-input visual streams (one per speaker), or simply by feeding the visual features of the desired speaker to the visual stream. For full details about the architecture and training process, see our full paper.15
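For reference, a squared L2 error between two complex spectrograms can be written in a few lines. This sketch sums over all time-frequency bins without any normalization or weighting, which is an assumption; the paper may scale or compress the spectrograms differently.

```python
def squared_l2_loss(predicted_spec, clean_spec):
    """Sum of squared magnitudes of the per-bin complex differences.

    Both arguments are same-sized 2D lists (time x frequency) of
    complex STFT coefficients. No normalization is applied.
    """
    total = 0.0
    for pred_row, clean_row in zip(predicted_spec, clean_spec):
        for pred, clean in zip(pred_row, clean_row):
            diff = pred - clean
            # |diff|^2 written out to avoid a square root round trip.
            total += diff.real ** 2 + diff.imag ** 2
    return total
```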

Speech separation results. Once trained, our model can be applied to real-world videos with arbitrary speakers. Figure 12 shows representative frames from an assortment of such videos containing heated debates and interviews, noisy bars, and screaming children. For all these challenging videos, our model successfully isolates and enhances the speech of the desired speaker, while suppressing all other sounds. See our project page for a full set of results.c

Figure 12. Speech separation in the wild: Representative frames from natural videos demonstrate our method in various real-world scenarios. All videos and results can be found in the project webpage. The "Undisputed Interview" video is courtesy of Fox Sports.

The "Double Brady" video is a synthetic example, in which we concatenated two different segments from the same video side by side. This is an extremely challenging case because the two speakers are identical (same voice, same appearance), only time delayed. Our audio-visual model successfully achieves a clean separation result in this case and significantly outperforms a state-of-the-art audio-only model. This highlights the utilization of visual information by our model. See our paper for a thorough numerical evaluation and comparison to a state-of-the-art audio-only model.15

This technology has recently launched in YouTube Stories, a sharing platform for short, mobile-only videos. Many of these videos are selfies taken in noisy locations (for example, parties or sporting events) with low-cost cellphone microphones. Our looking-to-listen technology now allows creators to isolate their speech from all other voices and sounds.

Conclusion and Future Work

The contents of online videos depict a wide array of phenomena that can be used to teach machines about our world. With such a rich resource available, part of the creative pursuit of research today involves identifying interesting types of information that can be automatically derived or synthesized from video and devising methods for identifying and processing such data.

While the visual content provided by raw video is enormous, it does not always perfectly represent our true visual world. For example, videos of some actions (for example, "brushing teeth") may be difficult to find, even though they are performed daily by billions of people.15 Nevertheless, even if just a small fraction of videos is suitable for a given task, the sheer quantity of data allows for the creation of novel, real-world datasets—a key intellectual question then becomes finding and leveraging these needles in haystacks for specific tasks.

In this article, we reviewed several recent research efforts that use this insight to automatically source training data from noisy, raw online video for computer vision applications. However, this work is but an initial glimpse into a universe of possibility. We envision a wide spectrum of visual signals that can be automatically derived from Internet video and can be used to teach machines about our world.

For example, we can envision training a model to estimate illumination by identifying videos of the same location taken at different times of day. On the theme of finding scenes under multiple illuminations, prior work in graphics and vision shows the power of reasoning about pairs of images of a scene with and without camera flash.13,34 Can we mine such pairs of frames automatically from video of events like red carpet galas, and learn about shape and appearance?

Another example of finding needles in a haystack is identifying instances of known objects, such as a specific kind of soda can, where all instances have the same shape and material properties. Such objects can be thought of as accidental "light probes" that can reveal aspects of the incoming illumination, and thus provide training signals for scene structure, material properties, and illumination.

Ultimately, we hope that training data derived from raw videos will be expanded to robotics and autonomous navigation, where machines must operate in a rich variety of scenes and perform a wide variety of tasks but explicit training data is in limited supply.


The work highlighted in this article has been done in collaboration with Zhengqi Li, Forrester Cole, Richard Tucker, Ariel Ephrat, Inbar Mosseri, Oran Lang, Kevin Wilson, Avinatan Hassidim, Michael Rubinstein, Ce Liu, and William T. Freeman.


1. Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., and Vijayanarasimhan, S. YouTube-8m: A large-scale video classification benchmark (2016); arXiv:1609.08675.

2. Agrawal, P., Carreira, J., and Malik, J. Learning to see by moving. In Proc. of the 2015 Int. Conf. on Computer Vision, 37–45.

3. Arandjelovic, R. and Zisserman, A. Look, listen and learn. In Proc. of the 2017 Int. Conf. on Computer Vision, 609–617.

4. Arandjelovic, R. and Zisserman, A. Objects that sound. In Proc. of the 2018 European Conf. on Computer Vision, 435–451.

5. Aytar, Y., Vondrick, C., and Torralba, A. SoundNet: Learning sound representations from unlabeled video. Neural Information Processing Systems (2016), 892–900.

6. Caelles, S., Montes, A., Maninis, K-K., Chen, Y., Van Gool, L., Perazzi, F., and Pont-Tuset, J. The 2018 DAVIS challenge on video object segmentation (2018); arXiv:1803.00557.

7. Castrejon, L., Aytar, Y., Vondrick, C., Pirsiavash, H., and Torralba, A. Learning aligned cross-modal representations from weakly aligned data. In Proc. of the 2016 Conf. Computer Vision and Pattern Recognition.

8. Chen, W., Fu, Z., Yang, D., and Deng, J. Single-image depth perception in the wild. Neural Information Processing Systems (2016), 730–738.

9. Cherry, E.C. Some experiments on the recognition of speech, with one and with two ears. The J. Acoustical Society of America (1953).

10. Cole, F., Belanger, D., Krishnan, D., Sarna, A., Mosseri, I., and Freeman, W.T. Synthesizing normalized faces from facial identity features. In Proc. of the 2017 Conf. Computer Vision and Pattern Recognition.

11. Dwibedi, D., Aytar, Y., Tompson, J., Sermanet, P., and Zisserman, A. Temporal cycle-consistency learning. In Proc. of the 2019 Conf. Computer Vision and Pattern Recognition.

12. Eigen, D., Puhrsch, C., and Fergus, R. Depth map prediction from a single image using a multi-scale deep network. Neural Information Processing Systems (2014), 2366–2374.

13. Eisemann, E. and Durand, F. Flash photography enhancement via intrinsic relighting. ACM Trans. Graphics (2004).

14. Ephrat, A., Mosseri, I., Lang, O., Dekel, T., Wilson, K., Hassidim, A., Freeman, W.T., and Rubinstein, M. AVSpeech Dataset (2018);

15. Ephrat, A., Mosseri, I., Lang, O., Dekel, T., Wilson, K., Hassidim, A., Freeman, W.T., and Rubinstein, M. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. ACM Trans. Graphics 37, 4 (2018), 112.

16. Fernando, B., Bilen, H., Gavves, E., and Gould, S. Self-supervised video representation learning with odd-one-out networks. In Proc. of the 2017 Conf. Computer Vision and Pattern Recognition, 3636–3645.

17. Flynn, J., Neulander, I., Philbin, J., and Snavely, N. DeepStereo: Learning to predict new views from the world's imagery. In Proc. of the 2016 Conf. Computer Vision and Pattern Recognition.

18. Fouhey, D.F., Kuo, W., Efros, A.A., and Malik, J. From lifestyle vlogs to everyday interactions. In Proc. of the 2018 Conf. Computer Vision and Pattern Recognition, 4991–5000.

19. Fu, H., Gong, M., Wang, C., Batmanghelich, K., and Tao, D. Deep ordinal regression network for monocular depth estimation. In Proc. of the 2018 Conf. Computer Vision and Pattern Recognition.

20. Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., and Ritter, M. AudioSet: An ontology and human-labeled dataset for audio events. In Proc. of the 2017 ICASSP, 776–780.

21. Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., and Ritter, M. AudioSet: An ontology and human-labeled dataset for audio events. In Proc. of the 2017 ICASSP.

22. Gu, C. et al. AVA: A video dataset of spatio-temporally localized atomic visual actions. In Proc. of the 2018 Conf. Computer Vision and Pattern Recognition, 6047–6056.

23. Kalantari, N.K., Wang, T-C., and Ramamoorthi, R. Learning-based view synthesis for light field cameras. SIGGRAPH ASIA 2016.

24. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. Large-scale video classification with convolutional neural networks. In Proc. of the 2014 Conf. Computer Vision and Pattern Recognition.

25. Kay, W. et al. The Kinetics human action video dataset (2017); arXiv:1705.06950.

26. Kwon, Y-H. and Park, M-G. Predicting future frames using retrospective cycle GAN. In Proc. of the 2019 Conf. Computer Vision and Pattern Recognition, 1811–1820.

27. Li, Z., Dekel, T., Cole, F., Tucker, R., and Snavely, N. MannequinChallenge Dataset (2019);

28. Li, Z., Dekel, T., Cole, F., Tucker, R., Snavely, N., Liu, C., and Freeman, W.T. Learning the depths of moving people by watching frozen people. In Proc. of the 2019 Conf. Computer Vision and Pattern Recognition, 4521–4530.

29. Liu, Z., Yeh, R.A., Tang, X., Liu, Y., and Agarwala, A. Video frame synthesis using deep voxel flow. In Proc. of the 2017 IEEE Intern. Conf. on Computer Vision, 4463–4471.

30. Misra, I., Zitnick, C.L., and Hebert, M. Shuffle and learn: Unsupervised learning using temporal order verification. In Proc. European Conf. on Computer Vision. Springer (2016), 527–544.

31. Mur-Artal, R., Montiel, J.M.M., and Tardós, J.D. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Trans. on Robotics 31, 5 (2015).

32. Owens, A. and Efros, A.A. Audio-visual scene analysis with self-supervised multisensory features. In Proc. of the 2018 European Conf. on Computer Vision.

33. Owens, A., Isola, P., McDermott, J., Torralba, A., Adelson, E.H., and Freeman, W.T. Visually indicated sounds. In Proc. of the 2016 Conf. Computer Vision and Pattern Recognition, 2405–2413.

34. Petschnigg, G., Szeliski, R., Agrawala, M., Cohen, M., Hoppe, H., and Toyama, K. Digital photography with flash and no-flash image pairs. ACM Trans. Graphics (2004).

35. Pont-Tuset, J., Perazzi, F., Caelles, S., Arbeláez, P., Sorkine-Hornung, A., and Van Gool, L. The 2017 DAVIS challenge on video object segmentation (2017); arXiv:1704.00675.

36. Real, E., Shlens, J., Mazzocchi, S., Pan, X., and Vanhoucke, V. YouTube-bounding boxes: A large high-precision human-annotated data set for object detection in video. In Proc. of the 2017 Conf. Computer Vision and Pattern Recognition, 5296–5305.

37. Senocak, A., Oh, T-H., Kim, J., Yang, M-H., and Kweon, I.S. Learning to localize sound source in visual scenes. In Proc. of the 2018 Conf. Computer Vision and Pattern Recognition.

38. Sermanet, P., Lynch, C., Chebotar, Y., Hsu, J., Jang, E., Schaal, S., and Levine, S. Time-contrastive networks: Self-supervised learning from video. In Proc. of the 2018 IEEE Intern. Conf. Robotics and Automation.

39. Soler, M., Bazin, J-C., Wang, O., Krause, A., and Sorkine-Hornung, A. Suggesting sounds for images from video collections. In Proc. of the 2016 European Conf. on Computer Vision. Springer, 900–917.

40. Soomro, K., Zamir, A.R., and Shah, M. UCF101: A dataset of 101 human actions classes from videos in the wild (2012); arXiv:1212.0402.

41. Szeliski, R. and Golland, P. Stereo matching with transparency and matting. Int. J. of Computer Vision (1999).

42. Ummenhofer, B., Zhou, H., Uhrig, J., Mayer, N., Ilg, E., Dosovitskiy, A., and Brox, T. DeMoN: Depth and motion network for learning monocular stereo. In Proc. of the 2017 Conf. Computer Vision and Pattern Recognition.

43. Vondrick, C., Pirsiavash, H., and Torralba, A. Generating videos with scene dynamics. Advances in Neural Information Processing Systems (2016), 613–621.

44. Vondrick, C., Shrivastava, A., Fathi, A., Guadarrama, S., and Murphy, K. Tracking emerges by colorizing videos. In Proc. of the 2018 European Conf. on Computer Vision, 391–408.

45. Wang, N., Song, Y., Ma, C., Zhou, W., Liu, W., and Li, H. Unsupervised deep tracking. In Proc. of the 2019 Conf. Computer Vision and Pattern Recognition, 1308–1317.

46. Wang, X., Jabri, A., and Efros, A.A. Learning correspondence from the cycle-consistency of time. In Proc. of the 2019 Conf. Computer Vision and Pattern Recognition.

47. Wei, D., Lim, J.J., Zisserman, A., and Freeman, W.T. Learning and using the arrow of time. In Proc. of the 2018 Conf. Computer Vision and Pattern Recognition, 8052–8060.

48. Wikipedia. Multiplane camera, 2017;

49. Wikipedia. Mannequin Challenge, 2018;

50. Zhao, H., Gan, C., Rouditchenko, A., Vondrick, C., McDermott, J., and Torralba, A. The sound of pixels. In Proc. of the 2018 European Conf. on Computer Vision, 570–586.

51. Zhou, T., Tucker, R., Flynn, J., Fyffe, G., and Snavely, N. Stereo magnification: Learning view synthesis using multiplane images. ACM Trans. Graphics 37, 4 (2018), 65.


Tali Dekel is a research scientist at Google and an assistant professor on the faculty of Mathematics and Computer Science at the Weizmann Institute of Science, Rehovot, Israel.

Noah Snavely works at Google Research and is an associate professor in the Computer Science Department at Cornell Tech in the Cornell Graphics and Vision Group, New York, NY, USA.





This work is licensed under a

The Digital Library is published by the Association for Computing Machinery. Copyright © 2021 ACM, Inc.

