We present an approach for generating face animations from large image collections of the same person. Such collections, which we call photobios, are remarkable in that they summarize a person's life in photos; the photos sample the appearance of a person over changes in age, pose, facial expression, hairstyle, and other variations. Yet, browsing and exploring photobios is infeasible due to their large volume. By optimizing the quantity and order in which photos are displayed and cross dissolving between them, we can render smooth transitions between face pose (e.g., from frowning to smiling), and create moving portraits from collections of still photos. Used in this context, the cross dissolve produces a very strong motion effect; a key contribution of the paper is to explain this effect and analyze its operating range. We demonstrate results on a variety of datasets including time-lapse photography, personal photo collections, and images of celebrities downloaded from the Internet. Our approach is completely automatic and has been widely deployed as the "Face Movies" feature in Google's Picasa.
People are photographed thousands of times over their lifetimes. Taken together, the photos of each person form his or her visual record. Such a visual record, which we call a photobio, samples the appearance space of that individual over time, capturing variations in facial expression, pose, hairstyle, and so forth. While acquiring photobios used to be a tedious process, the advent of photo sharing tools like Facebook coupled with face recognition technology and image search is making it easier to amass huge numbers of photos of friends, family, and celebrities. As this trend increases, we will have access to increasingly complete photobios. The large volume of such collections, however, makes them very difficult to manage, and better tools are needed for browsing, exploring, and rendering them.
If we could capture every expression that a person makes, from every pose and viewing/lighting condition, and at every point in their life, we could describe the complete appearance space of that individual. Given such a representation, we could render any view of that person on demand, in a similar manner to how a lightfield18 enables visualizing a static scene. However, key challenges are (1) the face appearance space is extremely high dimensional, (2) we generally have access to only a sparse sampling of this space, and (3) the mapping of each image to pose, expression, and other parameters is not generally known a priori. In this paper, we take a step toward addressing these problems to create animated viewing experiences from a person's photobio.
A key insight in our work is that cross dissolving well-aligned images produces a very strong motion sensation. While the cross dissolve (also known as cross fade, or linear intensity blend) is prevalent in morphing and image-based-rendering techniques (e.g., Chen and Williams,7 Levoy and Hanrahan,18 Seitz and Dyer22), it is usually used in tandem with a geometric warp, the latter requiring accurate pixel correspondence (i.e., optical flow) between the source images. Surprisingly, the cross dissolve by itself (without correspondence/flow estimation) can produce a very strong sensation of movement, particularly when the input images are well aligned. We explain this effect and prove some remarkable properties of the cross dissolve. In particular, given two images of a scene with small motion between them, a cross dissolve produces a sequence in which the edges move smoothly, with nonlinear ease-in, ease-out dynamics. Furthermore, the cross dissolve can also synthesize physical illumination changes, in which the light source direction moves during the transition.
Our photobios approach takes as input an unorganized collection of photos of a person, and produces animations of the person moving continuously (Figure 1). The method operates best when the photo collection is very large (several hundred or thousand photos), but produces reasonable results for smaller collections. As a special case of interest, we first show results on time-lapse sequences, where the same person is photographed every day or week over a period of years. We then apply the technique to more standard image collections of the same person taken over many years, and also to images of celebrities downloaded from the Internet.
Our approach is based on representing the set of images in a photobio as nodes in a graph, solving for optimal paths, and rendering a stabilized transition from the resulting image sequence. The key issues therefore are (1) defining the edge weights in the graph, and (2) creating a compelling, stabilized output sequence. Our approach leverages automatic face detection, pose estimation, and robust image comparison techniques to both define the graph and create the final transitions. The pipeline is almost entirely automaticthe only step that requires manual assistance is to identify the subject of interest when an image contains multiple people. Automating this step is also likely possible using face recognition techniques.4
We note that graph-based techniques have been used for other animation tasks,2,6,10,16,25 but ours is the first to operate on unstructured photos collections and propose the moving portraits idea. Our ability to operate on unstructured photo collections is a direct result of the maturity of computer-vision-based face analysis techniques that are now in widespread use in the research community and in digital cameras, Google Streetview, Apple's iPhoto, etc. These techniques include face detection, face tagging, and recognition.
The Face Movies feature of Picasa's 3.8 release21 provides an implementation of moving portraits, targeted to personal photos. This feature leverages Picasa's built-in face recognition capabilities, and enables creating a face movie of a person with minimal effort (a single click). Deploying the approach at scale (with photo collections numbering in the tens of thousands) required a number of modifications to our basic algorithm, which we describe.
We represent the photobio as a graph (Figure 2), where each photo is a node, and edge weights encode distance (inverse similarity) between photos. Photo similarity is measured as a function of similarity in facial appearance (expression, glasses, hair, etc.), 3D pose (yaw, pitch), and time (how many hours/days apart). That is, a distance between faces i and j is defined as D(i, j):
where "app" stands for facial appearance distance, and 3D head pose distance is represented by yaw and pitch angles. Next, we define each of these similarity terms and how they are computed.
2.1. Head pose estimation and alignment
To compute face pose (yaw, pitch) distances between photos, we estimate 3D head pose as follows (illustrated in Figure 3). Each photo is preprocessed automatically by first running a face detector5 followed by a fiducial points detector9 that finds the left and right corners of each eye, the two nostrils, the tip of the nose, and the left and right corners of the mouth. We ignore photos with low detection confidence (less than 0.5 in face detection and less than 3 in detection of the fiducial points). Throughout our experiments, despite variations in lighting, pose, scale, facial expression, and identity, this combination of methods was extremely robust for near-frontal faces with displacement errors gradually increasing as the face turns to profile.
The next step is to detect pose, which is achieved by geometrically aligning each detected face region to a 3D template model of a face. We use a neutral face model from the publicly available space-time faces25 dataset for the template. We estimate a linear transformation that transforms the located fiducial points to prelabeled fiducials on the template model, and use RQ decomposition to find rotation and scale. We then estimate the yaw, pitch, and roll angles from the rotation matrix. Given the estimated pose, we transform the template shape to the orientation of the face in the image and warp the image to a frontal pose using point-set z-buffering13 to account for occlusions. This results in a roughly frontal version of the given photo.
The head pose distance between two faces is measured separately for yaw and pitch angles and each of these is normalized using a robust logistic function:
where the logistic function L(d) is defined as L(d) = 1/[1 + exp((d T)/)] with = ln(99). It normalizes the distances d to the range [0, 1] such that the value d = T is mapped to 0.5 and the values d = T ± map to 0.99 and 0.01, respectively.
2.2. Facial appearance similarity
Face appearance, measured in photo pixel values, can vary dramatically between photos of the same person, due to changes in lighting, pose, color balance, expression, hair, etc. Better invariance can be achieved by use of various descriptors instead of raw pixel colors. In particular, Local Binary Pattern (LBP) histograms have proven effective for face recognition and retrieval tasks.1, 14 LBP operates by replacing each pixel with a binary code that represents the relative brightness of each of that pixel's immediate neighbors. Each neighbor is assigned a 1 if it is brighter, or 0 if it is darker than the center pixel. This pattern of 1's and 0's for the neighborhood defines a per pixel binary code. Unlike raw RGB values, these codes are invariant to camera bias, gain, and other common intensity transformations. Additionally, some invariance to small spatial shifts is obtained by dividing the image into a grid of super-pixel cells, and assigning each cell a descriptor corresponding to the histogram of binary patterns for pixels within that cell.1
We calculate LBP descriptors on aligned faces and estimate a separate set of descriptors for the eyes, mouth, and hair regions, where a descriptor for a region is a concatenation of participating cells' descriptors. The regions and example matching results are shown in Figure 4. The distance between two face images i and j, denoted dij, is then defined by 2-distance between the corresponding descriptors. The combined appearance distance function is defined as
where dm,e,h are the LBP histogram distances restricted to the mouth, eyes, and hair regions, respectively, and m,e,h are the corresponding weights for these regions. For example, assigning m = 1 and e = h = 0 will result in only the mouth region being considered in the comparison. In our experiments, we used m = 0.8 and e = h = 0.1.
When time or date stamps are available, we augment Equation (3) with an additional term measuring L2 difference in time.
Each distance is normalized using a robust logistic function L(d).
By constructing a face graph we can now traverse paths on the graph and find smooth, continuous transitions from the still photos contained in a photobio. We find such paths either via traditional shortest path or by greedy walk algorithms on the face graph.
Given any two photos, we can find the smoothest path between them by solving for the shortest path in the face graph. We are interested in finding a path with the minimal cost (sum of distances), which is readily solved using Dijkstra's algorithm. The number of in-between images is controlled by raising the distance to some power: D(i, j). The exponent is used to nonlinearly scale the distances, and provides additional control of step size in the path planning process.
Given any starting point, we can also produce a smooth path of arbitrary length by taking walks on the graph. Stepping to an adjacent node with minimal edge distance generally results in continuous transitions. There are a number of possible ways to avoid repetitions, for example, by injecting randomness. We obtained good results simply by deleting previously visited nodes from the graph (and all of their incident edges). For collections with time/date information, we encourage chronological transitions by preferentially choosing steps that go forward in time.
An important requirement for Picasa's Face Movies implementation is real-time performance, that is, the face movie should start playing almost immediately when the feature is selected. We achieved this goal through a number of optimizations.
First, we simplified the similarity descriptors. Specifically, we observed that the mouth is highly correlated with the eyes and other regions of the face in forming expressions. Hence, we found it sufficient to match only the mouth region. More surprisingly, we found that head orientation is also well correlated with the appearance of the mouth region (producing mouth foreshortening and rotation), eliminating the need to compute pose explicitly. Similarly, we found that using 2D affine image alignment rather than 3D warping produces satisfactory results at lower cost. These optimizations, as well as the use of HOG (Histogram of Oriented Gradients) features8 in place of LBP, significantly reduced matching costs.
By default, Face Movies creates and renders a greedy walk rather than an optimized path, as the former is faster to compute. We accomplish this using a multithreaded approach where one thread computes the image graph on the fly and selects the next image to step to, while the other thread renders the transition between the previous two images. For computing the greedy sequence, we first sort all the images by time, start at the oldest image, and then consider the next 100 images in the sequence, chronologically. We choose the one with the closest similarity as the next image to step to. This procedure repeats until the time window overlaps the most recent photo, at which point we reverse direction, that is, select photos going monotonically back in time. We then continue to oscillate forward and back until all images have been shown. We give preference to starred photos by reducing their edge weights so that they are more likely to be shown toward the beginning of the face movie.
The user can control the length of the movie by specifying the number of photos in the face movie and we find the optimal sequence of desired length via dynamic programming. To achieve reasonable performance (a delay of a few seconds, even for collections of tens of thousands of images), we employed additional optimizations, such as breaking the sequence into separately optimized chunks of 1000 images and sparsifying the graph by considering only 100 neighbors for each image.
Now that we have computed an optimal sequence of still photos, how can we render them as a continuous animation? Although pose-, expression-, and time-aligned, it is still a sequence of independently taken photos; creating a smooth animation requires rendering compelling transitions from one photo to the next. Morphing techniques can produce excellent transitions, but require accurate correspondence between pixels in the images, which is difficult to obtain. A simpler alternative is to use a cross dissolve. The cross dissolve or cross fade transitions between two images (or image sequences) by simultaneously fading one out while fading the other in over a short time interval. Mathematically, the cross dissolve is defined as
where Iin1 and Iin2 are the input images and Iout(t) is the output sequence.
This effect is often combined with geometric warps in morphing,3, 22 and image-based rendering methods,18 to synthesize motion between photos. More surprisingly, the cross dissolve by itself (without correspondence/flow estimation) can produce a very strong sensation of movement, particularly when the input images are well aligned. For example, Figure 5 shows a cross dissolve between two photos of a person's face, in which both the lighting and features appear to move realistically. While it makes sense that warping an image produces a motion sensation, why would motion arise from a simple intensity blend? We explain this effect and prove some remarkable properties of the cross dissolve.
We show that the cross dissolve produces not just the illusion of motion, but true motion; given two images of a scene with small motion between them, a cross dissolve produces a sequence in which image edges move smoothly, with nonlinear ease-in, ease-out dynamics. Furthermore, the cross dissolve can synthesize physical illumination changes, in which the light source direction moves during the transition. We briefly describe these effects here, for further analysis see our SIGGRAPH paper.15
5.1. Image edge motion
Images are composed of edges of different locations, orientations, and frequencies. By modeling the effects of the cross dissolve on edges, we can thereby analyze image motion in general. Image edges arise from rapid spatial changes in intensity in a still photo. Real image edge profiles tend to be smooth rather than discontinuous, due to the optical blurring effects of the imaging process.19, 20 Indeed, the convolution of a step-edge with a Gaussiana blurring kernel is the erf function: . This function is very closely approximated as a segment of a sine curve. We use this sine edge model to prove properties of the cross dissolve and then show the correlation with real image edges.
Consider two sine waves (each represents a different image) where one is a translated (and optionally amplitude-scaled) version of the other. Specifically, we consider sin(mx) and sin(mx + d) so that d is the phase shift (spatial translation) and is the amplitude scale. Cross dissolving these two sine waves produces a sequence of sine waves given as follows:
where t [0, 1] and
Therefore, cross dissolving two sines (image edges) with different phases produces a motion, where the phase k is smoothly interpolated. This simple analysis gives rise to a number of remarkable observations (Figure 6):
Note that our analysis so far is based on a periodic function (sine); however, most edges are not periodic. Periodicity is not necessary, however, as the analysis applies locally. See Kemelmacher-Shlizerman et al.15 for a more detailed analysis of the nonperiodic case.
5.2. Generalizing to 2D
Our analysis naturally generalizes to translations of 2D image edges. In case of 2D edge translation, we simply define our edge profiles in the direction normal to the edge, thus reducing to the 1D case.
We have observed that cross dissolving two edges with different orientations produces a compelling apparent rotation perception (see video for examples), particularly when the orientation change is small and edge is low frequency. This effect is explained further in Kemelmacher-Shlizerman et al.15
5.3. Interpolation of light sources
In addition to edge motion, cross dissolves can also produce very convincing illumination changes in which the light source direction appears to move realistically during the transition. Indeed, we now describe conditions under which a cross dissolve produces physically-correct illumination changes.
An image of a Lambertian object, ignoring shadows, specularities, and inter-reflections, is determined by I = lTn, where is the albedo, l is the lighting direction vector, and n is the surface normal vector. Cross dissolving two such images: (1 t)I1 + tI2, has the effect of interpolating the lighting directions ((1 t)l1 + tl2)Tn. In particular, we can rewrite the image formation equation as I = |l| cos , where is the angle between the surface normal at each point on the surface and lighting direction.
A cross dissolve of two images can be then formulated as
with d being the difference between the surface normal and the two lighting direction angles. Hence, the interpolated pixel is the sum of two shifted cosines, which is also a cosine. In this case, however, the cosine is not in the image plane, but rather defines the variation of the pixel intensity as a function of lighting. The amplitude change cl results in an effective dimming of the light during the transition, with minimum contrast occurring at the midpoint. This dimming effect serves to hide artifacts due to shadows, specular highlights, and other non-Lambertian effects that are not modeled by Equation (8). The cross dissolve thereby hides artifacts in the frames in which they are most likely to appeara remarkable property!
While we are presenting the specific light trajectory of the cross dissolve (two-image) case for the first time, we emphasize that the basic result that image interpolations produce new directional illuminations of Lambertian objects is well known in the computer vision community, going back to the work of Shashua.23
We experimented with datasets downloaded from the Internet, and with personal photo collections. Hundreds of Face Movies can also be found on YouTube, created by users of Picasa.
Most of our results are best viewed in the video (http://grail.cs.washington.edu/photobios/). We present a few example paths in Figure 7. We first experimented with time-lapse photo collections, in which a single person is photographed every week/day over a period of years, and usually include large variations in facial expression, hairstyle, etc. We show an example result on The "Daily Jason" dataset contains 1598 pictures taken almost every day during 5 years. Figure 7(a) shows an optimized paththe end points (marked in red) are chosen by the user in our interface and the intermediate sequence is computed by our method. Note the smooth transitions in mouth expression and eyes.
We have also experimented with personal photo collections: (1) 584 pictures of Amit over 5 years, (2) 1300 pictures of Ariel over 20 years, and (3) 530 photos of George W. Bush taken from the Labeled Faces in the Wild11 collection. In contrast to the time-lapse datasets, the pictures in these three datasets were taken in arbitrary events, locations, with various illumination, resolution, cameras, etc., and are therefore more challenging. Figure 7(b, c, d) show typical results. Note how in all sequences, in addition to smooth transition in facial expression, the pose changes smoothly.
Examples of Face Movies created by people can be found on YouTube, here are links to a couple of our favorites: http://www.youtube.com/watch?v=lydaVvF3fWI and http://www.youtube.com/watch?v=q9h7rGmFxJs.
6.1. Rendering style
For the Picasa implementation, we found that people prefer seeing more of the photos beyond just the cropped faces, as the wider field of view provides more context, and instead of showing photos one at a time, we layer the aligned photos over one another as shown in Figure 8. The user interface also provides the ability to output the movie, upload to the Web, or add audio, captions, and customize appearance in several other ways.
We presented a new technique for creating animations of real people through time, pose, and expression, from large unstructured photo collections. The approach leverages computer vision techniques to compare, align, and order face images to create pleasing paths, and operates completely automatically. The popular photo browsing tool Picasa has an implementation of this approach, known as "Face Movies," which has seen widespread deployment. Key to the success of this method is the use of the cross dissolve, which produces a strong physical motion and illumination change sensation when used to blend well-aligned images. We analyzed this effect and its operating range, and showed that, surprisingly, cross dissolves do indeed synthesize true edge motion and lighting changes under certain conditions.
This work was supported in part by National Science Foundation grant IIS-0811878, the University of Washington Animation Research Labs, Adobe, Google, and Microsoft. We thank the following people for the use of their photos: Amit Kemelmakher, Ariel McClendon, David Simons, Jason Fletcher, and George W. Bush. We also thank Todd Bogdan for his help with the Picasa implementation. Pictures of George W. Bush are used with permission by Reuters (photos 2, 4, 8 in Figure 1; Figure 7(c) photos 1, 4, 5, 8, 10, 11, 12) and by AP Photo (photos 1, 3, 5, 6, 7 in Figure 1; Figure 7(c) photos 2, 3, 6, 7, 9).
11. Huang, G.B., Ramesh, M., Berg, T., Learned-Miller, E. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49. University of Massachusetts, Amherst, 2007.
21. Picasa, 2010. http://googlephotos.blogspot.com/2010/08/picasa-38-face-movies-picnik.html.
a. We note that the Gaussian is an imperfect PSF model,12 but still useful as a first-order approximation.
The original version of this paper is entitled "Exploring Photobios" and was published in ACM Transactions on Graphics (SIGGRAPH), 2011.
Figure 2. The face graph. Face photos are represented by nodes, and edges encode distance (inverse similarity) between faces. Distance is a function of facial appearance (expression, hairstyle, etc.), 3D head pose, and difference in the date/time of capture. Once constructed, we can compute smooth transitions (red arrows) between any pair of photos (e.g., red and blue bordered photos in this example) via shortest path computations.
Figure 3. Automatic alignment and pose estimation. We first localize the face and estimate fiducial points (e.g., eyes, nose, mouth). Then a 3D template model is used to estimate pose and to warp the image to a frontal view.
Figure 4. Appearance similarity is calculated separately for eyes, mouth and hair. For each region we show the query image and its nearest neighbors from a dataset of photos of the same person across many ages and expressions. Note how a closed eye input retrieves other faces with mostly closed eyes, and similarly for open mouth and hairstyle.
Figure 5. Cross dissolve synthesizes motion. Notice how the edges of the nose and mouth move realistically, as does the lighting (more clearly seen in video: http://grail.cs.washington.edu/photobios/).
Figure 6. Cross dissolve of sine and phased-shifted sine, with small (0.3), medium (0.6), and large (0.9) shifts. We show film strips of the cross dissolve, with rows corresponding to times t = 0 (input frame), 0.25, 0.5, 0.75, and 1 (input frame). The location of the edge is marked in red and the location corresponding to a linear motion is marked in blue. The displacement of the red and blue lines for larger shifts demonstrates the nonlinear ease-in, ease-out speed curves (better seen in the video). Also note the decrease in contrast for larger shifts. To better visualize the (nonlinear) edge motion (for a 0.9 shift), we remove the contrast change (far right) by scaling the image by the inverse of c.
Figure 7. Example paths produced with our method using different types of datasets: (a) time-lapse photo collection of Jason (1598 photos), (b) personal photo collection of Ariel over 20 years (1300 photos), (c) George W. Bush photo collection (530 photos), and (d) Amit's photos (584 photos over 5 years). The end points (marked in red) were chosen by the user and all the intermediate pictures were selected automatically by our method. Note the smooth transition in facial expression as well as pose.
©2014 ACM 0001-0782/14/09
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from email@example.com or fax (212) 869-0481.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2014 ACM, Inc.
No entries found