
Beyond Deep Fakes

A conceptual framework and research agenda for neural rendering of realistic digital faces.


Within the next five years, digital humans (chatbots and avatars with highly realistic human faces) will change the way we work, live, play, and learn. Digital humans are already gaining popularity as social media influencers, and they will soon evolve into digital sales assistants, fashion advisers, and personal shoppers able to model how customers will look and move in the latest ensembles.


Key Insights

  • Neural rendering, a technique for producing highly realistic human faces, encompasses far more than “deep fakes,” a term that carries strong negative connotations.
  • We outline four use scenarios: face swapping, language translation, beautification, and giving AI a face and voice.
  • We present a four-step conceptual framework for deriving neural rendering use cases: the source (still images, video, statistical models), the target (existing video, bespoke video, puppeteered), the controller (video, human, AI), and the intent (impersonation, oneself, new character).

Digital humans will become central to the multibillion-dollar fashion industry, as social media is further integrated into the retail customer experience. Digital humans will also help in healthcare, enabling medical students and social workers to develop better interview skills for patients in sensitive clinical settings. They will allow people, especially those with mental health challenges, to rehearse for job interviews. They will help keep elderly people connected to their communities and respectfully monitored so they can remain in their homes longer. They will provide a human face for personalized advice, support, and training—and do it at scale.

This has become possible with the advent of cost-effective, highly realistic, personalized interactive digital agents and avatars sporting high-fidelity facial simulations powered by advances in both real-time neural rendering (NR) and low-latency computing.

NR refers to the use of machine-learning (ML) techniques to generate digital faces or face replacements in video.17 NR rose to prominence with the advent of so-called “deep fakes”: the replacement of someone’s face in a video with an NR-generated face of remarkable realism. The term originates from the name of a Reddit user (/u/deepfakes), an ML engineer who posted the original deep-fake autoencoder. Often used for satire, deep fakes can be harmful and present novel ethical issues. The best-known examples involve deep fakes of celebrities, a form of face “hijacking” whereby publicly available videos of a person are used to train an ML program that overlays the source person’s face onto existing video footage; the technique was originally used in pornographic material.

However, NR can do more than swap faces in videos. Unlike traditional computer-rendered human faces, which are modeled, textured, animated, and rendered, NR digital humans are inferred from video training data using learning algorithms.17 The computer produces a plausible image of what a person might look like in a particular setting, lighting, and pose. As such, NR solutions differ significantly in process and can enable new and powerful applications beyond most traditional animation approaches. Moreover, the rapid advancement of computing technology means that, after training, a face can be inferred extremely quickly, enabling previously unrealized interactive applications.

The deep-fake phenomenon is one element of a larger progression in creating photo-realistic digital characters, avatars, and agents.12 These digital human entities are being adopted in many industries, such as entertainment, gaming, fashion, education, and communication. The field of digital humans extends from digital representations of people in videos to fully interactive synthetic digital agents and virtual avatar representatives. These agents now appear as influencers and organizational representatives, as well as complex avatars that substitute for their owners in virtual meetings or events.

Driven recently by the COVID-19 pandemic, many digital production companies are investigating cost-efficient ways to generate fully digital characters that require only minimal involvement from real actors. This has spurred the emergence of a range of startup businesses, many of which have attracted significant venture capital. For example, virtual influencers such as Lil Miquela have allowed Brud to raise more than $125 million from investors, and Synthesia recently raised $90 million in a series C funding round, while virtual mentors such as Digital Deepak Chopra and the AI Foundation previously raised more than $10 million. The popularity of virtual performers in events such as Travis Scott’s Fortnite concert or the John Legend digital concert enabled Wave to raise more than $30 million and virtual real-world celebrities company Genies to raise $35 million.2,8,11,14,15,16


NR is set to revolutionize the field of natural face technology12 by presenting faster and more cost-effective renderings than traditional computer graphics (CG). Whereas in CG, a face must be painstakingly built from scratch, NR can infer faces by using existing video footage as training data.

As such, NR is a new and rich field combining fast-moving technological advances, but social science research on its implications for individuals, organizations, and society is lacking. The focus of this article is thus on the sociotechnical implications of the different types of NR rather than on the underlying technical research driving its development.


Background

Following a brief introduction to traditional CG approaches for generating realistic digital humans, this article introduces NR, from the field of artificial intelligence (AI) research, as an emerging alternative that rivals CG in quality but at much lower cost and with some operational advantages. The section ends with an overview of NR techniques for producing highly realistic characters.

Traditional CG approaches to faces. With traditional CG approaches to digital face generation, the target face is built stepwise using labor-intensive modeling and computer animation techniques.3 It typically starts with a 3D model, either built digitally by hand or sampled by scanning a real face. In both cases, the CG model is then rigged with an animation control system, textured, artificially lit, animated, and rendered with some form of light simulation.13 Achieving the required accuracy in face detail and lighting complexity can be an extremely time-consuming process.

A central goal of CG face generation has always been realism. However, achieving the final stages of near-perfect realism through a traditional CG pipeline requires an exponential amount of effort, necessitating both highly skilled artists and extended render times.13

For decades, algorithmic advances and simulations have aimed to better emulate reality with ever more accurate simulation models. For example, higher model fidelity and skin response were made possible by advanced facial scanning approaches, such as Light Stage.3 More complex lighting and rendering approximations of the way light interacts with materials have become commonplace as lighting systems moved from earlier forms of rasterization to advanced sampling techniques in bidirectional path-tracing or ray-tracing solutions. Alongside these progressive improvements in solving the lighting equation, computer hardware has advanced to handle the increasingly complex computations.

For example, in the MEETMIKE project,12 the fully digital MIKE character was rendered in the Unreal Engine using rasterization. The character looked realistic and was interactive, being controlled as a digital puppet by a human and rendered in real time. This project used advancements in game engines, specifically the Unreal Engine, to achieve a high frame rate and realistic representation of the source actor.

As ML has advanced, traditional CG face generation has embraced it in the CG pipeline, and it is not uncommon for a modern feature-film CG pipeline to contain ML subsystems.12 For instance, while the digital CG MIKE itself is not generated with NR, it uses ML techniques and computer-vision tools to aid in both the modeling and the real-time facial decoding of the source subject’s expressions. The CG face on the left in Figure 1 is controlled by head-mounted infrared computer-vision cameras. The pair of images on the right allows for a real-time reconstruction of the facial performance seen on the left.

Figure 1. MEETMIKE (left). A traditionally rendered CG character, controlled in real time in VR. Watch video at https://vimeo.com/manage/videos/434595748.

Neural rendering of faces. NR techniques are defined as “deep image or video generation approaches that enable explicit or implicit controls of scene properties such as illumination, camera parameters, pose, geometry, appearance, and semantic structure.”13 Key to this definition is that it goes beyond simple generative ML approaches and encapsulates controllable image generation. The ability to direct the rendering process with some control makes the range of possible techniques wide and enables novel view synthesis, semantic photo manipulation, facial and body re-enactment, relighting, and free-viewpoint video.13 This allows for a broad range of use cases, from the widely discussed image manipulation (deep fakes) to the creation of photorealistic avatars for virtual reality (VR) and augmented reality (AR), virtual telepresence, and digital assistants.

Much of the published research on NR has focused on the computer algorithms and big-data analysis techniques for generating digital faces. At the technical implementation level, NR often uses neural networks built on Ian Goodfellow’s seminal work on generative adversarial networks (GANs),4 combined with variational autoencoders (VAEs) from researchers such as Pushmeet Kohli.6 The original deep-fake image manipulation uses autoencoders (AEs) to find a latent feature or embedding space that allows data, such as a face, to be encoded and then decoded with an acceptable loss. This basic AE approach is no longer the most advanced of the NR approaches,1 but the term “deep fake” remains a catch-all phrase applied rather loosely to NR applications. The CVPR State of Neural Rendering report13 summarizes many of the current technical innovations.
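
To make the autoencoder idea concrete, the following is a minimal sketch of the classic deep-fake architecture: one shared encoder learns a common latent space for two identities, and a separate decoder per identity reconstructs faces from that space. A swap is performed by encoding a face of identity A and decoding it with identity B’s decoder. All layer sizes, names, and training details here are illustrative assumptions, not any production tool’s actual implementation.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Shared encoder: maps a 64x64 RGB face crop to a latent vector."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.1),    # 64 -> 32
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.1),  # 32 -> 16
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.1), # 16 -> 8
            nn.Flatten(),
            nn.Linear(256 * 8 * 8, latent_dim),
        )

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Per-identity decoder: reconstructs a face from the shared latent space."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 256 * 8 * 8)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, z):
        return self.net(self.fc(z).view(-1, 256, 8, 8))

encoder, decoder_a, decoder_b = Encoder(), Decoder(), Decoder()

# Training: each decoder learns to reconstruct its own identity through the
# *shared* latent space, here with a simple L1 reconstruction loss.
faces_a = torch.rand(8, 3, 64, 64)  # stand-in for aligned face crops of identity A
loss_a = nn.functional.l1_loss(decoder_a(encoder(faces_a)), faces_a)

# Inference ("swap"): encode identity A's face, decode with identity B's decoder.
swapped = decoder_b(encoder(faces_a))
```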

Unlike the iterative improvement to the modeling, texturing, lighting, and rendering of a traditional CG approach, NR methods infer digital faces with statistical techniques from training data.13 Usually, a source face is used as training data to infer and apply a new face or expression to a target face in the final video or interactive experience. Other training data is used as input, and at times additional control inputs are also applied to moderate or enhance the final reimagined target character. While the source data is typically video, in the simplest form, a face can be rendered or inferred from a single image, as shown in Figure 2. The same actor is used throughout the article to illustrate the range of NR examples and to better facilitate visual comparison; using different source actors makes direct comparison more difficult. Links to video samples are provided, since movement is a key component of NR.

Figure 2. A real person (left) and an animated digital human inferred from a single jpeg image (right), created using Pinscreen neural rendering.

Using deep-learning approaches, it is possible for the computer to learn both the source and target human faces and then to infer what a human face would look like in a particular position, with particular lighting and a specific human expression.11 Related techniques can go even further and infer or invent a new face that has never existed.14

While NR techniques do not yet allow for a full range of complex movements, the realism achieved often equals or exceeds that of more traditional approaches. In addition, because it relies on training data, NR is often characterized by reasonable preparation times but much faster final rendering.

Example techniques for NR faces. There are three main techniques that use NR for face generation and animation:

  • Face swapping with existing video.
  • Face synthesis that generates completely artificial faces.
  • Hybrid solutions that combine CG modeling and face swapping.

Face swapping and deep fakes. The approach that has most commonly been used for high-quality face swapping is image-to-image translation using variational autoencoders,11 where training data is provided for both faces. To position a new face onto a target face, the system requires adequate training data for both faces in similar poses and lighting. More recent research advances have allowed for innovative approaches to real-time face swapping without extensive prior target face training data; however, quality can be improved with more training data.5

This type of face hijacking was enabled, and extensively disseminated, by the open source publication of several implementations. The earliest approaches to automatic face replacement incorporated 3D-morphable models.11 Initial AI approaches aimed to use ML to find the single best match to a single image and then blend it into the target image. However, these early attempts were either unrealistic or limited to still images. The deep-fakes approach is believed to be based on the work of Korshunova et al. (2016), which used a convolutional neural network (CNN).11 Many newer NR face-swapping techniques share basic characteristics with this work. The original research was subsequently extended, and an open source platform for practical deep fakes, called Faceswap, was released in 2017. This served as the basis for DeepFaceLab, released by Ivan Petrov in 2018.11 It is believed that most celebrity face swaps in circulation on platforms such as YouTube were created with DeepFaceLab, with Petrov himself putting the share at 95%. One popular example is the DeepTomCruise series by Chris Ume.11

It is worth noting that face replacements in early 2017 were effective, but high-resolution imagery was often difficult to generate due to memory limitations or a failure to blend lighting and skin texture (see Figure 3). Other versions exhibited temporal instability, meaning they looked acceptable on still frames but sometimes flickered in moving clips.5

Figure 3. An early face swap or deep fake.

Improvements have included better temporal-contrast matching and skin blending, but insufficient training data at the appropriate resolution can still lead to images that appear Gaussian-blurred or that do not match the correct head orientation. The training data defines the space within which the best solution can be achieved. While some extrapolation beyond it is possible, the best results require training data at the same or higher resolution than the target, with appropriately similar facial orientation and lighting. For example, a side view is not easily inferred from front-facing training data.

The deep-fake example in Figure 3 used the stylized autoencoder (SAE) option to swap a source face from a clip shot outdoors onto a clip captured indoors. The source face in the training footage was lit from above, which produced shadows, so the resulting target face has unrealistic shadowing under the chin and on the ears. The Poisson blending seen in Figure 3 cannot fully compensate for this training-data mismatch, which leads to a visually distracting result.
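
For reference, the blending step is commonly performed with OpenCV’s Poisson (“seamless”) cloning. The sketch below uses synthetic placeholder arrays so it runs as-is; in a real pipeline, the face crop would be the autoencoder output warped to the target pose.

```python
import cv2
import numpy as np

# Stand-ins for real data: a rendered face crop (for example, autoencoder
# output warped to the target pose) and the destination frame.
face = np.full((128, 128, 3), 180, dtype=np.uint8)   # placeholder face crop
frame = np.full((480, 640, 3), 90, dtype=np.uint8)   # placeholder target frame

# Mask selecting the region of the crop to blend (here, the whole crop).
mask = 255 * np.ones(face.shape[:2], dtype=np.uint8)

# Where in the target frame the blended face should be centered.
center = (320, 240)

# Poisson blending matches gradients across the seam, hiding hard edges and
# some low-frequency lighting differences, but as Figure 3 shows, it cannot
# fix a lighting mismatch baked into the training data itself.
blended = cv2.seamlessClone(face, frame, mask, center, cv2.NORMAL_CLONE)
```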

FaceSwap is a traditional autoencoder architecture trained on pairs of images from two identities, whereas DeepFaceLab uses a dual, Y-shaped autoencoder architecture.15 Its more recent network design, the lightly improved autoencoder (LIAE), incorporates two distinct ‘bottleneck’ sections, or latent-space constraints: one is fed the features of both identities, and the other only those of the destination face. The two identities share the encoder, the decoder, and the joint bottleneck (InterAB). Because these ‘inter’ sections are shared, the lighting of the source can be carried across to the new face. DeepFaceLab can also incorporate an optional GAN loss, enabled part-way through training the autoencoder, in which a discriminator provides an additional loss term.
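
The dual-bottleneck wiring can be sketched as follows. This is one plausible reading of DeepFaceLab’s published LIAE description; the module contents and sizes are placeholder assumptions, not the actual network.

```python
import torch
import torch.nn as nn

latent = 256
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, latent))
inter_ab = nn.Linear(latent, latent)  # bottleneck fed features of BOTH identities
inter_b = nn.Linear(latent, latent)   # bottleneck fed ONLY the destination face
decoder = nn.Sequential(nn.Linear(2 * latent, 3 * 64 * 64), nn.Sigmoid())

def reconstruct_source(x):
    # Source faces are routed through the shared InterAB bottleneck.
    code = inter_ab(encoder(x))
    return decoder(torch.cat([code, code], dim=1))

def reconstruct_destination(x):
    # Destination faces combine their own InterB code with the shared InterAB code.
    z = encoder(x)
    return decoder(torch.cat([inter_b(z), inter_ab(z)], dim=1))

# Both identities share the encoder, the InterAB bottleneck, and the decoder,
# which is what allows properties such as lighting to carry across at swap time.
faces = torch.rand(4, 3, 64, 64)
out_src = reconstruct_source(faces)
out_dst = reconstruct_destination(faces)
```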

This is not a solved problem; there is also commonly ‘identity leakage,’ where the original face seems to momentarily appear in the output clip. The autoencoders are trained to fit only two identities, and the data used for training is often also used to evaluate the result: the video the user wants to face-swap onto is frequently part of the training set. This is not the protocol for a normal, generalized ML solution. The resulting overfitting is often not considered a problem, since the user hopes to accomplish a specific illusion, and a generalized solution is not a strong requirement.


Newer solutions improve on many aspects of earlier software but still depend on good training data. For example, even footage with matching lighting, resolution, and camera angles can produce inadequate results if the training data includes motion blur or facial occlusion. Algorithms are improving constantly, but automatically handling occlusion in the target video remains only a partially solved problem and often requires ML segmentation-based image correction and compositing.

Control or ‘art directability’ of this class of NR is not a requirement. The ability to creatively vary the final composite’s facial features in interesting ways is not the goal. These main face-swapping approaches seek to match like-for-like lighting, orientation, and expression of one face with the identity of another.

Face synthesis. GANs excel at generating random realistic faces using statistical big-data approaches that result in a new face, resembling an amalgam of the training dataset.11 For example, the output face would look like a fashion model with classically defined beauty if built from a dataset consisting of classically beautiful fashion models.
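
In practice, synthesis reduces to sampling a random latent vector and passing it through a trained generator. The sketch below uses an untrained toy generator purely to show the mechanics; a real system would instead load trained weights, for example from a StyleGAN-family model.5

```python
import torch
import torch.nn as nn

# Toy stand-in for a trained GAN generator; real systems load trained weights.
latent_dim = 512
generator = nn.Sequential(
    nn.Linear(latent_dim, 1024), nn.ReLU(),
    nn.Linear(1024, 3 * 64 * 64), nn.Tanh(),
)

# Each random latent vector maps to a different, novel face: an amalgam of
# whatever distribution of faces the generator was trained on.
z = torch.randn(4, latent_dim)
faces = generator(z).view(4, 3, 64, 64)
```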

AI models can also be trained on the visual data of a single individual to produce a digital human that resembles that individual (see Figure 4). However, the digital human is bound by the training data. For example, if the training data does not include the back of the individual’s head, the digital human cannot turn around. The remarkable realism of this form of digital human highlights the power of AI.

Figure 4. Entire digital person NR by Synthesia (U.K.).

Efficiency may trump realism when considering organizational use. For example, if a warehouse worker needs training on several different safety topics or company policies, and the choice is between reading a five-page PDF manual or watching a two-minute video, one can imagine the appeal of the video. The worker does not care about how it was created or if it is perfectly realistic. The video serves a purpose and allows them to move forward.

Hybrid NR and CG solution. A more recent yet powerful NR approach is a hybrid solution that combines real-time CG with a face overlay. In Figure 5, from the LA-based company Pinscreen, a neurally rendered MICHAEL is animated: the base video is fully digital CG, and the inferred synthetic face is blended in to increase realism. The result is both photographically realistic and fully digitally generated. Figure 6 shows the difference in quality between a purely CG-built character and the hybrid version.

Figure 5. A fully digital MICHAEL produced as a hybrid.

Figure 6. A comparison of a CG-only face: MIKE (left) and the hybrid version using paGAN 2: MICHAEL (right).

Combining the face swap inside the render engine that produces the base CG character has three advantages. First, the software must isolate the face and orientation of the person being replaced. If this is done in the engine, the computer already knows the head position and orientation, reducing errors and improving the efficiency of the process.

Second, earlier methods had issues with occlusion, requiring additional processes to allow the face to blend behind a hand or an object in front of it. In the hybrid model, the engine knows the position of such objects, since they are computer generated. It can therefore perform a face replacement without the special occlusion allowances that would normally be required to deal with objects obscuring the face.

Third, earlier methods required quite specific training data, which can be time-consuming and difficult to obtain. To make a more generalized, robust, and stable solution, several additional steps, including the use of extra AI in the process, can be deployed. The underlying CGI model allows for the generation of useful synthetic training data for the NR process.

A neurally rendered digital human can be deployed as a digital influencer or company spokesperson. Companies such as Digital Domain (LA) are experimenting with virtual assistants in team video conference calls. The company has already had success with testing and creating virtual concierges and digital call-center agents.


Examples of NR Faces

There are a variety of ways in which NR can be applied in the emerging field of digital humans. We present four examples of how the technology works in practice to illustrate how the inferencing is applied. We then draw on these examples in the next stage to develop our conceptual framework.

Face-swap impersonation (Type 1). This type of application builds on the original deep-fake idea and refers to the replacement of a face with the face of another person, where the replacement face is animated by the motions and expressions of the source face (see Figure 7). This allows for the impersonation of one person by another, either using recorded footage or live video.

Figure 7. NR impersonation (left) and the source (right). Watch the video at https://vimeo.com/manage/videos/434564568.

This application makes it appear that someone has said or done something they did not actually say or do. For example, a stunt performer could have the source actor’s face incorporated into the stunt, completely masking the identity of the stunt person. Here, the inference is used to produce the actor’s new face, which shows the expressions of the source actor placed into the context of the target video.

This type of NR was used extremely effectively in the documentary Welcome to Chechnya. Ryan Laney is a film professional with a long history of using technology to support storytelling.11 He completed Welcome to Chechnya in 2020, working with the filmmakers to use NR as a novel technique to protect the identities of members of the LGBT population filmed in the Russian republic of Chechnya. Using NR, 23 individuals or witnesses shown in the film had their faces swapped with the faces of volunteers or actors in 480 shots during post-production.

Video dialogue re-enactment (Type 2). In video dialogue re-enactment (VDR), the face is not replaced with a different one but is animated by another person, who now determines facial movements and expressions. This technology enables a person to appear to speak in a different voice or language. One example is as a replacement for dubbing or subtitles (see Figure 8). An actor could be seen to accurately speak in a different language, and the audience would hear the dubbing actor’s voice as if it were the main actor speaking. A new voice can also be synthetically generated for the target performance. This differs from Type 1, as the target individual retains the same facial identity but is now controlled by another.

Figure 8. Original unaltered (left) and VDR (right). Watch the video at https://vimeo.com/manage/videos/434556410.

The Champion was the first full feature film to leverage NR in this bespoke way, converting the actors’ performances from Polish and German into English. In this drama, set in World War II, the production was filmed and finished in one language; then, using NR, the actors’ faces were replaced speaking English, visually inferred from actors recording dialogue in a sound studio.13 The story was filmed without any consideration of later dialogue replacement and encompassed hundreds of shots. The final film appears as if shot in English.

De-aging or digital beauty work (Type 3). The third application type maintains the original identity but aims to alter the appearance, such as de-aging or applying digital makeup. There are a range of tools for digital alterations to captured footage. For example, photo manipulation of static images with Adobe Photoshop is common. Adobe has introduced neural filters that allow for digital makeup or aging to be applied to a still image with NR technology (see Figure 9).

Figure 9. Photoshop neural filters are used to vary appearance.

Illusions can also be created on moving footage with standard effects tools, such as The Foundry’s Nuke or Adobe AfterEffects. However, these solutions are not fully automated and require human artistry. There are automatic, real-time image-processing tools for creating live imagery, such as Snapchat filters, but they are rarely photorealistic. The original 2014 Snapchat Geofilters were simple overlays, but they have since evolved into complex AR tools that often use AI. The primary intent of these filters is for entertainment with a focus on mobile use. Filters such as the “Burgundy Makeup” Snapchat Lens or the “Skin Smoother” Snapchat Filter aim to provide realistic digital makeup, augmenting people in real time with skin, eye, and lipstick enhancements.

Advanced non-real-time digital makeup can reliably de-age actors, using a suite of NR tools. Figure 10 shows a clip of a person de-aged with NR, and the result compares favorably to a video of the same person recorded 10 years earlier. It is not an identical reproduction, but rather a plausible and realistic synthetic inference based on appropriate training data.

Figure 10. Real person (left), and made to look 10 years younger (right). Credit: Thiago Porto. Watch the video at https://vimeo.com/manage/videos/434556584.

Digital assistants (Type 4). NR can be used to create interactive digital assistants with visually complex faces derived by an AI engine. There are also a variety of virtual companions, intelligent assistants, and chatbots, such as Amazon’s Alexa and Apple’s Siri, that could be given a face but currently output only text or audio.

Technologies have been developed that produce audio-driven, AI-based facial animation, such as NVIDIA’s Audio2Face. They enable a synthesized voice to drive a realistic facial-animation system or a stylized Animoji-type system.18
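
The core idea can be sketched as a sequence model that regresses per-frame animation controls, such as blendshape weights, from audio features. The module below is an illustrative assumption in the spirit of the attention-based bidirectional-LSTM approach of Tian et al.,18 not NVIDIA’s implementation.

```python
import torch
import torch.nn as nn

class Audio2AnimSketch(nn.Module):
    """Illustrative audio-driven animation model: a bidirectional LSTM maps a
    sequence of audio features to per-frame blendshape weights."""
    def __init__(self, n_audio_feats=26, n_blendshapes=51, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(n_audio_feats, hidden, batch_first=True,
                            bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_blendshapes)

    def forward(self, audio_feats):            # (batch, frames, n_audio_feats)
        seq, _ = self.lstm(audio_feats)
        return torch.sigmoid(self.head(seq))   # blendshape weights in [0, 1]

model = Audio2AnimSketch()
mfccs = torch.rand(1, 100, 26)   # stand-in for 100 frames of audio features
weights = model(mfccs)           # (1, 100, 51): animation curves over time
```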

Companies such as Soul Machines have implemented digital humans as interactive customer-facing representatives. Madera Residential’s digital leasing consultant Mia is one example (https://youtu.be/6K85bUFtOSo). Many future digital assistants with realistic faces will be synthetically inferred, in addition to current CG approaches.

Synthesia has developed an entire platform for NR-generated assistants, with the aim of using only code, not cameras, to make films. The primary business enables the creation of NR digital people based only on audio or written input. Use of this service grew dramatically during 2021; the majority of Synthesia’s 12 million generated videos have been created since April 2021.11 As a secondary R&D and testing ground, the company annually produces one bespoke NR AI special project. These projects include two ‘Malaria No More’ face-replacement spots with soccer player David Beckham and, most recently, the personalized Messi Messages for PepsiCo. The latter not only provides an NR celebrity at scale; the messages are also personalized by the NR avatar of soccer player Lionel Messi, who addresses each user by name. This technology has since been folded into the company’s main avatar business, enabling the API to automatically generate digital humans that companies can use, for example, to personally thank shoppers at the end of e-commerce sales interactions.


A Framework for NR Faces

NR is an emerging technology that is revolutionizing the field of face generation and animation in film, entertainment, and gaming. However, it also has potential for positive application in business, education, and health, alongside its malicious uses, as demonstrated by the deep-fake phenomenon. Having provided example applications, we now introduce a sociotechnical framework that integrates the technical aspects of face production with the social aspects of its use. The framework conceptualizes NR applications with a view to supporting both research and practice in designing, deploying, and using highly realistic human faces in a wide variety of applications. It distinguishes the source materials for NR faces, how faces are deployed onto target material, how faces are controlled, and the use scenarios.

How a face is inferred (Source). Much of the research on NR has explored the algorithmic nature of face inference. While a discussion of the full range of technical details is beyond the scope of this article, it is important to note that the source material used in NR applications is critical, as it relates to pertinent questions about use: for example, whether images posted in the public domain can be used to generate malicious face swaps. Such use raises important questions regarding copyright, ethics, and responsibility.

Inputs to the NR process include the material containing the face data used to derive the new face. Three types of source data can be distinguished as inputs to the inference process: still images, video clips, or a statistical model derived from large amounts of training data of different people.

Most NR applications today use video material as input. The source can be either existing material or new bespoke material that is shot and lit for better quality control. Types 1, 2, and 3 in the Examples of NR Faces section all use video as the source material. In Type 1, the actor’s face was generated from existing material in the public domain; for Type 2, bespoke material was created; while Type 3 used existing private footage.

A third way to bring new faces into existence for NR applications is digital face synthesis from statistical models, as previously discussed. While new faces generated from statistical inference are not easily animated, this is a rich area for future research.

How the face is deployed (Target). Once face data has been generated by NR, it is deployed onto the target material: the inferred face is positioned onto an existing body and head, thus performing a face swap. This can be done with different types of target material. We distinguish three types that offer increasing degrees of control. First, a face swap can be performed on existing video footage available in the public domain, for example, faces of politicians, which is what was done in the original deep fakes. A second way is to create bespoke video material onto which the inferred face is placed, allowing for better control of the scene and a wider range of expressions. A third way is to produce a unique new performance by puppeteering the NR face (by AI or human) to create a video or a live interactive performance. The Synthesia process of an NR person is such an example (Figure 4).

How the face is controlled (Control). The next step in the framework is what drives the NR face. There are three possibilities. The first is to have the face driven by facial expressions in the source video, regardless of whether the footage is pre-existing or bespoke material.

A second way is to insert a third party whose facial expressions drive the face, either in a recording or in a live animation. VDR, or facial re-enactment, uses this type of face control: the facial movements recognized in a third-party video drive the NR face animation to match that party’s voice. Real-time puppeteering of a digital avatar using a face rig also falls into this category.

The third option is to use an independent controller, such as AI or a human, so the output face is not controlled by a video but by an independent source, such as a purely voice-based solution.18 This form of control is necessary for creating virtual assistants, as shown in Type 4 of the Examples of NR Faces section. Many factors must come together to enable real-time interaction with realistic faces, including natural speech understanding and contextual appropriateness.

How the character is used (Intent). There are three main ways that NR characters can be used: to impersonate another person and make them appear to do or say something they did not actually do or say; to animate a version of oneself or present a different version of one’s actual appearance; or to generate an entirely new, synthetic character.

In the first example, impersonating another person, the intent could be to bring a deceased actor back to life in a film, or it could be to deceptively impersonate someone without their consent. Such deception can be for comedic effect, for fraudulent purposes, or to do harm. Type 1 as described in the Examples of NR Faces section, which was created for educational demonstration purposes, falls into this category, but so do typical deep fakes.

The second example is to animate one’s own face, either to create a simple digital avatar for real-time use in online contexts, such as virtual reality,12 or to present oneself in a way that differs from reality. This includes a range of use contexts, such as de-aging (Type 3) or giving the impression that one is speaking a different language, as with VDR in Type 2.

The third example is the creation and use of an entirely new synthetic character not based on a real person. This could be used to hide an identity while using an avatar or to put a face on a virtual digital assistant.


Summary

This framework distinguishes four separate areas of design concern that should be considered when creating digital humans using NR (see the Table). When neurally rendering a character, the four areas can be read akin to a process model, as a sequence of steps representing different design choices. At the same time, the model is useful for analyzing and classifying existing examples, such as the ones presented in the Examples of NR Faces section.

Table. NR classification framework.
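
For cataloging NR applications, the framework’s four dimensions can also be encoded directly. The following minimal Python sketch uses the option labels (a, b, c) from the Table and classifies a classic deep fake as an example; all identifiers are illustrative.

```python
from dataclasses import dataclass
from enum import Enum

class Source(Enum):
    STILL_IMAGES = "a"
    VIDEO = "b"
    STATISTICAL_MODEL = "c"

class Target(Enum):
    EXISTING_VIDEO = "a"
    BESPOKE_VIDEO = "b"
    PUPPETEERED = "c"

class Controller(Enum):
    VIDEO = "a"
    HUMAN = "b"
    AI = "c"

class Intent(Enum):
    IMPERSONATION = "a"
    ONESELF = "b"
    NEW_CHARACTER = "c"

@dataclass
class NRApplication:
    name: str
    source: Source
    target: Target
    controller: Controller
    intent: Intent

# A classic deep fake: a video source deployed on existing footage, driven
# by that video, with the intent of impersonating another person.
deep_fake = NRApplication("deep fake", Source.VIDEO, Target.EXISTING_VIDEO,
                          Controller.VIDEO, Intent.IMPERSONATION)
```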

Figure 11 shows the classification of the four types of NR applications previously presented using our framework. We also added the example of a simple deep fake. Note that deep fakes represent only one pathway through the framework, which again underscores the difference between NR as a technology for digital face inference and animation and deep fakes as one example application.

Figure 11. Classification of NR faces using the framework.


An Emerging Research Agenda

NR is closely associated with the deep-fake phenomenon. We advocate that deep fakes be viewed as just one application of the technology, as NR promises to introduce a range of new applications not associated with the often malicious and harmful uses of deep fakes in the public sphere.

As with many new technologies, bad actors may deploy some of these NR advances in ways that are unethical or even deliberately damaging. Such problems were encountered with the advent of widespread digital retouching of still images with programs such as Adobe Photoshop. There are technical approaches to be researched, outside the scope of this article, to detect such modified material. A primary weapon in combating misuse is an informed population. Articles such as this, explaining and alerting the community to the extent of possible alterations, can raise awareness and promote a healthy skepticism against abuse.

In this section, we highlight sociotechnical research avenues that emerge from this broader view of NR as an enabling technology for generating and animating realistic digital faces. We omit research on the technical ML aspects, as these cover a much broader class of AI problems and use cases.

Interactivity opens a wide range of potential use cases (for example, advisory contexts) and research opportunities (for example, user acceptance and efficacy). While some of these applications could be done with traditional visual effects and animation, NR often provides higher fidelity, faster run-time implementations, and the ability to scale otherwise costly modeling, while simultaneously offering personalized digital humans with individuality.


Areas of application. Many design questions remain unanswered. Indeed, a range of applications for NR faces can be envisioned, each with unique questions:

  • What will be the impact of NR-generated characters in entertainment and gaming? Will there be a substitution and salary effect on the use of human actors? How will NR contribute to more inclusive narratives that adjust backgrounds or languages of characters to different viewing demographics?
  • What are the implications for the audience and wider industry acceptance of NR-based dubbing in entertainment and training videos, delivering the same videos in a range of different languages with believable facial expressions?
  • Will NR enable consumers to see themselves in a different way? How can NR enable more accurate digital outfit try-ons and sizing to reduce returns and increase customer engagement?
  • What are the applications when we consider interactivity? Could an NR character be used to train physicians to treat patients with mental illness? If so, how? Can NR be used to assist people living with an acquired brain injury, such as a palsy from a stroke, for example, by presenting a version of themselves without the partial facial droop and helping to re-establish a plausible version of their prior selves?
  • Can aging/de-aging of consumers influence decision-making (for example, when making financial retirement decisions for one’s older self)? What are the implications of marketers varying the ethnicity and gender of characters to tailor them to a customer’s demographics?
  • Can de-aging be suitably used in health or aged care to help stroke, dementia, amyotrophic lateral sclerosis (ALS), or Alzheimer’s patients by connecting them with younger versions of their relatives or themselves?
  • NR allows for anyone’s face to be used; does it make sense to deploy a celebrity at scale, with individualized messages and interactions from a celebrity speaking ‘directly’ with you?

User reactions and acceptance. The use of video material to generate digital faces and the application thereof in various contexts raise myriad ethical, moral, and legal questions, many of which will only emerge in due time. Future research should investigate how users react to and accept various forms of NR digital characters in different usage contexts. Typical studies might include, but should not necessarily be limited to, the following inquiries:

  • Will the perceptions and behaviors of viewers differ if they are aware they are interacting with NR characters versus when they are unaware?
  • How (well) will users be able to distinguish between NR and real characters in various contexts, particularly in advertising and political (mis)information?
  • Will individuals trust digital characters in different contexts, such as education, politics, advertising, or business? Under what conditions?
  • Will people more readily accept digital assistants when they come with a realistic human face? Under what conditions?
  • Will people perceive an NR avatar as speaking for someone or as a separate identity? Will promises be as emotionally binding if they emanate from an avatar?

NR characters in social research. In addition to research on the design and application of NR-based digital humans, NR also provides opportunities to advance research in the social and psychological sciences. For instance, NR enables the generation of digital faces as research instruments to study human perception of facial features in ways that have never been possible before, such as the study of bias:

  • What are the effects of nuanced variations in skin color, age, or other facial features on human decision making or perception of character traits?
  • What are the effects of different faces when coupled with the same voice and vice versa?
  • Would you be more likely to be influenced by an NR version of yourself promoting a position?

Other forms of NR characters. Figure 11 showed five current examples of NR faces. The framework suggests many other possibilities:

  • A museum could create an exhibit in which the Duke of Wellington describes the Battle of Waterloo. The face could be inferred from a painting (source: a), deployed on a new target (target: b), using a written script (control: c), as an impersonation (intent: a).
  • A government could create a public service announcement customized to use a presenter of a different ethnicity for different target audiences. It could use statistically inferred faces of different ethnicities (source: c) to deploy a face on a purposefully created video (target: b), controlled by that video (control: a) or by separate audio (control: c), to create a synthetic character (intent: c).


Conclusion

Digital human characters produced by NR will become part of our everyday lives. This article provides a framework and examples for the application of this new and emerging technology. It also offers a set of initial research ideas.

Digital characters created by research labs and commercial entities have the potential to reshape social interactions in business and society. NR applications thus come with great potential for entertainment, commerce, education, and healthcare yet also with grave concerns because of their potential for malicious use in the creation of disinformation and fake news.1


Acknowledgments

Thanks to Jim Shen, Adapt Entertainment; Pinscreen; Thiago Porto; Epic Games; Synthesia; and CannyAI for their input and support of this work.

References

    1. Agarwal, S. et al. Protecting world leaders against deep fakes. CVPR Workshop (2019), 38–45.

    2. Bradley, S. Even better than the real thing? Meet the virtual influencers taking over your feeds. The Drum (2020); https://bit.ly/3YD06k2.

    3. Debevec, P. et al. Acquiring the reflectance field of a human face. In Proceedings of the ACM Siggraph 2000, 145–156.

    4. Goodfellow, I. et al. Generative adversarial nets. Mining of Massive Datasets 2nd Ed. (2014); doi:10.1017/CB09781139924801

    5. Karras, T., Laine, S., and Aila, T. A. style-based generator architecture for generative adversarial networks. CVPR (2019).

    6. Kulkarni, T.D., Whitney, W.F., Kohli, P., and Tenenbaum, J.B. Deep convolutional inverse graphics network. In Proceedings of the 28th Intern. Conf. on Neural Information Processing Systems 2 (Dec. 2015), 2539–2547.

    7. Leo, M.J. and Manimegalai, D. 3D modeling of human faces—A survey. In Proceedings of the 3rd Intern. Conf. Trendz in Information Sciences & Computing (2011), 40–45; doi:10.1109/TISC.2011.6169081.

    8. Mitchell, V. Salesforce Ventures part of US$40 million investment into Soul Machines. CMO Australia (Jan. 10, 2020); https://bit.ly/44mnWC4.

    9. Naruniec, J., Helminger, L., Schroers, C., and R.M., Weber. High-resolution neural face swapping for visual effects. In Proceedings of the Eurographics Symp. on Rendering 30 (2020), 1–15.

    10. Seymour, M., Riemer, K., and Kay, J. Actors, avatars and agents: Potentials and implications of natural face technology for the creation of realistic visual presence. J. Assoc. Information Systems 19, 10 (2018).

    11. Seymour, M. Deep neural rendering comes of age. fxguide (Dec. 16, 2021); https://bit.ly/3OWtbnj.

    12. Seymour, M. et al. Facing the artificial: Understanding affinity, trustworthiness, and preference for more realistic digital humans. In Proceedings of the 53rd Hawaii Intern. Conf. System Sciences 3, (2020), 4673–4683.

    13. Seymour, M. The neural rendering of the champion. fxguide (2022); https://bit.ly/44h8PcV.

    14. Shieber, J. More investors are betting on virtual influencers like Lil Miquela. TechCrunch (Jan. 14, 2019); https://tcrn.ch/44dJrVm.

    15. Stone, L. Partnership on AI, Kodiak Robotics, Faraday Future, more take out PPP loans. AI Business (2020); https://bit.ly/3OHzIRq.

    16. Takahashi, D. Wave raises $30 million for superstars to stage virtual concerts | VentureBeat (2020); https://bit.ly/45ubL71.

    17. Tewari, A. et al. State of the art on neural rendering. In Proceedings of the Computer Graphics Forum (2020).

    18. Tian, G., Yuan, Y., and Liu, Y. Audio2Face: Generating speech/face animation from single audio with attention-based bidirectional LSTM networks. In Proceedings of the 2019 IEEE Intern. Conf. Multimedia Expo Workshop, 366–371 (2019); doi:10.1109/ICMEW.2019.0.

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

The Latest from CACM

Shape the Future of Computing

ACM encourages its members to take a direct hand in shaping the future of the association. There are more ways than ever to get involved.

Get Involved

Communications of the ACM (CACM) is now a fully Open Access publication.

By opening CACM to the world, we hope to increase engagement among the broader computer science community and encourage non-members to discover the rich resources ACM has to offer.

Learn More