Deepfakes are having a moment.
Their dangers are becoming more widely known and better understood. The media is rife with articles detailing how quickly the technology has grown in sophistication and accessibility, and the risks that come with it.
Good.
The negative implications of deepfakes are troubling, and the better we understand them, the better we’ll be able to prevent their worst consequences. For better or worse, the technology is here to stay. But there is a “better” here—deepfakes have much in the way of lighthearted upside.
Though the debate around deepfakes has grown in stature and complexity, we still struggle to agree on a definition of the term. I think of it as any mimicry, manipulation, or synthesis of video or audio that is enabled by machine learning. Face-swapping, body puppetry, copying someone’s voice, and creating entirely new voices or images all fall into this category. Your Photoshop efforts, valiant though they are, don’t.
Image synthesis and manipulation can be a powerful tool for creators
Visual storytelling is an expensive business. Hollywood studios spend billions on creating spectacle that wows their audience or transports them to another world. The tools they use to do so—the tools these big players use to close the gap between what they can imagine and what they can create—remain prohibitively expensive for most creators, though less so than a decade ago. Deepfake tech incorporates the ability to synthesize imagery, potentially giving smaller-scale creators a similar capacity for bringing imaginative creativity to life.
Synthesia is a company with a commercial product that uses deepfake tech to do convincing, automated dubbing through automated facial re-animation. They shot to prominence with a video that featured David Beckham talking about malaria in nine languages, but their product could also be used to expand the reach of creators around the world. If you’re a talented artist who isn’t working in one of the world’s dominant languages, it’s potentially career-changing to have access to a product like this, which could make your work viable in additional languages and countries.
Adobe VoCo, though still at the research and prototype stage, is software that lets creators produce speech from text and edit it the way they would edit images in Photoshop. So if you want your movie short to be narrated by Morgan Freeman, you might be able to make that happen.
Tinghui Zhou, the founder and CEO of Humen, a company that creates deepfakes for dancing, sums up the industry’s goals: “The future we are imagining is one where everyone can create Hollywood-level content.” (Disclosure: I am an investor in Humen).
In the same way YouTube and Instagram shrank the distribution and creation advantage that entertainment companies and famous photographers enjoyed over talented amateurs and enthusiasts, this bundle of technologies might diminish the production advantage currently possessed by big budgets and visual effects houses.
Mimicry and manipulation of real life have always been part of art.
The applications mentioned above all concern closing the gap between creators with different resources, but deepfake tech could also enable entirely new forms of content that rest on the ability to mimic and manipulate material. Every medium of entertainment has incorporated the stretching, reflection, contortion, and appropriation of real source material.
We can already see the evidence of these new applications in the still-nascent use of deepfake tech today. While face swapping for porn lies at the malicious end of the spectrum, more benignly, the technology’s introduction also sparked a wave of videos face-swapping Nicolas Cage into different movies.
It might seem banal, but it was a form of content creation that, while previously technically possible, was practically infeasible before deepfakes. It’s not hard to imagine that the next deepfake content craze will be driven by automated lip-syncing, dance mimicry, or celebrity voice impressions.
Respeecher and Replica.AI are just two companies making voice mimicry accessible to non-techies. Check out my demo with Replica’s tech in San Francisco a few weeks ago (recognize the voice?). It’s a small slice of the future of entertainment and content. If you believe that culture in the digital era is the culture of remixing, then deepfake tech has an important part to play in the creation of that culture.
Deepfakes bring us closer to believable virtual humans
The ability to mimic faces, voices, and emotional expressions is one of the most important steps toward building a believable virtual human that we can actually interact with. We’re already taking tentative steps down the path to virtual humans. Personal assistants like Alexa, Siri, and Cortana have been around for several years, have reached a tipping point of consumer use, and are quickly improving. Having said that, in 2019 they still feel more like a new user interface that requires precise instructions than a virtual being you can interact with. Think a command line operated by speech.
Virtual humans are entering the mainstream in a different way: through the recent wave of digital influencers. I previously wrote about this trend in the context of animation history, but digital influencers are also meaningful in the context of believable virtual humans. They operate on the same planes of interaction (think Instagram and Pinterest) that most people do.
As such, you and I can comment on a Lil Miquela post or message Astro. This is interaction with a being that isn’t real. The digital influencer isn’t really responding to you in their own words — their content is created by storytellers, much as Pixar films have writers. But these digital influencers are laying the social groundwork for interaction with true virtual beings.
Deepfakes have the potential to plug the technological holes in smart assistants and digital influencers. Pushing Alexa or Lil Miquela to the level of virtual humans like Samantha from Her or Joi from Blade Runner 2049 requires the capacity to encompass and express human body language, speech, and emotion. If we counted the number of unique combinations of pose, vocal nuance, and facial expression you’ve made in your lifetime, it would likely run into the billions. For virtual humans to be believable, their actions can’t be preprogrammed in a traditional hard-coded sense, but must instead be extremely flexible.
Deepfake tech typically takes tons of examples of human behavior as inputs and then produces outputs that approximate or elaborate on that behavior. It could grant smart assistants the capacity to understand and originate conversation with much more sophistication. Similarly, digital influencers could develop the ability to visually react in a believable way in real time, thanks to deepfake tech. Bringing Mickey Mouse to life beyond a Disney cartoon or a guy in a suit at Disneyland is where we’re headed: 3D hologram projections of animated characters (and real people) that speak in realistic-sounding voices and move the way their real-world counterparts would.
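To make the “examples in, approximated behavior out” idea concrete, here is a deliberately tiny sketch of the shared-encoder, two-decoder layout popularized by open-source face-swap tools. Everything is an assumption for illustration: the “faces” are synthetic vectors, the networks are reduced to linear maps fit by least squares, and a real system would discover the latent space itself during training rather than being handed it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for aligned face crops of two people. Each "face" is a
# 16-dim vector produced from a shared 4-dim "pose/expression" latent
# through a person-specific appearance matrix.
latent_dim, face_dim, n = 4, 16, 500
poses = rng.normal(size=(n, latent_dim))                # shared expressions
appearance_a = rng.normal(size=(latent_dim, face_dim))  # person A's look
appearance_b = rng.normal(size=(latent_dim, face_dim))  # person B's look
faces_a = poses @ appearance_a
faces_b = poses @ appearance_b

# Classic face-swap layout: ONE shared encoder, TWO per-person decoders.
# A real model learns these with gradient descent; in this linear toy we
# can fit them directly with least squares on both people's data.
X = np.vstack([faces_a, faces_b])                       # all training faces
Z = np.vstack([poses, poses])                           # shared latent targets
encoder, *_ = np.linalg.lstsq(X, Z, rcond=None)         # face -> latent
decoder_b, *_ = np.linalg.lstsq(poses, faces_b, rcond=None)  # latent -> B

# The swap: encode person A's face into the shared pose space, then decode
# with person B's decoder -> B's appearance performing A's expression.
swapped = faces_a @ encoder @ decoder_b
print(np.allclose(swapped, faces_b, atol=1e-6))         # True
```

The design point survives the simplification: because both people pass through one encoder, the latent space captures what they have in common (pose, expression), while each decoder holds what is specific to one identity. Swapping decoders is the whole trick.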
Creativity starts with copying. Elaboration follows duplication. It is no different with deepfakes, which will democratize access to creativity tools in entertainment, enable entirely new forms of content, and bring us closer to believable digital humans. That is why I think there is as much reason to be excited about the technology’s virtues as there is to be concerned about its vices.