This article is part of Demystifying AI, a series of posts that (try to) disambiguate the jargon and myths surrounding AI.
In 2018, a big fan of Nicolas Cage showed us what The Fellowship of the Ring would look like if Cage starred as Frodo, Aragorn, Gimli, and Legolas. The technology he used was deepfake, a type of application that uses artificial intelligence algorithms to manipulate videos.
Deepfakes are best known for their ability to swap one actor’s face onto another’s body in video. They first appeared in late 2017 and quickly rose to notoriety after being used to modify adult videos to feature the faces of Hollywood actors and politicians.
In the past couple of years, deepfakes have caused much concern about the rise of a new wave of AI-doctored videos that can spread fake news and enable forgers and scammers.
The “deep” in deepfake comes from the use of deep learning, the branch of AI that has become very popular in the past decade. Deep learning algorithms roughly mimic the experience-based learning capabilities of humans and animals. If you train them on enough examples of a task, they will be able to replicate it under specific conditions.
The basic idea is to train a set of artificial neural networks, the main component of deep learning algorithms, on multiple examples of the actor and target faces. With enough training, the neural networks will be able to create numerical representations of the features of each face. Then all you need to do is rewire the neural networks to map the face of the actor onto the target.
Deep learning algorithms come in different formats. Many people think deepfakes are created with generative adversarial networks (GAN), a deep learning algorithm that learns to generate realistic images from noise. And it is true, there are variations of GANs that can create deepfakes.
But the main type of neural network used in deepfakes is the “autoencoder.” An autoencoder is a special type of deep learning algorithm that performs two tasks. First, it encodes an input image into a small set of numerical values. (In reality, it could be any other type of data, but since we’re talking about deepfakes, we’ll stick to images.) The encoding is done through a series of layers that start with many variables and gradually become smaller until they reach a “bottleneck” layer. The bottleneck layer contains the target number of variables.
Next, the neural network decodes the data in the bottleneck layer and recreates the original image.
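The encode-then-decode structure can be sketched in a few lines of NumPy. This is a toy model with random, untrained weights and flat 64-pixel "images"; the layer sizes and function names are illustrative, not taken from any real deepfake tool.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": a flattened 8x8 face crop (64 pixels).
INPUT_DIM, HIDDEN_DIM, BOTTLENECK_DIM = 64, 32, 8

# Randomly initialized weights; a real model learns these from data.
W_enc1 = rng.normal(0, 0.1, (INPUT_DIM, HIDDEN_DIM))
W_enc2 = rng.normal(0, 0.1, (HIDDEN_DIM, BOTTLENECK_DIM))
W_dec1 = rng.normal(0, 0.1, (BOTTLENECK_DIM, HIDDEN_DIM))
W_dec2 = rng.normal(0, 0.1, (HIDDEN_DIM, INPUT_DIM))

def relu(x):
    return np.maximum(0, x)

def encode(image):
    """Squeeze the image down through the layers to the bottleneck."""
    return relu(relu(image @ W_enc1) @ W_enc2)

def decode(bottleneck):
    """Expand the bottleneck values back out to a full-size image."""
    return relu(bottleneck @ W_dec1) @ W_dec2

image = rng.random(INPUT_DIM)
code = encode(image)           # 8 values instead of 64
reconstruction = decode(code)  # back to 64 values
```

Note how the encoder shrinks 64 values to 8 and the decoder inflates them back; everything the decoder knows about the input has to pass through those 8 bottleneck values.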
During the training, the autoencoder is provided with a series of images. The goal of the training is to find a way to tune the parameters in the encoder and decoder layers so that the output image is as similar to the input image as possible.
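That training objective — make the output match the input — can be shown with a deliberately simple linear autoencoder and hand-derived gradients. Real deepfake models are far deeper and trained with frameworks like TensorFlow or PyTorch; the dimensions, learning rate, and step count below are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((100, 64))             # 100 toy "face crops", 64 pixels each

W_enc = rng.normal(0, 0.1, (64, 8))   # encoder: 64 pixels -> 8 values
W_dec = rng.normal(0, 0.1, (8, 64))   # decoder: 8 values -> 64 pixels

lr, losses = 0.01, []
for step in range(200):
    Z = X @ W_enc                     # bottleneck codes
    X_hat = Z @ W_dec                 # reconstructed images
    E = X_hat - X                     # reconstruction error
    losses.append(float(np.mean(E ** 2)))
    # Gradient descent on the mean squared reconstruction error.
    grad_dec = Z.T @ E * 2 / len(X)
    grad_enc = X.T @ (E @ W_dec.T) * 2 / len(X)
    W_enc -= lr * grad_enc
    W_dec -= lr * grad_dec
```

Each iteration nudges the encoder and decoder weights so that the reconstructions drift closer to the originals, which is exactly the tuning process described above.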
The narrower the problem domain, the more accurate the results of the autoencoder become. For instance, if you train an autoencoder only on images of your own face, the neural network will eventually find a way to encode the features of your face (mouth, eyes, nose, etc.) in a small set of numerical values and use them to recreate your image with high accuracy.
You can think of an autoencoder as a super-smart compression-decompression algorithm. For instance, you can run an image through the encoding half of the neural network and keep only the bottleneck representation for compact storage or fast network transfer. When you want to view the image, you run the encoded values through the decoding half to restore it to its original state.
But there are other things that the autoencoder can do. For instance, you can use it for noise reduction or generating new images.
Deepfake applications use a special configuration of autoencoders. In fact, a deepfake generator uses two autoencoders, one trained on the face of the actor and another trained on the target.
After the autoencoders are trained, you switch their outputs, and something interesting happens. The autoencoder of the target takes video frames of the target, and encodes the facial features into numerical values at the bottleneck layer. Then, those values are fed to the decoder layers of the actor autoencoder. What comes out is the face of the actor with the facial expression of the target.
In a nutshell, the autoencoder grabs the facial expression of one person and maps it onto the face of another person.
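The output-switching step can be sketched by wiring the target's encoder into the actor's decoder. The weights here are random stand-ins for trained ones, and all the names (`encode_target`, `decode_actor`, and so on) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
PIXELS, CODE = 64, 8

def make_autoencoder(seed):
    """Return an (encode, decode) pair with its own random weights."""
    r = np.random.default_rng(seed)
    W_enc = r.normal(0, 0.1, (PIXELS, CODE))
    W_dec = r.normal(0, 0.1, (CODE, PIXELS))
    return (lambda x: x @ W_enc), (lambda z: z @ W_dec)

encode_actor, decode_actor = make_autoencoder(0)    # "trained" on the actor
encode_target, decode_target = make_autoencoder(1)  # "trained" on the target

frame = rng.random(PIXELS)                  # one target video frame (toy)
expression_code = encode_target(frame)      # target's facial features
swapped_face = decode_actor(expression_code)  # rendered as the actor's face
```

The crucial line is the last one: the target's bottleneck values, which capture the expression, are handed to the actor's decoder, which only knows how to draw the actor's face.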
Training the deepfake autoencoder
The concept of deepfake is very simple. But training it requires considerable effort. Say you want to create a deepfake version of Forrest Gump that stars John Travolta instead of Tom Hanks.
First, you need to assemble the training dataset for the actor (John Travolta) and the target (Tom Hanks) autoencoders. This means gathering thousands of video frames of each person and cropping them to only show the face. Ideally, you’ll include images from different angles and lighting conditions so your neural networks can learn to encode and transfer different nuances of the faces and the environments. So, you can’t just take one video of each person and crop the video frames. You’ll have to use multiple videos. There are tools that automate the cropping process, but they’re not perfect and still require manual effort.
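At its core, the cropping step looks something like the sketch below. A real pipeline would get the bounding box for each frame from a face detector (such as the ones bundled with OpenCV or dlib) rather than the hard-coded region used here, and frames would come from an actual video file instead of random arrays.

```python
import numpy as np

rng = np.random.default_rng(3)

def crop_face(frame, box):
    """Crop a face region (top, left, height, width) out of a frame.
    In a real pipeline, `box` would come from a face detector run on
    each frame; here it is hard-coded for illustration."""
    top, left, h, w = box
    return frame[top:top + h, left:left + w]

# Pretend video: 10 frames of 128x128 grayscale pixels.
frames = [rng.random((128, 128)) for _ in range(10)]
dataset = [crop_face(f, (32, 32, 64, 64)) for f in frames]
```

The manual effort mentioned above comes in when the detector misses faces, crops the wrong person, or includes too much background — those frames have to be fixed or discarded by hand.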
The need for large datasets is why most deepfake videos you see target celebrities. You can’t create a deepfake of your neighbor unless you have hours of videos of them in different settings.
After gathering the datasets, you’ll have to train the neural networks. If you know how to code machine learning algorithms, you can create your own autoencoders. Alternatively, you can use a deepfake application such as Faceswap, which provides an intuitive user interface and shows the progress of the AI model as the training of the neural networks proceeds.
Depending on the type of hardware you use, the deepfake training and generation can take from several hours to several days. Once the process is over, you’ll have your deepfake video. Sometimes the result will not be optimal and even extending the training process won’t improve the quality. This can be due to bad training data or choosing the wrong configuration of your deep learning models. In this case, you’ll need to readjust the settings and restart the training from scratch.
In other cases, there are minor glitches and artifacts that can be smoothed out with some VFX work in Adobe After Effects.
In any case, at their current stage, deepfakes are not a one-click process. They’ve become a lot better, but they still require a good deal of manual effort.
Manipulated videos are nothing new. Movie studios have been using them in cinema for decades. But previously, they required tremendous effort from experts and access to expensive studio gear. Although still not trivial, deepfakes put video manipulation at everyone’s disposal. Basically, anyone who has a few hundred dollars to spare and the patience to go through the process can create a deepfake from their own basement.
Naturally, deepfakes have become a source of worry and are perceived as a threat to public trust. Government agencies, academic research labs, and social media companies are all engaged in efforts to build tools that can detect AI-doctored videos.
Facebook is looking into deepfake detection to prevent the spread of fake news on its social network. The Defense Advanced Research Projects Agency (DARPA), the research arm of the U.S. Department of Defense, has also launched an initiative to stop deepfakes and other automated disinformation tools. And Microsoft has recently launched a deepfake detection tool ahead of the 2020 U.S. presidential election.
AI researchers have already developed various tools to detect deepfakes. For instance, earlier deepfakes contained visual artifacts such as unblinking eyes and unnatural skin color variations. One tool flagged videos in which people didn’t blink or blinked at abnormal intervals.
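A blink-based check of that kind boils down to a simple interval test once blinks have been located. In the hypothetical sketch below, detecting the blinks themselves (an eye-tracking model running over the frames) is assumed to happen upstream, and the 15-second threshold is an illustrative pick, not a figure from any published tool.

```python
def looks_suspicious(blink_times, clip_length, max_gap=15.0):
    """Flag a clip whose subject never blinks, or goes unusually long
    without blinking. `blink_times` lists the seconds into the clip at
    which a blink was detected by some upstream eye-tracking step."""
    if not blink_times:
        return True  # no blinks at all in the clip
    times = [0.0] + sorted(blink_times) + [clip_length]
    gaps = [b - a for a, b in zip(times, times[1:])]
    return max(gaps) > max_gap
```

For example, a 30-second clip with blinks at regular intervals passes, while one with a single blink in the first seconds gets flagged. Heuristics like this stopped working once deepfake generators learned to reproduce natural blinking.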
Another more recent method uses deep learning algorithms to detect signs of manipulation at the edges of objects in images. A different approach is to use blockchain to establish a database of signatures of confirmed videos and apply deep learning to compare new videos against the ground truth.
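The signature-database idea can be illustrated with ordinary content hashes. A real system would record the signatures on a blockchain and, as described above, use deep learning to match videos that have been re-encoded or resized — something exact hashes cannot do, as the second check below shows.

```python
import hashlib

# Toy registry of "confirmed" videos, keyed by content hash.
registry = set()

def register(video_bytes):
    """Record the signature of a confirmed, unmanipulated video."""
    registry.add(hashlib.sha256(video_bytes).hexdigest())

def is_confirmed(video_bytes):
    """Check a video's signature against the ground-truth registry.
    Any change to the bytes -- malicious or just a re-encode --
    produces a different hash, hence the need for learned matching."""
    return hashlib.sha256(video_bytes).hexdigest() in registry

register(b"original footage")
```

Calling `is_confirmed(b"original footage")` succeeds, while any altered version of the bytes fails the lookup.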
But the fight against deepfakes has effectively turned into a cat-and-mouse chase. As deepfakes constantly get better, many of these tools lose their efficiency. As one computer vision professor told me last year: “I think deepfakes are almost like an arms race. Because people are producing increasingly convincing deepfakes, and someday it might become impossible to detect them.”