A threat to democracy (manipulation of public opinion, aggravation of social or community-based tensions, etc.), an invasion of a person’s privacy or a violation of their dignity, a risk of fraud or scams, or even a headache for future researchers seeking the truth: deep fakes, technologies that use deep learning to replace one person’s face with another’s in a video, are worrying. Ever more sophisticated, they are now available to almost anyone thanks to relatively easy-to-use tools.
A chat about these “weapons of mass falsification” with Vincent Nozick, a teacher and researcher at the LIGM and co-author of a publication presenting an efficient method to detect deep fakes (MesoNet: a Compact Facial Video Forgery Detection Network, Darius Afchar, Vincent Nozick, Junichi Yamagishi, Isao Echizen, 2018).
Deep learning is now being used to tamper with faces in videos. How do deep fake technologies work?
There are several methods for tampering with faces, some of which use deep learning, like Deepfake, which is one of the best known. Deepfake is a program that belongs to the family of GANs (Generative Adversarial Networks) and makes it possible to transfer facial expressions onto video. It is based on an autoencoder, which is made up of an encoder and a decoder.
An autoencoder is an artificial neural network that is fed with different photos of a person’s face in which the facial expression, the position, the lighting, the texture, the resolution, etc. change. The encoder is asked to encode these data in a limited number of parameters. In effect, an encoder is like a funnel: each layer is smaller than the previous one and contains fewer and fewer neurons. By the end of the encoder, there are only about a thousand neurons, which is very few compared to what we had to start off with.
We then ask the decoder, which is like an upside-down funnel (it starts off with a small number of neurons, the same as at the end of the encoder, and grows gradually), to use these one thousand parameters to generate a face that is identical to the original face. Training the network therefore means teaching it to reproduce a face as accurately as possible by itself. In fact, it learns to compress and decompress the face of one person in particular.
In the case of deep fakes, we take the faces of two different people, A and B. The great idea was to say: rather than A and B each having their own autoencoder, they will share the same encoder whilst each having their own individual decoder. During the production of a deep fake, the encoder will encode A’s facial data but, instead of decoding them with decoder A, we’re going to do it with decoder B. In doing this we put B’s face onto A.
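The wiring described above (one shared encoder, one decoder per person) can be sketched in a few lines of plain Python. This is only an illustration of the data flow, not the Deepfake program itself: the layer sizes (64, 32, 8) are hypothetical stand-ins for the roughly one thousand bottleneck neurons mentioned in the interview, the weights are random and untrained, and the layers are purely linear. All function and variable names are invented for this example.

```python
import random

def make_layer(n_in, n_out, rng):
    """Random weight matrix with n_out rows and n_in columns (untrained)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]

def apply_layer(weights, vector):
    """Multiply a weight matrix by a vector (a linear layer with no bias)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def run(layers, vector):
    """Pass a vector through a stack of layers, one after the other."""
    for weights in layers:
        vector = apply_layer(weights, vector)
    return vector

rng = random.Random(0)
INPUT_SIZE, HIDDEN_SIZE, CODE_SIZE = 64, 32, 8  # hypothetical sizes

# Shared encoder: the "funnel", 64 -> 32 -> 8 values.
shared_encoder = [make_layer(INPUT_SIZE, HIDDEN_SIZE, rng),
                  make_layer(HIDDEN_SIZE, CODE_SIZE, rng)]

# One decoder per person: the "upside-down funnel", 8 -> 32 -> 64 values.
decoder_a = [make_layer(CODE_SIZE, HIDDEN_SIZE, rng),
             make_layer(HIDDEN_SIZE, INPUT_SIZE, rng)]
decoder_b = [make_layer(CODE_SIZE, HIDDEN_SIZE, rng),
             make_layer(HIDDEN_SIZE, INPUT_SIZE, rng)]

def reconstruct(face, decoder):
    """Encode a face with the shared encoder, then decode it."""
    return run(decoder, run(shared_encoder, face))

def swap_a_to_b(face_a):
    """The deep fake step: encode A's face, decode it with B's decoder."""
    return reconstruct(face_a, decoder_b)

face_a = [rng.random() for _ in range(INPUT_SIZE)]  # stand-in for A's pixels
code = run(shared_encoder, face_a)
fake = swap_a_to_b(face_a)
print(len(code), len(fake))  # 8 64
```

In a real system, training would adjust the weights so that reconstruct(face, decoder_a) reproduces A’s faces and reconstruct(face, decoder_b) reproduces B’s; only then does swapping the decoders produce a convincing face.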
Deep fakes are often presented in a way that is alarming. Is it really that easy to use this technology? Is downloading an application all it takes to create a deep fake?
It’s true that there are applications, notably FakeApp (based on TensorFlow, an open source machine learning tool developed by Google, Ed.). No IT knowledge is needed to use it; you just follow a few key steps. You start by creating a database of the source and target people. Anyone can do this. If I want to create a fake video of someone, I collect as many good-quality photos and videos of this person as possible, with a wide variety of facial expressions, lighting, etc. Then some parameters must be chosen. At this stage, people who are used to creating deep fakes say that experience plays an important part in identifying the most relevant parameters. The software must then be run and trained on a machine with a good graphics card. This takes around half a day and only requires a small investment of time. It would therefore be very easy for a team of professionals who wished, for example, to manipulate an election. Creating a malicious deep fake takes a lot less effort than many other more sophisticated attacks.
You write that most forensics techniques (which rely on mathematical tools with no machine learning) that are used to analyse images and detect fakes are ineffective on videos. Why is that?
The problem is that a still image, a photo for example, contains a lot of information, in particular “noise” (slight imperfections of the image). A video is a succession of images, but with a lot less “noise”, because it has been absorbed by very heavy compression. If they weren’t compressed so much, video files would be much too large… For video there are many ways to compress, often quite different from one another, whereas for still images it is JPEG 99.9% of the time. So there isn’t just one standard way of doing this, which makes it highly complex to develop a detection method that works on all videos.
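How heavy compression “absorbs” noise can be illustrated with a deliberately simplified sketch. Real video codecs quantize transform coefficients rather than raw pixel values, so the uniform pixel quantization below is only a stand-in; the step size, the pixel values and the noise range are all hypothetical and chosen so that the effect is easy to see.

```python
import random

STEP = 8  # hypothetical quantization step standing in for lossy compression

def compress(pixels, step=STEP):
    """Snap every pixel value to the nearest multiple of `step`."""
    return [round(p / step) * step for p in pixels]

rng = random.Random(42)

# A clean row of pixels (values chosen as multiples of STEP for clarity).
clean = [96, 96, 104, 104, 112, 112, 120, 120]

# The same row as a sensor might record it: small noise of at most +/- 2.
noisy = [p + rng.randint(-2, 2) for p in clean]

compressed = compress(noisy)

noise_before = [n - c for n, c in zip(noisy, clean)]
noise_after = [q - c for q, c in zip(compressed, clean)]

print(noise_before)  # small values between -2 and 2
print(noise_after)   # all zeros: the noise is smaller than the step and is lost
```

A forensic method that relies on the fine structure of the noise has nothing left to analyse once the noise has been rounded away in this manner.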
With MesoNet, you propose to use artificial neural networks at the mesoscopic scale. How does this deep fake detection tool work?
In our previous work, we used deep learning to distinguish synthetic images from photos. An image’s “noise” turned out to be a very good indicator. This “noise” is observed at the level of the pixel, at the microscopic scale. But in the case of deep fakes, there is hardly any left because of the video compression.
As for the macroscopic level, this consists in analysing an image as a whole to see if it represents a human, an animal, a building, etc. This is image semantics. In this case, we already know that it is a face. Our starting point therefore was to say: the microscopic scale doesn’t work, we aren’t interested in the macroscopic scale, so let’s go in between these, at the mesoscopic level; let’s take the image’s parts, not a single pixel or the entire image.
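The three scales can be pictured with a small sketch. It is not how MesoNet is implemented (a neural network learns its own intermediate features); it only shows what looking at a pixel, at parts of the image, or at the whole image means. The 8x8 image and the 4x4 patch size are hypothetical, and extract_patches is a name invented for this example.

```python
def extract_patches(image, patch_size):
    """Split a 2D image (list of rows) into non-overlapping square patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches

# Hypothetical 8x8 grayscale face crop; each value encodes the pixel position.
image = [[y * 8 + x for x in range(8)] for y in range(8)]

microscopic = image[0][0]               # a single pixel
mesoscopic = extract_patches(image, 4)  # parts of the image
macroscopic = image                     # the image as a whole

print(len(mesoscopic))   # 4 patches of 4x4 pixels
print(mesoscopic[1][0])  # first row of the top-right patch: [4, 5, 6, 7]
```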
This helped us to define our neural network. We knew that we had to feed it with mesoscopic data. From there, we built a database and designed the network. After carrying out several tests, we were surprised to note that medium-sized networks worked better than deeper networks.
When modifying our network to increase its performance we noticed that each time we shortened it, it didn’t get weaker and sometimes it got better. The network that worked best was therefore quite short in the end, with around twenty layers.
This has two benefits. The first is that it is easy to train: this takes about two hours on a normal machine. Once it has been trained, we can use it on a low-power machine, like a smartphone. The second benefit is that we can explore and study it. Thus, by analysing the various layers of our network, we realised that the eyes play an important role in the detection of deep fakes.
The recent example of the deep fake of Donald Trump giving Belgium advice on climate change shows that the solution for detecting deep fakes isn’t only technological. In a context where deep fakes are becoming more and more sophisticated, how can human beings be “trained” to spot them, or at least to be wary of them?
When you check your social networks, if you see a photo that seems totally unrealistic, do you believe that it is genuine or do you think straight away that it is a fake? I think that most people have learnt to be wary of images. It is something that took some time, but that has become automatic. For the moment, we tend to believe what we see in videos that appear on social networks. But as with images, we will learn to remain vigilant. A period of education will certainly be necessary, be it via self-education or via awareness campaigns. I find it interesting that some schools offer lessons in “zététique” (zetetics), i.e. teaching that fosters critical reflection and investigation. It is about teaching children to develop their critical thinking and to exercise caution regarding the information they receive, whilst striking a balance between believing everything they are told and remaining completely sceptical.