“It took only two weeks and 552 USD for me to make a deepfake video by myself.”

Tram Ho

Based on an article published in Ars Technica by Timothy B. Lee, a technology reporter who holds a master's degree in computer science.

Deepfake technology uses a deep (multilayer) artificial neural network to replace one person's face with another's in a video. It is easy to use for malicious purposes, and it is becoming more and more widespread. Many articles have already discussed the impact that deepfakes could have on society.

But this is not one of those articles. Instead, we will look more closely at the technology itself: How does deepfake software work? Is it difficult to use? And most importantly, how convincing are the results?

I thought the best way to answer these questions was to make a deepfake myself. I spent a few days experimenting with deepfake software, with a $1,000 budget for renting cloud servers. A few weeks later, here is the result: I replaced Mark Zuckerberg's face in a video of his testimony before Congress with the face of Lieutenant Commander Data (played by Brent Spiner) from Star Trek: The Next Generation. In total, I spent 552 USD.

The video above is not perfect: many details of Data's face are lost and the whole face still looks quite fake. What is surprising, however, is that a novice like me can create a video with someone else's face in a short time and at relatively low cost. There is reason to believe that this technology will become better, faster and much cheaper in the future.

In this article, I will take you along on my deepfake journey and explain each step needed to create a face-swap video, including how the technology works and what its limitations are.

Deepfakes need a powerful computer and a lot of data

The name deepfake comes from deep neural networks. Over the past decade, computer scientists have found that artificial neural networks become more powerful as more layers of neurons are added. But training such a multilayer network requires a huge amount of data and a powerful computer.

So, to make the video, I had to rent a virtual machine with four expensive graphics cards. Even so, it still took me about a week to train the model on a large amount of data.

I also needed to gather many images of Mark Zuckerberg's and Data's faces. The clip above is only 38 seconds long, but producing it required far more footage of both of them for the computer to train on. I downloaded fourteen video clips from Star Trek: The Next Generation featuring Data and nine featuring Mark Zuckerberg. The Zuckerberg videos include speeches and interviews, and even a clip of him grilling meat in his backyard.

I loaded all of this footage into iMovie and cut out the segments that did not show either face. I also trimmed the long videos. Deepfake software does not just need many images; it needs many different images. It needs to see the face from multiple angles, with a variety of expressions and under different lighting conditions.

A one-hour video of a Mark Zuckerberg speech may be worth no more than a five-minute excerpt of the same video, because throughout the 60 minutes the footage has the same angle, the same lighting and the same expression. So hours of video were cut down to nine minutes for Data and seven minutes for Mark Zuckerberg.

Faceswap: deepfake software that takes care of everything

Then it was time to use deepfake software. I originally used a program called DeepFaceLab and created a fairly rough video. But when I posted it on a subreddit called SFWdeepfakes, several people advised me to use Faceswap instead. They said the program has more features, better documentation and great online support. So I took their advice and switched to Faceswap.

Faceswap runs on Linux, Windows and Mac. It includes all the tools needed to create a face-swap video, from importing the video you want to alter to exporting the finished video. The software is a bit confusing to use, but fortunately it comes with a very detailed guide that explains each step of the process. The tutorial was written by Faceswap's creator, Matt Tora, who also helped me a lot when we chatted on the Deepfake Discord channel.

Faceswap requires a powerful graphics card, and I knew from the outset that my six-year-old MacBook Pro would not be enough. So I rented a Linux virtual machine from a leading cloud provider. I started with a machine that had an Nvidia K80 card with 12 GB of memory. After a few days, I upgraded to a model with two cards and then to one with four. The final machine had four Nvidia T4 Tensor Core cards with 16 GB of memory each (it also had 48 vCPUs and 192 GB of RAM, which went largely unused because neural network training depends mainly on the graphics cards).

Over two weeks, the rental bill came to 552 USD. Matt Tora told me that the best graphics cards for deepfakes at the moment are the Nvidia GTX 1070 and 1080 with at least 8 GB of memory. I could buy a card like that for a few hundred dollars. A single 1080 will not be as fast as the four cards I used, but given more training time it will produce the same results.

The process of using Faceswap involves three steps:

– Extraction: split the video into individual frames, detect the faces in each frame and crop them out.

– Training: use the extracted images to train a multilayer neural network. The goal is a network that can take an image of one person's face and replace it with an image of another person's face with the same pose, expression and lighting.

– Conversion: apply the network from the previous step to a clip to create a deepfake video. Once the neural network has been trained, it can be applied to any video featuring the two people it was trained on.

These three steps differ greatly in how much time they demand from the user and from the computer. The software extracts faces from the video in just a few minutes, but it then takes several hours of human work to check the results. The software picks up every face in the frames, whether or not it belongs to one of the two people being swapped. To get good results, the user must go through the cropped images one by one and delete faces that are not the subjects, along with any images the software mistook for faces.

By contrast, training is easy to set up and needs little supervision, but it takes days or even weeks of computer time to produce good results. I trained my final model from December 7 to December 13. If I had let it run for another week, the quality of my deepfake video would have been even better. And that was with a monster virtual machine with four high-end graphics cards. On a personal computer with a weaker graphics card, training a good model can take weeks.

The final step, conversion, takes little time and demands very little of either the user or the computer. Once you have a well-trained network, exporting a face-swap video can take less than a minute.

How does the deepfake program work?

Before describing my own experiments with Faceswap, let me first explain how this kind of technology works.

The core of Faceswap, like other deepfake software, is an autoencoder. An autoencoder is a neural network trained to compress an input image and then output an image that closely resembles it. That may not sound useful in itself, but it is the key building block for creating a face-swap video.

An autoencoder is built like two funnels joined at their narrow ends. One side is the encoder, which takes an image and squeezes it down into a small set of numbers; in the Faceswap model I used, that is 1024 32-bit floating-point values. The other side is the decoder. It takes this compressed representation, known as the "latent space," and expands it back into something resembling the original image.

By limiting how much data the encoder can pass to the decoder, the software forces the two sides to develop a compact representation of a human face. You can think of the encoder as a lossy compression algorithm: it tries to capture as much information about a face as possible in a limited amount of storage. The latent space must capture details such as which direction the subject is facing, whether the eyes are open or closed, and whether the subject is frowning or smiling.

Most importantly, the encoder only needs to record the aspects of a person's face that change over time. It does not need to record fixed details such as eye color or nose shape. If Mark Zuckerberg has blue eyes in every photo, the decoder will learn to output blue eyes automatically, without the encoder spending any of the cramped latent space on that information. This is an important factor in creating deepfake videos.
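To get a feel for how tight this bottleneck is, the short sketch below compares the number of values in a square face image with the 1024 values quoted above. It assumes three color channels per pixel and, purely for illustration, the same 1024-value bottleneck at every resolution; the function names are hypothetical and are not part of Faceswap.

```python
LATENT_VALUES = 1024  # bottleneck size quoted in the article


def input_values(side, channels=3):
    """Number of values in a square image (assumes RGB, 3 channels)."""
    return side * side * channels


def compression_ratio(side, latent=LATENT_VALUES, channels=3):
    """How many input values are squeezed into each bottleneck value."""
    return input_values(side, channels) / latent


for side in (64, 128, 256):
    print(side, input_values(side), compression_ratio(side))
```

Under these assumptions, every doubling of the side length quadruples the amount of detail that has to pass through the same narrow neck.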

Every algorithm for training a neural network needs some way to measure how well the network is performing, so that it can improve. In many cases this is done with supervised learning, where the correct answer is provided for each item in the training set. An autoencoder is different: because it simply tries to reproduce its own input, the training software can judge its performance automatically. In the jargon, this is called unsupervised learning.

Like every other neural network, Faceswap's autoencoder is trained with the backpropagation algorithm. The training algorithm feeds a specific image to the network and works out which pixels of the output do not match the input. It then calculates how much each neuron in the last layer contributed to the error and adjusts that neuron so that the output improves.

The errors are then propagated back to the previous layer, whose neurons are adjusted in turn. The process continues backward, layer by layer, until every parameter of the neural network, from the output side to the input side, has been adjusted. Once it has finished with one image, the training algorithm feeds the network another image and the process starts again. It can take hundreds of thousands of such iterations before the autoencoder works well.
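The layer-by-layer correction described above (commonly called backpropagation) can be sketched on a toy scale. The example below is a minimal illustration, not Faceswap's code: it assumes a purely linear encoder and decoder, 4-value "images" instead of photos, a 2-value bottleneck, and a squared-error measure of the mismatch. All names and sizes are made up for the example.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

N_IN, N_LATENT = 4, 2   # toy sizes; real models use images and a far larger bottleneck
LEARNING_RATE = 0.05


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def make_sample():
    # A 4-value "image" that really has only 2 degrees of freedom.
    a, b = random.uniform(-1, 1), random.uniform(-1, 1)
    return [a, b, a, b]


data = [make_sample() for _ in range(20)]
enc = rand_matrix(N_LATENT, N_IN)   # encoder weights
dec = rand_matrix(N_IN, N_LATENT)   # decoder weights


def loss(x):
    """Mean squared difference between an input and its reconstruction."""
    y = matvec(dec, matvec(enc, x))
    return sum((yi - xi) ** 2 for yi, xi in zip(y, x)) / N_IN


def mean_loss():
    return sum(loss(x) for x in data) / len(data)


def train_step(x):
    z = matvec(enc, x)                      # forward pass: encode
    y = matvec(dec, z)                      # forward pass: decode
    # Error signal at the output layer.
    grad_y = [2 * (yi - xi) / N_IN for yi, xi in zip(y, x)]
    # Propagate the error back through the decoder to the bottleneck.
    grad_z = [sum(grad_y[i] * dec[i][j] for i in range(N_IN))
              for j in range(N_LATENT)]
    for i in range(N_IN):                   # adjust the last layer (decoder)
        for j in range(N_LATENT):
            dec[i][j] -= LEARNING_RATE * grad_y[i] * z[j]
    for j in range(N_LATENT):               # then the earlier layer (encoder)
        for k in range(N_IN):
            enc[j][k] -= LEARNING_RATE * grad_z[j] * x[k]


initial_loss = mean_loss()
for _ in range(500):                        # many passes over the same images
    for x in data:
        train_step(x)
final_loss = mean_loss()
print(initial_loss, final_loss)
```

No labels are supplied anywhere: the input itself is the target, which is why this kind of training needs no human-provided answers.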

Deepfake software works by training two autoencoders in parallel, one for the original face and one for the replacement face. Each autoencoder sees only photos of one person and is trained to reproduce images that closely resemble its input.

The crucial point, however, is that both networks use the same encoder. Only the decoders, the neurons on the right-hand side, are separate, each trained to recreate a different face. The neurons on the left, because they are shared, are updated by both training processes. Whenever the Zuckerberg network is trained on a Zuckerberg face, the change also affects the Data network. And each time the Data network is trained on Data's face, the Zuckerberg network inherits those changes too.

As a result, the two autoencoders share an encoder capable of reading both faces, Zuckerberg's and Data's. The goal is for the encoder to represent aspects such as the angle of the face and the position of the eyebrows in the same way, whether the face is Zuckerberg's or Data's. That means that once an image has been compressed by the encoder, either of the two decoders can be used to expand it into the final image.

Once you have trained the two autoencoders this way, the next step is quite simple: you swap decoders. You encode an image of Mark Zuckerberg but use the Data decoder to produce the output. The result is a picture of Data with Mark Zuckerberg's facial expression.

Remember that the latent space only captures the changing aspects of a face, such as expression, head direction and eyebrow position, while fixed details such as eye color or nose shape are reproduced by the decoder. So if we encode Mark Zuckerberg's face and decode it with the Data decoder, we get a face with Data's fixed features (like the shape of the face) but with Mark Zuckerberg's expression and orientation.

If we apply this technique to every frame of a video of Mark Zuckerberg, we get a completely new video with Data's face performing Zuckerberg's actions, such as smiling, blinking or turning his head.
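The decoder swap can be illustrated with a deliberately simplified sketch. Here the encoder and decoders are hand-written functions rather than trained networks, a "face" is a small dictionary, and the identity values (other than Zuckerberg's blue eyes, mentioned earlier) are placeholders chosen for illustration. It only shows the data flow: one shared encoder, two decoders, and a swap at conversion time.

```python
# Fixed, per-person details that (in this toy) live inside each decoder.
# These values are illustrative placeholders, not measurements.
ZUCKERBERG_IDENTITY = {"eye_color": "blue", "face_shape": "shape_z"}
DATA_IDENTITY = {"eye_color": "yellow", "face_shape": "shape_d"}


def encode(face):
    """Shared encoder: keep only what changes from frame to frame."""
    return {"head_angle": face["head_angle"], "smile": face["smile"]}


def make_decoder(identity):
    """Build a decoder that always paints the same person's fixed features."""
    def decode(compressed):
        face = dict(identity)      # fixed features come from the decoder
        face.update(compressed)    # pose and expression come from the encoder
        return face
    return decode


decode_zuckerberg = make_decoder(ZUCKERBERG_IDENTITY)
decode_data = make_decoder(DATA_IDENTITY)

# Two frames of a hypothetical Zuckerberg video.
frames = [
    {"eye_color": "blue", "face_shape": "shape_z", "head_angle": 15, "smile": 0.8},
    {"eye_color": "blue", "face_shape": "shape_z", "head_angle": 20, "smile": 0.1},
]

# Normal use: encode and decode with the matching decoder.
reconstructed = [decode_zuckerberg(encode(f)) for f in frames]

# Face swap: same encoder, but the other person's decoder, frame by frame.
swapped = [decode_data(encode(f)) for f in frames]
print(swapped[0])
```

In a real system both halves are learned from data, so the split between "fixed" and "changing" features is never this clean, which is one way details of one face can leak into the other.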

What is more, the situation is symmetrical. Having trained a network that encodes Zuckerberg's face and decodes it as Data's, we can equally encode Data's face and decode it as Zuckerberg's. In Faceswap, the final conversion step has a checkbox called "swap model" that tells the software to switch decoders. So instead of putting Data's face on Zuckerberg, we can do the opposite. The result is the video below:

The robot, called Data, has a Zuckerberg face, and confesses that the machine also knows.

Training data for the neural network

In practice, getting good results from deepfake software is not easy.

As mentioned, I collected nine minutes of footage of Data and seven minutes of Mark Zuckerberg. I then used Faceswap's extraction tool to split the video into frames and crop out the two faces. The video runs at about 30 frames per second, but I kept only one frame in every six, as the Faceswap guide recommends. The reason is that variety among the photos matters more than their number, and extracting every frame would just give me a pile of nearly identical photos.
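Keeping one frame in every six is simple to express in code. The sketch below is only an illustration of the sampling idea, using the 30 frames per second and the step of 6 mentioned above; the function names are hypothetical and it does not use Faceswap's own extraction tool.

```python
def sample_every_nth(frames, step=6):
    """Keep the first frame and every `step`-th frame after it."""
    return frames[::step]


def sampled_count(duration_seconds, fps=30, step=6):
    """How many frames survive sampling for a clip of the given length."""
    total_frames = duration_seconds * fps
    return len(range(0, total_frames, step))


# Frame indices stand in for actual images here.
print(sample_every_nth(list(range(12))))
print(sampled_count(60))
```

At 30 frames per second this works out to five sampled frames for each second of footage, before any wrongly detected faces are deleted.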

Faceswap's extraction tool makes quite a few mistakes. It sometimes picked up the faces of people standing behind Mr. Zuckerberg. So I had to spend hours manually deleting the images that did not belong to my two subjects. At the end of the process, I had 2,598 photos of Data and 2,224 photos of Mark Zuckerberg. At this point, I was ready to train my deepfake model.

Faceswap currently offers ten algorithms that support different image sizes and require different amounts of computing power. At the low end, there is a "lightweight" algorithm that works on face images 64 pixels on a side and can run on a graphics card with only 2 GB of VRAM. Other algorithms work on images 128, 256 or even 512 pixels on a side; these require much more memory and take much longer to train.

I originally used an algorithm called DFL-SAE, derived from the DeepFaceLab software. However, the Faceswap guide warns that this algorithm suffers from "identity bleeding," meaning that details of one face can leak into the other. So after a day, I switched to another algorithm called Villain, which works on 128-pixel images. The guide describes it as very demanding of VRAM and as a good option if you want a high-resolution network without having to edit the settings.

Then I waited. Training was still running after six long days, but my editor's deadline had arrived. By then, my network could swap the faces reasonably well. Progress had slowed, but if I had let it run for another week, the result would have been noticeably better.

Faceswap is well designed for long-running jobs. If you train the network from the graphical user interface, the interface regularly updates a preview of the swapped faces. If you prefer to train from the command line, that works too: the Faceswap interface can generate the exact command you need to train your neural network with the current settings.

How good is the face-swapping technology?

During training, Faceswap continuously displays a number called the loss for each of the two autoencoders. These figures indicate how well the network can recreate images of Zuckerberg's and Data's faces. The loss was still falling when I stopped training at the deadline, although it was falling much more slowly than at the start.
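As a rough idea of what such a number measures, the sketch below scores a reconstruction by its average absolute pixel difference from the original. This particular formula is an assumption made for illustration; the measure Faceswap actually reports may be computed differently. The tiny 4-pixel "faces" are made up.

```python
def mean_absolute_error(original, reconstruction):
    """Average absolute difference between two equally sized pixel lists."""
    if len(original) != len(reconstruction):
        raise ValueError("images must have the same number of pixels")
    total = sum(abs(o - r) for o, r in zip(original, reconstruction))
    return total / len(original)


# Hypothetical 4-pixel grayscale "faces" with values in [0, 1].
face = [0.0, 0.5, 0.5, 1.0]
early_output = [0.5, 0.5, 0.5, 0.5]    # blurry guess early in training
later_output = [0.0, 0.5, 0.5, 0.75]   # closer match later on

early_loss = mean_absolute_error(face, early_output)
later_loss = mean_absolute_error(face, later_output)
print(early_loss, later_loss)  # 0.25 0.0625
```

A score like this needs the original image to compare against, so it can only be computed when a face is encoded and decoded as the same person.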

Of course, what really matters is how well the Data decoder can convert Zuckerberg's face into Data's. We do not know what those converted images are supposed to look like, so there is no way to measure their quality precisely. We can only look at the result and decide whether it looks real.

The video above shows the quality of the face swap at four different points in time. The December 10 and December 12 clips show the face as trained by the Villain algorithm. The December 6 clip is a test trained with a different algorithm. And in the lower right corner is the final result. The longer the training, the clearer and more lifelike the facial details become.

On December 9, after three days of training, I posted a deepfake video to the Ars Technica Slack. The clip looked much like the December 10 result in the video above. Ars Technica's graphics guru, Aurich Lawson, was not impressed.

"Overall it's very bad," he wrote, describing it as "not convincing. I have never seen a deepfake video that doesn't look fake."

I think he was partly right. I was surprised by how quickly Faceswap could replace Zuckerberg's face with Data's. But if you look closely, the signs that the video has been edited are obvious.

In some frames, Data's face does not line up with Zuckerberg's head. Zuckerberg's eyebrows occasionally peek out from under Data's face. Elsewhere, the edge of Data's face covers part of Zuckerberg's ear. These problems can be fixed if the user spends time adjusting the video by hand: someone has to go through it frame by frame and adjust the face to fit.


A more fundamental problem, however, is that deepfake algorithms are not very good at reproducing fine facial details. You can see this clearly by comparing the original video with the swapped one. Faceswap gets the basic structure of Data's face right. But even after a week of training, the face still looks blurry and important details are missing. For example, deepfake software seems to have trouble rendering human teeth consistently. Teeth are visible in one frame, but a few frames later the mouth is a black void with no teeth at all.

A major cause is that face swapping becomes much harder at higher resolutions. An autoencoder reproduces a 64×64-pixel image quite well. But recreating detail at 128×128 resolution, let alone 256 pixels or more, is a huge challenge. This is probably why the most impressive deepfake videos tend to use wide shots rather than close-ups of someone's face.

But do not assume that this is a permanent limit of face-swapping technology. In the future, researchers are likely to develop techniques that overcome it.

Deepfake software is often mistakenly described as being based on generative adversarial networks (GANs), a type of neural network that lets software "imagine" people, objects or scenes that do not exist. Deepfakes are actually based on autoencoders, not GANs. But recent advances in GAN technology suggest that there is plenty of room for deepfakes to improve.

When GANs were first introduced in 2014, they could only produce rough, low-resolution images. More recently, researchers have figured out how to design GANs that produce realistic images at resolutions up to 1024 pixels. The specific techniques used in GANs may not carry over to autoencoders, but someone may develop similar techniques for autoencoders, or even an entirely new neural network architecture designed for face swapping.

Watch out for deepfakes

The rise of face-swapping technology is steadily becoming a cause for concern. Until recently, we could generally trust video that showed someone's face. But with deepfakes and other digital tools, we must now be skeptical about the authenticity of any photo or video. If we see a video of someone saying scandalous things, or taking their clothes off, we must consider the possibility that someone has deliberately tried to harm the person in the video using face-swapping technology.

But my experiment also shows the limitations of deepfake technology as it stands today. It takes a lot of knowledge and effort to create a convincing fake face. I certainly did not manage it, and I am not sure anyone has yet created a deepfake video that viewers cannot tell from the real thing.

Moreover, tools like Faceswap only swap faces. They do not change the forehead, hair, arms or legs. So even if the face looks perfect, we may still be able to tell whether a video is real or fake from other cues.

However, these limitations will most likely disappear. In just a few years, software may be able to swap someone's face so well that viewers cannot spot the fake. What happens then?

In that case, I think we should remember that many other kinds of media have been easy to fake for a long time. For example, it is trivial to create a screenshot of an email whose content is completely fabricated. But lives have not been ruined by a fake email alone. And email screenshots can still serve as evidence in public debate.

People know that email can be faked, so they look for corroborating evidence. How did the email come to light? Did anyone else receive a copy at the time it was written? Does the purported author admit to writing it, or insist that it is fake? Questions like these help people determine whether an email is authentic.

Fool me once

The same goes for videos. There is a small chance that a fraudster could ruin someone's life by circulating a video of them saying or doing outrageous things. But very soon, the public will learn to doubt what they see in a video. They will learn to examine it carefully, looking for witnesses, the chain of events or any other means of authentication.

I think this is true even for the most vile abuse of deepfake technology: inserting someone's face into a pornographic video. This is clearly wrong and degrading to the person targeted. Many people have tried to raise public awareness that videos like this can destroy anyone's reputation and career. But I think that view is not quite right.

After all, even now the internet is full of fake Photoshopped images that put celebrities' faces on the bodies of porn performers. This is naturally very distressing for the women involved. But the public never immediately concludes that the person pictured actually posed for those nude photos, simply because we know that Photoshop exists and can be used to fake images.

The same is true of deepfake porn videos. Of course it is no fun to be the person whose face has been inserted into a pornographic video. But the appearance of such a fake video is far from being as serious as the leak of a real compromising video of you. Because the fake lacks supporting evidence and authenticity, the public will easily recognize it for what it is.

Matt Tora, the developer of Faceswap, told me that this reasoning is a major part of his motivation for creating face-swapping software. He believes that the development of face-swapping software is inevitable. He hopes that creating a user-friendly, open-source tool will help demystify the technology and educate the public about its capabilities and limitations. That, in turn, will help society learn to question the authenticity of every video.

In the long run, this could cause the public to lose all trust in evidence that exists in the form of video. And given how fast the technology is developing, and how public awareness keeps growing, that could well come true.

Source : Trí Thức Trẻ