The reminder not to believe everything we see has never been more relevant than now. In the latest video manipulation advance, researchers have worked out how to turn written text into a realistic video of someone saying those very words.
We’ve seen this sort of deepfake trickery before, where clips of famous people are run through AI algorithms to get them to say just about anything, but the technology keeps on improving – and getting more difficult to detect.
The new system lets users make edits to the written transcript of a video, and then has those edits spoken back through layers of digital manipulation. It’s quite literally putting words in people’s mouths.
According to the researchers, this could be used to fix small problems in an acting performance in post-production, but they also acknowledge that their creation could be used for more sinister purposes.
“This technology is really about better storytelling,” says computer scientist Ohad Fried, from Stanford University. “Visually, it’s seamless. There’s no need to re-record anything.”
The newly created algorithm uses machine learning techniques to match the sounds of words in a transcript with mouth movements already present in the source footage of a talking head. At the moment it has only been tested on videos showing people from the shoulders up, and it needs at least 40 minutes of sample footage to create a realistic-looking fake.
A smoothing step then blends the selected mouth movements together so the speech looks natural, and the resulting 3D face model goes through a process known as neural rendering: this uses neural networks to bridge the gap between the 3D model and a photorealistic face.
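The lookup-and-blend idea can be sketched in a few lines of toy code. This is only a loose illustration under stated assumptions – the real system operates on 3D face parameters and a neural renderer, not raw frame indices, and every name here (the phoneme index, `stitch_snippets`, `smooth_seams`) is hypothetical:

```python
# Toy sketch: for each sound in the edited transcript, look up a
# snippet of source footage where the speaker makes that mouth shape,
# concatenate the snippets, then soften the seams between them.

from typing import Dict, List, Tuple

# Hypothetical index: phoneme -> (start_frame, end_frame) in the
# source video where the speaker utters that sound.
PhonemeIndex = Dict[str, Tuple[int, int]]

def stitch_snippets(phonemes: List[str], index: PhonemeIndex) -> List[int]:
    """Concatenate the frame ranges matching each phoneme in the edit."""
    frames: List[int] = []
    for p in phonemes:
        start, end = index[p]
        frames.extend(range(start, end))
    return frames

def smooth_seams(frames: List[int], window: int = 2) -> List[float]:
    """Crude stand-in for the smoothing step: average each frame index
    with its neighbours so hard cuts between snippets are softened."""
    out: List[float] = []
    for i in range(len(frames)):
        lo = max(0, i - window)
        hi = min(len(frames), i + window + 1)
        out.append(sum(frames[lo:hi]) / (hi - lo))
    return out

# Example: assemble the word "hello" from four invented phoneme snippets.
index: PhonemeIndex = {"HH": (10, 13), "EH": (40, 44), "L": (70, 72), "OW": (90, 95)}
raw = stitch_snippets(["HH", "EH", "L", "OW"], index)
print(raw[:5])
```

In the actual research the equivalent of `smooth_seams` blends head-pose and expression parameters rather than frame numbers, and the blended model is handed to the neural renderer to produce the final photorealistic frames.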
When shown to 138 volunteers, the finished AI-produced videos were rated as “real” almost 60 percent of the time.
In its most advanced form, what we have here is an ability to edit a talking head video almost as easily as you might edit a Word document, and the technology is only going to get more powerful and more accurate as time goes on. It can also be used with synthesised voices and to translate speeches between languages.
As we’ve seen over the last couple of years, various projects are now able to develop realistic-looking talking heads reading from a script, and soon the people on video may not even have to exist – AI can generate them too, given enough training data.
So what about potential misuse of the technology?
The researchers say they have considered this, although their solutions may not sound entirely convincing to everyone. They suggest that watermarking systems, improved forensic analysis, and better education and awareness around video manipulation could help viewers develop a healthy scepticism about the authenticity of video clips.
To bolster their case, they note that we’ve already learned to live with this when it comes to photo editing – knowing that pictures can be manipulated and faked to a very high standard.
“Unfortunately, technologies like this will always attract bad actors,” says Fried. “But the struggle is worth it given the many creative video editing and content creation applications this enables.”
The research is due to be presented at SIGGRAPH 2019 and published in ACM Transactions on Graphics; it can be viewed online at the pre-print server arXiv.org.