Deepfakes, Blockchains, and Factom

Introduction

Living in a world where it’s impossible to tell whether or not a recorded video is real sounds like a nightmare, but with the advent of Deepfake, many have heralded the arrival of exactly that world. The question of what to do about it is asked almost daily in the Factom Protocol community, but the answers, both in our community and elsewhere, have been sparse.

Tackling Deepfakes is an extraordinarily difficult problem and, unfortunately, I have no easy answers. I do, however, have some expertise and a lot of interest in the area. The goal of this blog is to present the full scope of the problem, of which Deepfakes is only the latest iteration, and explore the role of blockchains (and Factom in particular) in these areas.

What is Deepfake

The TL;DR is that Deepfake is a deep learning algorithm that is capable of learning an individual’s gestures and speech patterns from a video and then synthesizing new movements or superimposing them on a different video. It is not an entirely new technology, however.

In the 1994 movie Forrest Gump, the titular character meets and interacts with President John F. Kennedy, over thirty years after the latter’s untimely demise. The effect was achieved by filming Tom Hanks in front of a blue screen and placing him inside of heavily edited archival footage. [Link] Despite being artistic in nature, this is very similar to Deepfake: the scene was created not with actors and makeup but by manually compositing existing video. We know it’s fake because Tom Hanks, who was seven years old in the year JFK died, could not possibly have filmed that scene and we generally hold the expectation that movies are not real.

This practice is very common in Hollywood, with famous deceased people brought back to life and other impossible things like spaceships battling in space. Even before modern processing power was able to create entire worlds, artists have manipulated video with rotoscoping, invented in 1915, and matte painting, 1907. The history of motion picture special effects goes all the way back to 1895.

Left: Actress Loren Peta with motion tracking dots
Middle: A computer render of Sean Young
Right: The end result, a very convincing digital fake of Sean Young
(Source)

It has been possible for a long time to fake video, so what is the cause for the Deepfake epidemic? That answer is easy: the threshold for doing it. Techniques like the Forrest Gump scene require a lot of work to accomplish, with painstaking labor going into manually performing the process. It took them months. With modern technology, it’s easier but the effort involved is still high, often using video that was specifically shot in order to be manipulated. Yet with Deepfake, the ability to create these is as easy as pushing a button. It’s a tool available for the public that doesn’t require decades of industry experience.

Video is not the only medium affected by this. Photography has been around for nearly two centuries and there are numerous accounts of image manipulation throughout history. Some have been attempts at hiding perceived flaws, such as editors removing a pole from the 1971 Pulitzer Prize-winning Kent State Massacre photo, and some have been attempts at censorship, like when Nikolai Yezhov was removed from a 1937 picture with Stalin.

It wasn’t until the nineties, when personal computer ownership skyrocketed, that the public had widespread access to image manipulation software. “Photoshopping” is a household phrase these days, with a plethora of tools available, and nearly every picture in marketing or fashion is manipulated.

Orson Welles’ War of the Worlds, a 1938 radio drama structured like a real news program, led some of the listeners to believe that Earth was being invaded by aliens. Art forgeries date back millennia and counterfeit currency is as old as currency itself. Whether it’s gold, oil, or pixels, the idea is the same: to create a forgery that tricks people.

The good news is that society won’t suddenly end with the advent of Deepfake. It’s a problem that society has encountered plenty of times and we always endured. The bad news is that no one has yet come up with a definitive solution.

The Two Problems

There are two distinct facets of the overall Deepfake problem:

  1. Can you prove that a video is yours and has not been altered by anyone else?
  2. Can you find out whether or not a video you see has been altered by a third party?

Most of the attempts at a solution I have seen target the first question and that is something that is surprisingly easy to solve. The second question, unfortunately, is almost insurmountably hard.

Solution To The First Problem

Let’s take, for example, an interview conducted by a news agency, which for the sake of simplicity is shot entirely in digital. Upon conclusion of the interview, you are left with the raw data and you want to ensure that neither the interviewee nor the interviewer will manipulate the facts afterward. So you calculate the hash of the raw data file and then both the news agency and the interviewee sign the hash cryptographically. You then store the hash and the two signatures on a blockchain.

That’s it. Should a Deepfake version of the interview surface, either the news agency or the interviewee is able to prove in court that the video has been altered by showing up with the raw data and proving that they control the private key that signed its hash. Independent experts will be able to confirm the claim.
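The hash-and-sign step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how Off-Blocks or any Factom tool actually works: the names `build_entry`, `matches_entry`, and `stand_in_sign` are made up for this example, and the "signature" is an HMAC placeholder used only so the sketch runs with the standard library. A real system would use public-key signatures (for example Ed25519) so that anyone can verify them without access to a secret.

```python
import hashlib
import hmac
import io


def sha256_of_stream(stream, chunk_size=1 << 20):
    """Hash a (possibly very large) raw video stream in fixed-size chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def stand_in_sign(secret_key, message):
    # ASSUMPTION: HMAC is only a stdlib placeholder for a real
    # public-key signature scheme; it is NOT publicly verifiable.
    return hmac.new(secret_key, message.encode(), hashlib.sha256).hexdigest()


def build_entry(raw_stream, signers):
    """Create the record that would be written to the blockchain."""
    file_hash = sha256_of_stream(raw_stream)
    return {
        "sha256": file_hash,
        "signatures": {
            name: stand_in_sign(key, file_hash) for name, key in signers.items()
        },
    }


def matches_entry(candidate_stream, entry):
    """True only if the candidate is bit-for-bit the data that was signed."""
    return hmac.compare_digest(sha256_of_stream(candidate_stream), entry["sha256"])


# Illustrative data: both parties sign the hash of the same raw file.
raw = b"raw interview footage" * 1000
signers = {"news_agency": b"agency-key", "interviewee": b"interviewee-key"}
entry = build_entry(io.BytesIO(raw), signers)

altered = raw.replace(b"footage", b"f00tage", 1)
print(matches_entry(io.BytesIO(raw), entry))      # the original raw data
print(matches_entry(io.BytesIO(altered), entry))  # a manipulated copy
```

Only the hash and the signatures go on chain; the raw footage itself stays with the parties, who present it when they need to prove what was originally recorded.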

This is already possible for certain types of documents with existing Factom Protocol solutions like Off-Blocks. This solution lets multiple people sign a document using a variety of identities.

Another solution is IOTSAS – Signed at Source. This is a hardware-based solution that is able to be integrated with, for example, cameras to automatically sign the data. It would not be possible for a Deepfake video to also fake the signature, meaning you can easily show which video was recorded by the camera and which one was not. The additional benefit of an automated hardware solution is that you can expect entries to happen automatically inside certain timeframes. Gaps or delays in the recorded data can then be interpreted as signs of deletion or suppression.
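The idea of reading gaps as warning signs can be illustrated with a small sketch. It assumes a hypothetical camera that writes one signed entry every ten seconds; the function name `find_gaps` and the tolerance value are inventions for this example and not part of IOTSAS.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, expected_interval, tolerance):
    """Return (before, after) pairs where consecutive entries are too far apart."""
    limit = expected_interval + tolerance
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]


start = datetime(2019, 7, 1, 12, 0, 0)
# Hypothetical entry timestamps: entries 4, 5, and 6 never made it on chain.
entries = [
    start + timedelta(seconds=10 * i) for i in range(10) if i not in (4, 5, 6)
]

gaps = find_gaps(entries, timedelta(seconds=10), timedelta(seconds=2))
for before, after in gaps:
    print(f"suspicious gap between {before.time()} and {after.time()}")
```

A gap on its own proves nothing (the camera may simply have lost power), but it tells an investigator exactly which stretch of time deserves a closer look.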

The applications of both of these extend well beyond Deepfake into other kinds of forgeries such as legal contracts or scientific data from weather buoys.

The Second Problem

Let’s assume that there is a video and you want to know if it has been altered in any way. The ideal would be if there was a database you could use to look that up but that’s already where we run into the first roadblock. It’s only possible in limited circumstances. Let’s go on a detour of video encoding.

Excursion: Digital Video Encoding

Video data contains vast amounts of information, which in turn takes up a lot of space. For that reason, video is usually encoded to a “lossy” format, meaning that transforming it from the source produces a result that contains less information than the source. To the human eye, those losses are usually imperceptible at higher qualities. At lower qualities (and therefore lower sizes), they appear in the form of compression artifacts.

Throughout the lifetime of a video, it typically undergoes several of these transformations.

For example in satellite television, the original programming feed goes to a satellite, then to a direct broadcast station, to another satellite, arriving at a satellite dish, where it passes through a low noise converter to a receiver that descrambles the video to let your TV play it. Each step is affected by the atmosphere or particles in the air and can produce transmission errors.

A simulated transmission error (Source)

It’s a little more predictable for something like game streaming. The source video is taken from the graphics card and compressed with a codec such as H.264. The data is sent to an ingest server, such as Twitch’s, which then transcodes it from the original to multiple other formats. The result is replicated to various data centers around the globe. Viewers receive the stream from those data centers, decode it, and finally display it on their monitors.

YouTube videos are completely re-encoded by Google to their own video codec in various qualities and sizes.

By the time the video arrives at your TV, computer, or other devices, it is no longer identical to the original.

Return to Digital Signatures

In order to verify a digital signature such as the ones typically used in the blockchain world, a computer needs the signature itself, the original data, and the public key component of the key that produced the signature. But due to video encoding, you are unlikely to have the original data even in ideal circumstances. If so much as a single bit in the video changes, the signature is no longer valid.
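The single-bit claim is easy to try out. The snippet below is a small demonstration using SHA-256 from the Python standard library on made-up data; since a signature covers the hash, a different hash means the signature no longer verifies.

```python
import hashlib

# Illustrative stand-in for video data.
original = bytearray(b"frame data " * 4096)

tampered = bytearray(original)
tampered[0] ^= 0b00000001  # flip exactly one bit in the first byte

digest_original = hashlib.sha256(original).hexdigest()
digest_tampered = hashlib.sha256(tampered).hexdigest()

differing = sum(a != b for a, b in zip(digest_original, digest_tampered))
print(digest_original)
print(digest_tampered)
print(f"{differing} of 64 hex characters differ")
```

The two digests have nothing visibly in common, which is exactly what makes hashes good for proving integrity and useless for finding "almost the same" video.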

It gets even more difficult under less-than-ideal circumstances. For example, someone uses a projector to throw a video onto the surface of a building. You might not have access to the video data at all or only cell phone recordings of the building.

That’s how the problem shifts from “we need to find the signature that matches this video” to “we need to find the signatures of videos that are similar”. This is not possible using hashes and will require other methods.

Solution from Technology: Content ID

Content ID is a service offered by Google to discover copyright violations. It does this by taking a reference video or audio sample and creating a fingerprint from it. The exact workings of the algorithms are unknown but a simplified version of this process for videos is as follows:

Take the reference and scale it down, which will suppress finer details, then save several different sizes as the “fingerprint”. When you have another video, you also scale it down to the smallest fingerprint size and compare the two for similarity. If the similarity crosses a certain threshold, you compare it to the next larger size, and so on, depending on how accurate you want the comparison to be. Additionally, the system will have to identify video samples that are reversed, upside down, have modified colors, etc. These can also be partial matches, such as the reference video being just the logo of a station.
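The simplified process above can be sketched for a single grayscale frame. To be clear, this is a toy version of the idea as I described it and has nothing to do with Google’s actual algorithm: the function names, the fingerprint sizes, and the 0.95 threshold are all assumptions, and it ignores flipped, recolored, or partial matches entirely.

```python
def downscale(frame, size):
    """Shrink a square grayscale frame (rows of 0-255 values) by block averaging."""
    block = len(frame) // size
    return [
        [
            sum(
                frame[y * block + dy][x * block + dx]
                for dy in range(block)
                for dx in range(block)
            )
            / (block * block)
            for x in range(size)
        ]
        for y in range(size)
    ]


def fingerprint(frame, sizes=(2, 4, 8)):
    """Store the frame at several small sizes."""
    return {size: downscale(frame, size) for size in sizes}


def similarity(a, b):
    """1.0 for identical grids, dropping toward 0.0 as pixel values diverge."""
    diffs = [abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb)]
    return 1.0 - sum(diffs) / (255 * len(diffs))


def matches(reference_fp, candidate, threshold=0.95):
    """Compare coarse to fine; stop as soon as one level falls below the threshold."""
    for size in sorted(reference_fp):
        if similarity(reference_fp[size], downscale(candidate, size)) < threshold:
            return False
    return True


# Synthetic 16x16 frames (illustrative data only).
reference = [[x * 8 + y * 4 for x in range(16)] for y in range(16)]
noisy_copy = [[min(255, p + 2) for p in row] for row in reference]  # mild encoding noise
unrelated = [[255 - p for p in row] for row in reference]           # inverted image

ref_fp = fingerprint(reference)
print(matches(ref_fp, noisy_copy))
print(matches(ref_fp, unrelated))
```

Starting at the smallest size means most non-matching videos are rejected after a very cheap comparison, and only promising candidates are checked at the larger, more expensive sizes.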

Google has spent over $100 million developing this system, so their algorithm is likely much more refined than the example I described. The end result is that they are able to match user-uploaded video to a set of registered reference videos, even if it’s just a partial or somewhat modified match.

The components for a solution seem to exist. Video creators digitally sign their content with a system like Off-Blocks. In order for third parties to verify if a video has been modified or not, they can submit recordings (such as cell phone recordings) to a Content ID-like system that will try to find the original.

This works if the Deepfake copy is similar enough to the original, such as an interview where a few words have been edited out. It might still be sufficient to detect face swaps, but if the changes are significant enough, such as the Forrest Gump example from above, the system would simply be unable to look up an existing record. This limitation already shows up in practice on YouTube, where pirated media can be found that uses a visual distortion (such as a screen glare) to change the video enough to bypass Content ID while still being watchable by humans.

Keeping enough fingerprints and tweaking the algorithm to detect such changes have a dramatic impact on data and computation requirements. It would also be very challenging for a computer to determine whether a difference is a malicious modification or simply a transmission error or another kind of difference stemming from a cell phone recording, like the texture of the wall or shadows.

Solution from the Art World: Provenance

It is possible but expensive to authenticate art. There are physical characteristics in, for example, oil paints that are hard to forge. Brush strokes can be analyzed by experts, paint can be carbon dated, canvases can be x-rayed, and so on. These methods are not exact and can require costly machinery and techniques as well as physical access to the medium, driving up the costs, typically reserving it only for very valuable paintings.

That’s why provenance exists: rather than verifying whether or not the painting itself is real, you keep a ledger of ownership. First-hand transfer of ownership might come in the form of a certificate of authenticity signed by the artist themselves. Every time the painting is sold or gifted, another entry is created.

Under provenance, a painting is considered authentic when it is owned by the person or institution indicated in the ledger, rather than on the basis of a physical analysis of the painting. In real life, this is fairly hard to establish and these ledgers may not be publicly available. An example of such a provenance can be seen for the painting Bride, by Marcel Duchamp, at the Philadelphia Museum of Art, which is presumably backed by the physical evidence.

Applying this concept to the digital world, provenance can be a form of authorized distributor. Since digital data is infinitely reproducible, each authorized distributor could in theory issue further authorizations. Returning to the interview example, a news station can authorize other stations to use their footage. Other stations can then further authorize websites or specific YouTube channels or Twitter accounts. This ledger of authorization would have to be publicly accessible and the video would have to have meta-data attached to it.

Blockchains are particularly suitable for this kind of data given their immutability and public accessibility.

The caveat is that digital provenance would not prevent anyone from copying videos. It would fall on the viewer to look up the provenance of the video and ensure that the broadcaster is authorized to broadcast it, and if they are not, to consider the content of the video as not trustworthy. If there is no metadata attached to the video, we are back to the second problem.
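A viewer-side lookup of such an authorization ledger could look roughly like this. Everything here is hypothetical: the ledger layout, the account names, and the `provenance_chain` function are made up for illustration, and a real ledger would also carry a digital signature on every entry.

```python
# Hypothetical ledger: each entry says "grantor authorizes grantee to distribute video".
# A grantor of None marks the original creator.
LEDGER = [
    {"video": "abc123", "grantor": None, "grantee": "news-station"},
    {"video": "abc123", "grantor": "news-station", "grantee": "partner-station"},
    {"video": "abc123", "grantor": "partner-station", "grantee": "partner-youtube-channel"},
]


def provenance_chain(ledger, video, distributor):
    """Walk from a distributor back to the creator; return None if the chain breaks."""
    grants = {e["grantee"]: e["grantor"] for e in ledger if e["video"] == video}
    chain, seen = [], set()
    current = distributor
    while current is not None:
        if current not in grants or current in seen:
            return None  # unknown distributor or circular authorization
        seen.add(current)
        chain.append(current)
        current = grants[current]
    return list(reversed(chain))


print(provenance_chain(LEDGER, "abc123", "partner-youtube-channel"))
print(provenance_chain(LEDGER, "abc123", "random-uploader"))
```

The lookup only works when the video carries enough metadata to identify both the video and the distributor, which is the weak spot described above.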

An already existing example of this type of system is SSL certificates. When you navigate to an “https” website, the server presents a certificate that has been digitally signed by a central certificate authority (the “ledger” in this analogy). The browser checks that signature against the list of authorities it trusts and verifies that the certificate was issued for the website’s domain, which acts as the meta-data. If something is wrong, the browser will warn you or prevent you from loading the website.

Solution from Technology: Forensics

This is a two-parter. The first part is specific to Deepfake: Adobe and UC Berkeley’s recently published tool to detect Photoshopped images. This is a neural network capable of detecting areas of photos that have been manipulated and a similar concept could be applied to Deepfake videos.

An example of automated Photoshop detection (Source)

This is possible because the tools used to perform these changes aren’t random but follow a specific algorithm and this algorithm potentially leaves fingerprints on the data. I’m not too familiar with the inner workings of the deep learning algorithm but the idea is to run it in “reverse.” Train an algorithm on fake videos in order to learn what distinguishes a fake video from a real video, then the tool can automatically detect Deepfake videos.

The caveat is that this solution is extremely specific to Deepfake and not altered video in general. If another similar-but-not-quite-identical Deepfake algorithm emerges, the tool would have to be trained on the new data.

The second part is human experts and is more about general video manipulation, not Deepfake specifically. While the average person doesn’t have the trained eye to spot fake videos, experts in the field are likely able to discern the tricks used to fabricate the videos. One such expert is the YouTuber Captain Disillusion who analyzes and debunks viral videos.

The caveat is that experts are not able to verify all video on demand but their job would be more about debunking high profile videos and informing the general public. Captain Disillusion not only points out flaws in fabricated videos but goes to great lengths to explain the concepts and techniques used to create them. People who know about these tricks are less likely to be fooled by them in the future or at least be more critical.

Solution from Libraries: Archivists

The biggest problem with solutions to Deepfakes is the inability to look up video. Humans have invested a lot of thinking and energy into creating ways to look up books and documents, like the Dewey Decimal System. These can be used to discover existing media from only limited information, such as the name of an author.

There is no such equivalent for all video, and finding the right video, even if you remember who was in it and what it was about, can be challenging unless you know the right title. There is typically no way to find videos by ancillary data such as the number of people in the video or the color of the sky.

Creating an archive would involve hiring humans to catalog videos based on criteria such as the time it aired, the names of the people in it, colors, objects, languages, etc. If a Deepfake video of a news anchor surfaces, third parties could look up the segment in the archive based on the date, name of the anchor, or even the clothes the person wore (for example, “dark suit and blue striped tie”). This process would extend beyond just automation, involving the expertise of the archivists themselves who may just be able to remember these videos.
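To make the idea of a searchable catalog concrete, here is a tiny sketch of what such a lookup might look like. The records, field names, and `search` function are entirely hypothetical; a real archive would need a far richer schema and the judgment of actual archivists.

```python
# Hypothetical catalog records created by archivists.
CATALOG = [
    {
        "id": "seg-001",
        "aired": "2019-06-30",
        "people": ["Jane Doe"],
        "tags": {"dark suit", "blue striped tie", "studio"},
    },
    {
        "id": "seg-002",
        "aired": "2019-06-30",
        "people": ["John Roe"],
        "tags": {"grey suit", "outdoors"},
    },
    {
        "id": "seg-003",
        "aired": "2019-07-01",
        "people": ["Jane Doe"],
        "tags": {"dark suit", "red tie", "studio"},
    },
]


def search(catalog, aired=None, person=None, tags=()):
    """Return the ids of segments matching every criterion that was supplied."""
    wanted = set(tags)
    return [
        record["id"]
        for record in catalog
        if (aired is None or record["aired"] == aired)
        and (person is None or person in record["people"])
        and wanted <= record["tags"]
    ]


print(search(CATALOG, person="Jane Doe", tags=["dark suit", "blue striped tie"]))
print(search(CATALOG, aired="2019-06-30"))
```

Once the matching segment is found, its archived copy (or its hash) can be compared against the suspicious video.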

The caveat of this solution is that it will be a lot of work to archive media, but real life has already proven that these archives are invaluable. Examples include newspaper archives, the Library of Congress, the Internet Movie Database, and the Internet Archive. The latter includes the Wayback Machine, a tool that archives snapshots of websites that you can look up to detect alterations over time and that has already been used to detect fraud (such as disappearing promises made by companies or warrant canaries).

Solution from Technology: Evolution

I already touched on this earlier: Deepfake has lowered the threshold required to alter video and this has already happened many times throughout history. Yet there is a gap between the invention of a technology and the threshold for manipulating that technology sinking low enough for it to be a problem. Digital video was introduced in the late seventies and early eighties but Deepfake was only introduced recently. The first digital image was made in 1957 but Photoshop didn’t gain popularity until the nineties. Money has been a target of counterfeiters since time immemorial and every time banknotes become easy to reproduce, they are reworked to be harder to forge, with features such as hologram strips or very specific materials such as the USA’s cotton fiber paper.

When a technology is new, manipulating it easily won’t be possible for a while. That means we have to accept that regular video, just like digital images, is just not that trustworthy anymore. Instead, we can move on to newer tech, like 3D video (stereoscopic) and 180°/360° video (virtual reality). This type of technology is significantly harder to alter than plain video and has already appeared in mobile phones.

If we look to the future, we can adopt technology that captures more than just visible light: a broad spectrum camera. Perhaps sensors that capture smells. Scanners that can scan 3D space to create holographic data. All of these already exist as individual components but are not easily accessible or require a lot of setup. Advances in mobile technology could make them available to the public in the future.

With every introduction of new technology, the threshold for manipulation rises again.

Solution from Technology: DRM

DRM, or Digital Rights Management, is a broad classification of techniques used to prevent the copying of media. It is typically employed to combat piracy in the music, movie, and video game industries. It has been the holy grail of copyright holders for a long time but has never been successfully implemented.

So why am I mentioning it? Because it targets exactly the reason that Deepfake has caused such an uproar: the threshold of being able to do it. DRM will always be circumvented by people knowledgeable enough, but the goal is just to reduce the number of people doing it by making it harder. DRM exists as both software-only (such as Windows Media DRM) and hardware-software hybrids (such as Keurig 2.0).

If video data is protected with DRM, the Deepfake app will not be able to read the video files if they are downloaded directly from the source, such as with a YouTube downloader extension. Someone would have to create a way to automatically bypass the DRM, which creates an additional requirement for it to be used in Deepfake, locking out less knowledgeable people.

The huge caveat is the general truth of copyright: if it can be seen or heard, any form of copy protection can be bypassed. An easy way to circumvent it would be a screen recorder or simply taking a cell phone recording of the TV.

Another problem is that adoption of DRM technologies is hard to achieve. Support for the new format would have to be implemented in every device you want to target. In order to add DRM to YouTube videos, for example, you would have to change all major browsers (there are more than five), as well as apps for TVs and other media operating systems.

It’s very high-cost and high-effort for only a marginal gain.

Final Thoughts

Deepfake is a real problem and there are potential solutions to stymie the worst aspects of it (such as proving authenticity in a legal framework), but I don’t foresee a way to return to a pre-Deepfake world. The problem is, however, not new. Humanity has grappled with forgery and deception for a very long time and Deepfake is just the latest iteration in a long line. Even without Deepfake, we cannot take the authenticity of video for granted. The truth is that video manipulation has been a common practice for decades.

That doesn’t mean we shouldn’t try to find a solution but we should do it with a mind open to the idea that there may not be a perfect solution or a single solution. Video distribution technology presents a unique challenge on its own, even without the authenticity aspect.

It seems almost certain, however, that blockchains can find a role in the solution, whether it’s a ledger for provenance, an archive of metadata, or a store of identities.