Using AI to detect AI-generated deepfakes can work for audio — but not always

By Huo Jingnan

Published April 5, 2024 at 1:23 AM AKDT

Artificial intelligence is supercharging audio deepfakes, with alarm bells ringing in areas from politics to financial fraud.

The federal government has banned robocalls using voices generated by AI and is offering a cash prize for solutions to mitigate harms from voice cloning frauds. At the same time, researchers and the private sector are racing to develop software to detect voice clones, with companies often marketing them as fraud-detection tools.

The stakes are high. Detection software getting it wrong can carry serious implications.

"If we label a real audio as fake, let's say, in a political context, what does that mean for the world? We lose trust in everything," says Sarah Barrington, an AI and forensics researcher at the University of California, Berkeley.

"And if we label fake audios as real, then the same thing applies. We can get anyone to do or say anything and completely distort the discourse of what the truth is."

As deepfake generation technology improves and leaves ever-fewer telltale signs that humans can rely on, computational methods for detection are becoming the norm.

But an NPR experiment indicated that technological solutions are no silver bullet for the problem of detecting AI-generated voices.

Probably yes? Probably not

NPR identified three deepfake audio detection providers — Pindrop Security, AI or Not and AI Voice Detector. Most claim their tools are over 90% accurate at differentiating between real audio and AI-generated audio. Pindrop only works with businesses, while the others are available for individuals to use.

NPR submitted 84 clips of five to eight seconds to each provider. About half of the clips were snippets of real radio stories from three NPR reporters. The rest were cloned voices of the same reporters saying the same words as in the authentic clips.

The voice clones were generated by technology company PlayHT. To clone each voice, NPR submitted four 30-second clips of audio — one snippet of a previously aired radio story of each reporter and one recording made for this purpose.

Our experiment revealed that the detection software often failed to identify AI-generated clips, or misidentified real voices as AI-generated, or both. While Pindrop Security's tool got all but three samples correct, AI or Not's tool got about half wrong, failing to catch most of the AI-generated clips.

The verdicts these companies provide aren't just a binary yes or no. They give their results in the form of probabilities between 0% and 100%, indicating how likely it is that the audio was generated by AI.

AI Voice Detector's CEO, Abdellah Azzouzi, told NPR in an interview that if the model predicts that a clip was 60% or more likely to be generated by AI, then it considers the clip AI-generated. Under this definition, the tool wrongly identified 20 out of the 84 samples NPR submitted.

AI Voice Detector updated its website after the interview. While the probability percentages for most previously tested clips remained the same, they now include an additional note laying out a new way of interpreting those results. Clips flagged as 80% or more are now deemed "highly likely to be generated by AI." Those scoring between 20% and 80% are "inconclusive." Clips rated less than 20% are "highly likely to be real."
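
To see how much the reading of a score depends on where the lines are drawn, here is a minimal sketch of the two interpretations described above. The function names are hypothetical, the example scores are made up for illustration, and treating a score of exactly 20% as "inconclusive" is an assumption; none of this is AI Voice Detector's actual code.

```python
def verdict_single_cutoff(ai_probability, cutoff=60.0):
    """Binary reading: AI-generated at or above the cutoff, otherwise real."""
    return "ai-generated" if ai_probability >= cutoff else "real"


def verdict_three_bands(ai_probability, low=20.0, high=80.0):
    """Three-way reading based on the note later added to the website.

    Assumption: a score of exactly `low` counts as inconclusive.
    """
    if ai_probability >= high:
        return "highly likely to be generated by AI"
    if ai_probability < low:
        return "highly likely to be real"
    return "inconclusive"


# Hypothetical scores for illustration only; not results from NPR's test.
for score in (10.0, 65.0, 85.0):
    print(score, verdict_single_cutoff(score), verdict_three_bands(score), sep=" | ")
```

The same 65% score is a confident "AI-generated" call under the single cutoff and an "inconclusive" under the three bands, which is why a change in thresholds can change the apparent error count without any change in the underlying model.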

In an email, the company did not answer NPR's questions about why the thresholds changed, but said it's "always updating our services to offer the best to those who trust us." The company also removed from its website the claim that the tool was more than 90% accurate.

Under these revised definitions, AI Voice Detector's tool got five of the clips NPR submitted wrong and returned inconclusive results for 32 clips.
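
The arithmetic behind those two readings can be worked out from the counts reported above. This is a back-of-the-envelope sketch that assumes every clip that was neither wrong nor inconclusive was counted as correct; the variable names are illustrative.

```python
TOTAL_CLIPS = 84


def percent(part, whole):
    """Share of `whole` as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Original reading: a single 60% cutoff, 20 clips wrong.
wrong_single = 20
correct_single = TOTAL_CLIPS - wrong_single       # 64

# Revised reading: 5 clips wrong, 32 clips inconclusive.
wrong_revised = 5
inconclusive = 32
conclusive = TOTAL_CLIPS - inconclusive           # 52
correct_revised = conclusive - wrong_revised      # 47

print(percent(correct_single, TOTAL_CLIPS))       # 76.2
print(percent(correct_revised, TOTAL_CLIPS))      # 56.0
print(percent(correct_revised, conclusive))       # 90.4
print(percent(inconclusive, TOTAL_CLIPS))         # 38.1
```

Under these assumptions, the revised thresholds yield far fewer wrong calls, but only because more than a third of the clips no longer receive a verdict at all.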

The other providers also report results as probabilities, but they did not mark any results as inconclusive.

Using AI to catch AI

While NPR's anecdotal experiment is not a formal test or academic study, it highlights some challenges in the tricky business of deepfake detection.

Detection technologies often involve training machine learning models. Since machine learning underpins most of today's artificial intelligence, people also call this approach "using AI to detect AI."

Barrington has both tested various detection methods and developed one with her team. Researchers curate a dataset of real audio and fake audio, transforming each into a series of numbers that are fed into the computer to analyze. The computer then finds the patterns humans cannot see to distinguish the two.

"Things like in the frequency domain, or very sort of small differences between audio signals and the noise, and things that we can't hear but to a computer are actually quite obvious," says Barrington.
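
As a rough illustration of what "transforming audio into a series of numbers" in the frequency domain can look like, the sketch below turns a short synthetic tone into a magnitude spectrum and two summary numbers. It is a toy example under stated assumptions: the signal is generated rather than recorded, the function names are hypothetical, and real detectors use far richer features and trained models. It is not Barrington's method or any vendor's.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Naive discrete Fourier transform; returns magnitudes for bins 0..n/2."""
    n = len(samples)
    return [
        abs(sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def dominant_frequency(samples, sample_rate):
    """Frequency in Hz of the strongest bin in the spectrum."""
    mags = magnitude_spectrum(samples)
    strongest_bin = max(range(len(mags)), key=mags.__getitem__)
    return strongest_bin * sample_rate / len(samples)


def spectral_centroid(samples, sample_rate):
    """Magnitude-weighted average frequency, a common one-number summary."""
    mags = magnitude_spectrum(samples)
    n = len(samples)
    weighted = sum((k * sample_rate / n) * m for k, m in enumerate(mags))
    return weighted / sum(mags)


# Synthetic 100 Hz tone sampled at 800 Hz (illustrative values, not real audio).
SAMPLE_RATE = 800
tone = [math.sin(2 * math.pi * 100 * t / SAMPLE_RATE) for t in range(80)]

print(dominant_frequency(tone, SAMPLE_RATE))
print(spectral_centroid(tone, SAMPLE_RATE))
```

A detector would compute many such numbers for every clip in a labeled set of real and fake audio, then let a model learn which combinations separate the two.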

Amit Gupta, head of product at Pindrop Security, says one of the things their algorithm does when evaluating a piece of audio is to reverse-engineer the vocal tract — the actual physical properties of a person's body — that would be needed to produce the sound. They called one fraudster's voice that they caught "Giraffe Man."

"When you hear the sequence of sound from that fraudster, it is only possible for a vocal tract where a human had a 7-foot-long neck," Gupta says. "Machines don't have a vocal tract. ... And that's where they make mistakes."

Anatoly Kvitnitsky, CEO of AI or Not, says his company trains its machine learning model based on clients' specific-use cases. As a result, he said, the general-use model the public has access to is not as accurate.

"The format is a little bit different depending on if it's a phone call ... if it's a YouTube video. If it's a Spotify song, or TikTok video. All of those formats leave a different kind of trace."

While often better at detecting fake audio than people, machine learning models can easily be stumped in the wild. Accuracy can drop if the audio is degraded or contains background noise. Model makers need to train their detectors on every new AI audio generator on the market to detect the subtle differences between them and real people. With new deepfake models being released frequently and open source models becoming available for everyone to tweak and use, it's a game of whack-a-mole.

After NPR told AI or Not which provider it used to generate the deepfake audio clips, the company released an updated detection model that returned better results. It caught most of the AI clips, but also misidentified more real voices as AI. Its tool could not process some other clips and returned error messages.

What's more, all of these accuracy rates only pertain to English-language audio. Machine learning models need to analyze real and fake audio samples from each language to tell the difference between them.

While there seems to be an arms race between deepfake voice generators and deepfake voice detectors, Barrington says it's important for the two sides to work together to make detection better.

ElevenLabs, whose technology was used to create the audio for the deepfake Biden robocall, has a publicly available tool that detects its own product. Previously, the website claimed that the tool also detects audio generated by other providers, but independent research has shown poor results. PlayHT says a tool to detect AI voices — including its own — is still under development.

Detection at scale isn't there yet

Tech giants including major social media companies such as Meta, TikTok and X have expressed their interest in "developing technology to watermark, detect and label realistic content that's been created with AI." Most platforms' efforts seem to focus more on video, and it's unclear whether that would include audio, says Katie Harbath, chief global affairs officer at Duco Experts, a consultancy on trust and safety.

In March, YouTube announced that it would require content creators to self-label some videos made with generative AI before uploading them. This follows similar steps from TikTok. Meta says it's also going to roll out labeling on Facebook and Instagram, using watermarks from companies that produce generative AI content.

Barrington says specific algorithms could detect deepfakes of world leaders whose voices are well known and documented, such as President Biden. That won't be the case for people who are less well known.

"What people should be very careful about is the potential for deepfake audio in down-ballot races," Harbath says. With less local journalism and with fact-checkers at capacity, deepfakes could cause disruption.

As for scam calls impersonating loved ones, there's no high-tech detection that flags them. You and your family can come up with questions a scammer wouldn't know the answer to in advance, and the FTC recommends calling back to make sure the call was not spoofed.

"Anyone who says 'here's an algorithm,' just, you know, a web browser plug-in, it will tell you yes or no — I think that's hugely misleading," Barrington says.