The quality of AI-generated voices has improved rapidly in recent years, but there are still aspects of human speech that escape synthetic imitation. Sure, AI actors can deliver smooth corporate voiceovers for presentations and advertisements, but more complex performances – a compelling rendition of Hamlet, for example – remain out of reach.
Sonantic, an AI voice startup, says it has made a small breakthrough in the development of audio deepfakes, creating a synthetic voice that can convey subtleties like teasing and flirting. The company says the key to its advance is the inclusion of non-speech sounds in its audio: training its AI models to reproduce the tiny breaths, scoffs, and half-concealed chuckles that give real speech its stamp of biological authenticity.
“We chose love as an overall theme,” Sonantic co-founder and CTO John Flynn told The Verge. “But our research goal was to see if we could model subtle emotions. Bigger emotions are a little easier to capture.”
You can hear the company’s attempt at flirty AI in the video below – though whether you think it captures the nuances of human speech is a subjective question. On first listen I thought the voice was almost indistinguishable from that of a real person, but colleagues at The Verge say they immediately clocked it as a robot, pointing out the eerie pauses between certain words and a slight synthetic wrinkle in the pronunciation.
Sonantic CEO Zeena Qureshi describes the company’s software as “Photoshop for speech.” The interface lets users type in the speech they want to synthesize, specify the mood of the delivery, and then choose from a cast of AI voices, most of which are modeled on real human actors. This is by no means a unique offering (rivals like Descript sell similar products), but Sonantic says its level of customization is deeper than its competitors’.
Emotional options for a voice’s delivery include anger, fear, sadness, happiness, and joy, and, with this week’s update, flirty, coy, teasing, and bragging. A “director mode” allows for even more tweaking: the pitch of a voice can be adjusted, the intensity of the delivery dialed up or down, and non-speech vocalizations like laughs and breaths inserted.
“I think that’s the main difference: our ability to direct and control and edit and shape a performance,” Flynn says. “Our customers are mainly triple-A game studios and entertainment studios, and we’re expanding into other industries. We entered into a partnership with Mercedes [to customize its in-car digital assistant] earlier this year.”
As is often the case with such technology, the real measure of Sonantic’s performance is the audio fresh off its machine learning models, rather than what appears in polished, PR-ready demos. Flynn says the speech synthesized for the flirty video required “very little manual adjustment,” but the company did go through a few different renderings to find the best output.
To get a raw and representative sample of Sonantic’s technology, I asked the company to render the same line (addressed to you, dear reader) in a handful of different moods. You can listen to them yourself and compare.
First off, here’s “flirty”:
And finally, “casual”:
To my ears, at least, these clips are a lot rougher than the demo, which suggests a few things. First, manual polishing is necessary to get the most out of AI voices. This is true of many AI efforts, such as self-driving cars, which have successfully automated the simpler parts of driving but still struggle with the last, all-important 5 percent that defines human competence. It means that fully automated, fully persuasive AI speech synthesis is still a long way off.
Second, I think it shows that the psychological concept of priming can do a lot to trick your senses. The video demo – featuring footage of a real-life human actor who is unsettlingly intimate with the camera – may prompt your brain to hear the accompanying voice as real. Perhaps the most convincing synthetic media are those that combine real and fake elements.
Aside from how compelling the technology is, Sonantic’s demo raises other questions: what are the ethics of deploying a flirtatious AI? Is it fair to manipulate listeners in this way? And why did Sonantic choose to make its flirtatious character female? (It’s a choice that arguably perpetuates a subtle form of sexism in the male-dominated tech industry, where companies tend to code AI assistants as smooth – even flirty – secretaries.)
When asked these questions, the company said its choice of a female voice was simply inspired by Spike Jonze’s 2013 film Her, in which the main character falls in love with a female AI assistant named Samantha. Sonantic also said it recognizes the ethical dilemmas that come with developing new technology and is careful about how and where it deploys its AI voices.
“That’s one of the biggest reasons we’ve stuck to entertainment,” said CEO Qureshi. “CGI isn’t used for just anything – it’s used for the best entertainment products and simulations. We see this [technology] the same way.” She adds that all of the company’s demos include a disclosure that the voice is synthetic (though that wouldn’t count for much if customers used the company’s software to generate voices for more deceptive purposes).
It makes sense to compare AI speech synthesis to other entertainment products. After all, being manipulated by film and TV may be why we make those things in the first place. But there is also something to be said about the fact that AI will enable such manipulation at scale, with less attention paid to its impact in individual cases. All over the world, for example, people are already forming relationships – and even falling in love – with AI chatbots. Adding AI-generated voices to these bots will only make them more powerful, raising questions about how these and other systems should be designed. If AI voices can flirt convincingly, what else can they persuade you to do?