This is the secret to how Apple is making Siri sound more human
The next Siri won’t put the emPHAsis on the wrong sylLAble.
That’s more or less the promise Apple made during last week’s Worldwide Developers Conference keynote. Demonstrating onstage, Apple’s senior vice president of Software Engineering, Craig Federighi, asked Siri about the weather.
“Here’s the forecast for the next three days: Sunny, sunny, and sunny,” replied Siri.
Each “sunny” sounded a shade different. Though Federighi declared it “very powerful,” the developer audience didn’t break into wild applause.
Maybe that’s a victory in itself. With the upcoming iOS 11, the now 6-year-old Siri will sound so natural that no one will notice her at all. By notice, I mean those cringe-worthy moments when Siri (or any voice assistant, really) attempts to pronounce a name or location, or offers a more natural reply, and sounds like it swallowed a fly mid-sentence. (My personal favorite is when Siri mangles the name of my hometown.)
Part of that is a result of how Siri’s voice was originally built. As Susan Bennett, the woman widely considered to be the first voice of Siri, recounted to The Guardian late last year, Nuance, which built Siri’s original voice recognition and response, had her record “hundreds and hundreds of sentences and phrases created to get all sound combinations in the phrases.”
And, no, she wasn’t recording, “The Weather in El Paso is 100 degrees and sunny.”
Instead, Bennett and others who were the original Siri voices recorded sentence after sentence that didn’t make any sense. Things like “Fasa, ask fasa ask sati” and “Say the shrading again, say the shraeding again.”
With all those speech parts, Siri could construct reasonable facsimiles of voice responses for a dizzying array of questions, even if they didn’t all sound exactly human.
Now, though, Siri’s on everything from the iPhone to Apple TV to the Mac to the Apple Watch (and, soon, Apple’s HomePod). She also handles, according to Apple, 2 billion voice requests each week and responds with at least as many sentences. So Siri’s mispronunciations and occasionally halting responses are almost inescapable.
It was time for a change, though, in truth, Siri is always changing.
Last year, Apple told me they had given Siri what amounted to a brain transplant, without much fanfare. They started applying machine learning to the natural language processing and saw improvements in speech recognition and understanding queries over background noise.
Now, Apple is taking the same machine-learning-powered approach to Siri’s own speech.
Siri’s voice in iOS 11, Apple told me, is totally new.
Building Siri’s voice still starts with snippets of recorded audio that are woven together into Siri’s audio responses. While it’s not clear if Apple still uses nonsense sentences, the company does say Siri can say anything.
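As a rough illustration of how recorded snippets can be woven into new speech (a minimal sketch, not Apple’s actual pipeline; the unit labels and sample values below are invented):

```python
# Minimal sketch of concatenative synthesis: recorded speech is cut into
# small units (here, fake sound-pair labels mapped to short sample lists),
# and new utterances are assembled by concatenating stored units.
# All data is invented for illustration.

# Hypothetical "recordings": each unit is a short list of audio samples.
unit_inventory = {
    "s-uh": [0.1, 0.3, 0.2],
    "uh-n": [0.2, 0.4, 0.1],
    "n-ee": [0.3, 0.2, 0.0],
}

def synthesize(unit_sequence):
    """Concatenate stored units into one waveform (no smoothing)."""
    waveform = []
    for name in unit_sequence:
        waveform.extend(unit_inventory[name])
    return waveform

# "sunny" assembled from three stored units:
audio = synthesize(["s-uh", "uh-n", "n-ee"])
```

A real system works with far smaller units, smooths the joins, and keeps many alternate recordings of each unit, which is where the machine learning described below comes in.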
That’s because the technology used to create cogent sentences is the same one that helped Siri better understand you.
To stitch those parts into responses, Apple is using machine learning, specifically deep learning and neural networks, a sub-discipline of machine learning that seeks to replicate the way brains function and learn.
To make the responses sound more natural, Apple fed examples of real people speaking into its machine learning system. It analyzed the nuances of human speech: when people take a breath, how voices rise and fall within a single sentence, and, of course, emphasis and intonation.
The algorithm also looks at sentence construction to learn why the same word, placed in three different positions in one sentence, should be pronounced in three distinctly different ways.
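One way to picture that "same word, three positions" problem is unit selection: a model predicts a pitch target for each word position, and each candidate recording of the word is scored on how well it matches that target plus how smoothly it joins what came before. This sketch is purely illustrative; the candidate data, cost weights, and function names are hypothetical, not Apple's implementation:

```python
# Illustrative unit-selection sketch: pick, for each word position, the
# stored recording that best matches a predicted pitch target (target cost)
# while joining smoothly to the previous unit (join cost).
# All numbers and names are invented.

candidates = {
    "sunny": [
        {"pitch": 180.0, "end_pitch": 175.0},  # bright, rising reading
        {"pitch": 150.0, "end_pitch": 140.0},  # neutral, mid-sentence reading
        {"pitch": 120.0, "end_pitch": 110.0},  # falling, sentence-final reading
    ],
}

def select(word, target_pitch, prev_end_pitch):
    """Pick the candidate minimizing target cost + weighted join cost."""
    def cost(c):
        target_cost = abs(c["pitch"] - target_pitch)        # match predicted prosody
        join_cost = abs(c["pitch"] - prev_end_pitch)        # blend with prior unit
        return target_cost + 0.5 * join_cost
    return min(candidates[word], key=cost)

# In "Sunny, sunny, and sunny," each position gets a different pitch
# target, so a different recording wins each time:
first = select("sunny", target_pitch=180.0, prev_end_pitch=180.0)
last = select("sunny", target_pitch=120.0, prev_end_pitch=140.0)
```

In the real system the targets would come from a trained neural network rather than hand-set numbers, but the shape of the decision, trading prosodic fit against smooth joins, is the same.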
These are things we don’t really notice, because it’s simply the way we all speak – unless, that is, we’re computers.
Apple is preparing to blur that line, not so much to fool people into thinking Siri is human, but to shift the focus from the way Siri speaks to the information the digital assistant provides. This will become especially important as Siri becomes more conversational. In iOS 11, you’ll be able to dive deeper into Siri responses by tapping the screen, and then asking a follow-up question. That give and take will put more pressure on Apple to make Siri sound as normal (or real) as possible.
It will be interesting to see how this translates around the world. Siri is now in 36 countries, covering 21 different languages, and Apple is launching a new Siri translation feature with five languages and more coming soon.
Maybe we won’t have to tell Siri to “Habla con naturalidad” (“Speak naturally”), because she’ll already be doing it.