Why Our Digital Assistants Say the Dumbest Things!

Ever wondered why our digital assistants often fail in such spectacular and hilarious fashion? The surprising answer can be found within one of the most basic human skills.

Is your digital assistant having a hard time understanding you?

You know. Like when you ask: “Is it going to rain tonight?” and get back a cheerful, “Rain. A noun. Definition: Moisture condensed from the atmosphere that falls visibly in separate drops.”

YouTube is full of similarly hilarious voice-recognition fails.

Turns out there’s a good reason why Siri, Cortana, Alexa and other personal digital assistants sometimes sound like they’ve missed a day of school.

Because human speech is enormously complex.

At first glance it may not seem that way. To us humans, speech is intuitive and comes almost naturally. That’s why the big tech companies have wisely chosen it as the gateway to their technologies and beyond. But to machines – even ones backed by bright minds and enormous processing power – human speech is anything but intuitive.

And that’s a big deal when you consider that a who’s who of tech firms, including Apple, Google, Amazon and Microsoft, are pouring tons of resources into machine learning and artificial intelligence to help us navigate the immense world of online services. In fact, an estimated 7 billion consumer devices worldwide will be managed by voice activation by 2020 [1], making it virtually effortless for us to buy a TV online, order a pizza, summon an Uber ride or stream our favorite tunes.

Impressive Strides, But Still a Way to Go

Don’t get me wrong. Recent gains in voice recognition have been impressive. A decade ago, voice recognition software posted a dismal 80 percent error rate [2]; that’s eight misheard words out of every 10. Now the error rate is down to about eight percent or so. While remarkable, that’s still too high, especially if you’re unfortunate enough to be in that eight percent. Messing up a song title may be funny and harmless, but messing up driving directions could be tragic.
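For the curious, here is a rough sketch of how an error rate like that can be computed. The sources cited here don't spell out how their figures were measured, so this assumes the metric commonly used for speech recognition, word error rate: the number of substituted, dropped and inserted words divided by the number of words actually spoken. The function name and the sample phrases are my own, for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length.

    Assumption: words are split on whitespace and compared case-insensitively.
    Real evaluations usually normalize punctuation and numbers as well.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (standard Levenshtein dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into zero words takes i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # a spoken word the machine dropped
            insertion = curr[j - 1] + 1  # an extra word the machine invented
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    spoken = "is it going to rain tonight"
    heard = "is it going to train tonight"
    print(f"{word_error_rate(spoken, heard):.0%} of the words were misheard")
```

One quirk worth knowing: because inserted words count as errors too, this number can climb above 100 percent when the machine hears far more words than were actually spoken.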

The low-hanging fruit all but plucked, the tech giants now face the daunting task of driving that error rate from eight to zero. Plenty of factors stand in the way.

For starters, there are nearly 7,000 distinct languages across the world, which places a massive burden on developer resources. In France alone, at least 10 distinct Romance languages are spoken, including Picard, Gascon, Provençal and several others in addition to “French.” Plus, many of the languages spoken worldwide are highly complex. The English vocabulary boasts about 200,000 words, many with multiple meanings.

Adding to the complexity are such factors as the multiple ways words are pronounced, dialects, slang expressions, irony and sarcasm, all of which punctuate many languages. And just like their human counterparts, machines have a particularly difficult time picking up on accents.

Environmental factors that aren’t even speech-related, such as background noise and echoes, make the task of understanding human speech even more daunting.

Undaunted, the tech firms have turned to some pretty cutting-edge science to crack the code on human speech. Much of it is too complex to discuss in this blog – packed with words like “Mel frequency cepstral coefficients,” “Fourier spectrums” and “time-frequency domains” – but if you’re interested, check out a fascinating read.

We Aren’t Doing the Machines Any Favors

To top it all off, the machines aren’t getting much help from you and me. Instead of forcing our devices to adapt to us – which is the essence of machine learning – we’ve got things backwards and are trying to adapt to the machine. How? When addressing your digital assistant, how many times have you raised your voice, slowed down your speech or exaggerated each individual word in the hopes of increasing the odds of comprehension?

It turns out that speaking to our devices this way is actually counterproductive. That’s because every time we talk to a device, it learns by identifying the context in which we give instructions, understanding phrases, picking up words and so on. Sure, it may hiccup from time to time, but that’s how learning works for both humans and machines.

If we want our devices to truly get smarter, we need to speak to them naturally, as if we were having a conversation with a friend rather than addressing a wayward child.

Will our devices ever get to 100 percent speech-recognition accuracy worldwide? Frankly, I’d be surprised. The world is so vast, and languages so complicated and ever-changing, that errors are bound to happen. After all, we humans still misunderstand each other plenty often, too. It will be interesting to watch it play out.

So the next time your assistant responds “swing or samba?” when you ask it to play David Bowie’s “Let’s Dance” or steers you to Manchester, England, instead of Manchester Drive across town, now you know why.


[1] http://electronics360.globalspec.com/article/8856/7-billion-digital-assistants-in-connected-devices-by-2020

[2] https://www.wired.com/2016/04/long-form-voice-transcription/