The fundamental problem with smart speakers and voice-based AI assistants

This article is part of Demystifying AI, a series of posts that (try to) disambiguate the jargon and myths surrounding AI.

Since the Amazon Echo shipped in late 2014, smart speakers and voice assistants have been advertised as the next big thing. Nearly four years later, despite the millions of devices sold, it’s clear that, like many other visions of the tech industry, that perception was an overstatement. A case in point: most people aren’t using Alexa to make purchases, one of the main advertised use cases of Amazon’s AI-powered voice assistant.

Voice assistants existed before the Echo: Apple released Siri for iOS devices in 2011. But the Echo was the first device where voice was the only user input medium, and the years since have made the limits of voice more prominent.

To be clear, voice assistants are very useful, and their applications will continue to expand into an increasing number of domains in our daily lives, but not in the omnipresent way that the term “AI assistant” implies. The future of voice lies in integrating artificial intelligence into plenty of narrow settings and tasks, rather than in a broad, general-purpose AI assistant that can fulfill anything and everything you can think of.

The technology underlying voice assistants

To better understand the extent of the capabilities of voice assistants, we need to understand the technology that underlies them. Like much cutting-edge software, voice assistants are powered by narrow artificial intelligence, the kind of AI that is extremely efficient at performing specific tasks but unable to make general, abstract decisions like the human mind.

To be more specific, voice assistants leverage two branches of AI: voice recognition and natural language processing (NLP). When a user utters a command to Alexa, the voice recognition part converts the sound waves into written words. The NLP part then takes those words and processes the commands they contain.
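
As a rough illustration of that two-stage flow, here is a minimal Python sketch. Everything in it is hypothetical: fake_transcribe merely decodes text bytes that stand in for audio, and parse_command uses hand-written rules where a real assistant would rely on trained models. It only shows how the output of one stage feeds the next.

```python
from dataclasses import dataclass


@dataclass
class Command:
    intent: str    # what the user wants done
    argument: str  # the detail attached to the request


def fake_transcribe(audio: bytes) -> str:
    """Stand-in for the voice recognition stage (hypothetical).

    A real system would run a speech model on sound waves; this stub
    just decodes bytes that we pretend are audio.
    """
    return audio.decode("utf-8").lower().strip()


def parse_command(text: str) -> Command:
    """Stand-in for the NLP stage: turn words into a structured command."""
    words = text.split()
    if words[:1] == ["play"]:
        return Command("play_music", " ".join(words[1:]))
    if words[:3] == ["set", "timer", "for"]:
        return Command("set_timer", " ".join(words[3:]))
    return Command("unknown", text)


def handle(audio: bytes) -> Command:
    """Chain the two stages: sound -> words -> command."""
    return parse_command(fake_transcribe(audio))


print(handle(b"Play jazz"))
print(handle(b"Set timer for ten minutes"))
```

Keeping the two stages as separate functions mirrors the split described above: either one could be swapped for a better model without touching the other.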

Both voice recognition and NLP have existed for quite a while. But advances in machine learning, deep learning and neural networks in recent years have fundamentally changed the way voice recognition and NLP work. For instance, when you provide a neural network with thousands or millions of voice samples and their corresponding words, it learns on its own a model that can turn voice commands into written text.

This is a major shift from the traditional way of creating software, where developers had to manually write the rules to parse sound waves, a process that is both very arduous and error-prone.

Likewise, NLP uses the same learn-by-example approach to parse the different nuances of human language and understand the underlying commands. This is the technology that powers many of today’s powerful applications such as chatbots and Google’s highly accurate translation engine.

The problem with integrating too many commands into smart speakers

Voice recognition is a relatively narrow field. This means that, given enough samples, you can create a model that recognizes and transcribes voice commands under different circumstances and with different background noises and accents.

However, natural language processing is the challenging part of smart speakers, because it’s not a narrow field. Let’s say you have a voice assistant that can perform three or four specific commands. You provide its AI with enough samples of different ways that a user might utter those commands, and it develops a nearly flawless model that can understand and execute all the different ways those commands are sent.
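
To make the narrow case concrete, here is a toy Python sketch of a three-command assistant that works from examples. It is an assumption-laden stand-in, not how Alexa or any real product works: the TRAINING phrases are invented, and simple word overlap replaces the neural network a production system would train.

```python
# Invented example phrasings for three commands (hypothetical data).
TRAINING = {
    "lights_on": ["turn on the lights", "switch the lights on", "lights on please"],
    "play_music": ["play some music", "put on a song", "start the music"],
    "set_timer": ["set a timer", "start a timer for ten minutes", "timer for five minutes"],
}


def tokens(text: str) -> set:
    """Split an utterance into a set of lowercase words."""
    return set(text.lower().split())


def overlap(utterance: str, example: str) -> float:
    """Jaccard similarity: shared words divided by all distinct words."""
    a, b = tokens(utterance), tokens(example)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(utterance: str):
    """Return (intent, score) for the most similar training example."""
    best_intent, best_score = None, 0.0
    for intent, examples in TRAINING.items():
        for example in examples:
            score = overlap(utterance, example)
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent, best_score


print(classify("please turn the lights on"))
print(classify("play a song"))
```

With only three intents, a handful of examples per command already covers phrasings that were never listed verbatim, which is the point the paragraph above makes about narrow command sets.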

This model works as long as the smart speaker performs only those few specific tasks and its users know that those are its only functions. But that is not how the Amazon Echo and its counterparts, the Google Home and Apple HomePod, work. For instance, Amazon enables developers to create new skills for its Alexa-powered devices, and since its release, the Echo has created a vast skills market around itself with more than 30,000 skills.

The problem with adding too many skills to a voice assistant is that there’s no way for users to memorize the list of voice commands they can and can’t give the AI assistant. As a result, when an AI assistant can perform too many tasks, users will expect it to understand and do anything they tell it.

But no matter how many functions and capabilities you add to an AI assistant, you’ll only be scratching the surface of the list of tasks that a human brain can come up with. And voice assistants suffer from the known limits of deep learning algorithms, which means they can only work in the distinct domains they’ve been trained for. As soon as you give them a command they don’t know about, they’ll either fail or start acting in erratic ways.
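
A small Python sketch shows what hitting that boundary looks like. It is only an illustration under stated assumptions: the KNOWN_COMMANDS list and the 0.6 cutoff are made up, and fuzzy string matching from the standard library's difflib stands in for a trained language model. The sketch chooses to fail explicitly by returning None rather than guess.

```python
import difflib

# The only commands this hypothetical assistant was built for.
KNOWN_COMMANDS = [
    "turn on the lights",
    "turn off the lights",
    "play music",
    "set a timer",
]


def resolve(utterance: str, cutoff: float = 0.6):
    """Return the closest known command, or None if nothing is close enough.

    The cutoff is an arbitrary threshold chosen for this sketch.
    """
    matches = difflib.get_close_matches(
        utterance.lower(), KNOWN_COMMANDS, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


print(resolve("turn on the light"))  # close to a known command
print(resolve("order a pizza"))      # outside the trained domain
```

The slightly misspoken request still resolves, while the request from outside the domain has nothing to land on. Lowering the cutoff would not fix that; it would only trade explicit failures for wrong guesses.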

An alternative is to create a general-purpose AI that can do anything the user tells it. But that is general AI, something that is at least decades away and beyond the capabilities of current blends of AI. With today’s technology, if you try to tackle a problem domain that is too broad, you’ll end up having to add humans to the loop to make up for the failures of your AI.

The visual limits of voice assistants

The skills problem is something you don’t face on desktop computers, laptops and smartphones. That’s because those devices have a display and a graphical user interface (GUI) that clearly define the capabilities and boundaries of each application. When you fire up a Windows or Mac computer, you can quickly see the list of applications that have been installed on it and get a general sense of the tasks you can perform with them.

With a smart speaker, you can use a computer or mobile device to see the list of skills that have been installed on the speaker. But that means you have to go out of your way and use a second device that can probably already perform the task you wanted to accomplish with your smart speaker in the first place.

An alternative would be to add a display to your smart speaker, as the Echo Show and the Echo Spot have done. But when you put a display on your smart speaker, you will probably add touch screen features to it too. The next thing you know, the main user interface becomes the display and touch screen, and the voice function becomes an optional, secondary feature. That’s exactly the role Siri plays on iOS and macOS devices.

Another problem with voice is that it’s not suitable for complex, multistep tasks. Take the shopping example we mentioned at the beginning of the article. When shopping, users want to be able to browse among different choices and weigh different options against each other. That is something that is hard to do when you don’t have a display. So, in the case of shopping, a smart speaker or a voice assistant might be suitable for buying the usual household items such as detergent and toilet paper, but not clothes or electronic devices, where there is a lot of variety and difference.

Other tasks such as making reservations, which would require going back and forth between different screens or menu items when performed on a screen-based device, would be equally challenging when ported to a voice assistant.

For most users of smart speakers, playing music, setting timers and calendar schedules, turning on the lights and other simple tasks constitute the majority of their interactions.

The future of AI and voice assistants

All this said, I don’t see voice assistants going away any time soon. But they will find their real use in environments where users want to perform simple tasks. Instead of seeing single devices that can perform many voice commands, we will probably see the emergence of many devices that can each perform a limited number of voice commands. This will become increasingly possible as the cost of hardware drops and the edge AI processor market develops.

Take the smart home, for instance. Many experts expect computation and connectivity to soon become an inherent and inseparable characteristic of most home appliances. It’s easy to imagine things like light bulbs, ovens and thermostats being able to process voice commands, either through a connection to the cloud or with local hardware. Unlike a smart speaker sitting in your living room, a light bulb or an oven can take very few commands, which means there’s little chance that users will become confused about their options or start giving commands that the voice AI doesn’t understand.

I expect voice-based AI to be successful in hotels, where guests want to perform a limited range of functions. I can also imagine users being able to plug their own AI assistant, such as Alexa or Cortana, into their hotel room. The assistant would be better at parsing their voice commands and would carry a digital profile of their lighting and air conditioning preferences, which it could apply autonomously.

Cars are another suitable environment for voice assistants. Again, the functions a user performs inside a car are limited (open trunk, lock doors, play music, turn on the windshield wipers, set navigation course…), and it’s a setting where many users would enjoy the hands-free experience of a voice assistant and prefer it to performing tasks manually.

But the true potential of AI and voice assistants can manifest itself in AR headsets. In augmented reality settings, users must accomplish different complex tasks while also interacting with the physical world, which means they won’t be able to use input devices such as keyboards and mice. With the help of other technologies such as eye tracking and brain-computer interfaces (BCI), AI assistants will enable users to interact with their virtual and physical environments in a frictionless way.

Voice recognition and voice assistants are very promising branches of AI. But their potential might be a little different from our expectations.

