Why AI companies can’t stop listening to your voice recordings


This week, Facebook came under fire for having hired hundreds of contractors to listen to and transcribe users’ conversations.

Last week, a Vice Motherboard report revealed that Microsoft contractors were listening to personal conversations of Skype users recorded through the app’s AI translation service, as well as to voice commands sent to Cortana, the company’s AI-powered voice assistant.

This no longer comes as a surprise. Microsoft and Facebook aren’t the first tech companies whose employees or remote contractors listen to users’ voices. Amazon, Google and Apple have been caught doing the same thing with their voice assistants in the past year (and they’ve all used cleverly worded EULAs to obtain users’ consent without explicitly telling them that humans would listen to their voice recordings). Earlier this month, Apple and Google halted their programs for listening to audio recordings.

But this is not a tirade about the privacy concerns of voice assistants and smart speakers (though that is an important topic). In this post, I’ll be diving into why every company that offers a voice assistant inevitably resorts to hiring human workers (often low-paid) to correct the stupid mistakes its AI algorithms make.

Voice assistants and deep learning

AI-powered voice assistants and translation services use deep learning, the branch of artificial intelligence that develops behavior through experience. At the heart of deep learning algorithms are artificial neural networks, software structures that are especially good at finding correlations and patterns in vast sets of data.

If you train a neural network with multiple audio recordings of the same word with different accents and background noises, it will tune its inner parameters to the statistical regularities shared by the different samples and will be able to detect the same word in new audio recordings. Likewise, if you provide a neural network with different texts corresponding to the same request, it will be able to respond to the different ways of uttering the same command.
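To make this concrete, here’s a minimal sketch of that training process. It’s illustrative only: synthetic feature vectors stand in for real audio features such as MFCCs, and the tiny network is nothing like a production speech model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

N_SAMPLES, N_FEATURES = 200, 40   # e.g., 40 MFCC coefficients per clip

# Synthetic stand-in for real recordings: clips of the target word
# cluster around one feature pattern; the jitter plays the role of
# different accents and background noises. Other clips are random.
word_pattern = torch.randn(N_FEATURES)
positives = word_pattern + 0.5 * torch.randn(N_SAMPLES // 2, N_FEATURES)
negatives = torch.randn(N_SAMPLES // 2, N_FEATURES)
x = torch.cat([positives, negatives])
y = torch.cat([torch.ones(N_SAMPLES // 2), torch.zeros(N_SAMPLES // 2)])

model = nn.Sequential(nn.Linear(N_FEATURES, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

# Training tunes the network's inner parameters toward the statistical
# regularities shared by the positive samples.
for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x).squeeze(), y)
    loss.backward()
    optimizer.step()

# A new, never-seen recording of the same word should now score high.
new_clip = word_pattern + 0.5 * torch.randn(N_FEATURES)
print(torch.sigmoid(model(new_clip)).item())  # should be close to 1.0
```

The point of the sketch is the last line: the network recognizes a recording it has never seen, because that recording shares the statistical regularities of the training samples.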

Deep learning and neural networks have helped solve problems that were historically challenging for classic, rule-based software systems. This includes speech recognition, natural language processing (NLP), machine translation, and computer vision. These are tasks that were previously thought to require human intelligence.

It is thanks to deep learning and neural networks that you can talk to Alexa almost as if you were talking to another person (as long as you don’t ask it anything too complicated—but we’ll get to that later).

Anthropomorphizing neural networks

Ironically, the biggest strength of neural networks also amplifies their greatest weakness. Given the complicated tasks they perform, neural networks and deep learning applications are often mistaken for or compared to human intelligence.

But despite the remarkable feats they perform (and the name they’ve inherited from their biological counterparts), neural networks and deep learning algorithms are vastly different from the human mind.

Neural networks are only as good as their training data. The more quality training data you provide to a neural network, the better it will become at performing its intended task. Also, the narrower the problem domain a neural network tackles, the less data it will need to reach high accuracy. Consequently, lack of training data and broad problem domains are two of the worst enemies of deep learning.

Large tech companies usually have access to vast stores of data to train their AI. But the problem with voice assistants is that they are tackling a very broad problem domain, and they create the wrong expectations in users. They have human names and human-like voices, and their commercials always give the impression that you can ask them anything.

The Wizard of Oz effect

When you apply deep learning to an open and limitless domain, you never have enough training data. No matter how much you train your AI model, there will always be edge cases, scenarios that the neural network has not seen before. That’s why the companies that develop these services must constantly collect new data and retrain their AI models. This means they must monitor users’ behavior for things that confuse their AI.
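Here’s a hedged sketch of what that monitoring might look like (all names are hypothetical, not any vendor’s actual pipeline): commands the model is unsure about get set aside for human review and later retraining.

```python
import random

CONFIDENCE_THRESHOLD = 0.75
review_queue = []

def predict(audio_clip):
    """Hypothetical stand-in for a deployed speech model: returns the
    guessed intent and the model's confidence in that guess."""
    return "play_music", random.random()

def handle_command(audio_clip):
    intent, confidence = predict(audio_clip)
    if confidence < CONFIDENCE_THRESHOLD:
        # An edge case the network hasn't seen enough of: store the
        # clip so a human can label it for the next training run.
        review_queue.append((audio_clip, intent, confidence))
    return intent

for clip in ["clip_001.wav", "clip_002.wav", "clip_003.wav"]:
    handle_command(clip)

print(f"{len(review_queue)} clips flagged for human review")
```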

Another problem is that the neural networks used in voice assistants rely on supervised learning, in which human operators must annotate the training examples. When a voice assistant finds a certain command confusing, it can’t figure out the real meaning for itself. A human operator must map it to the right command and steer the AI in the right direction.

This is why the companies hire human contractors to listen to the voice recordings and annotate them with the right labels, which the neural network will then use to fine-tune its inner parameters. And this entails a host of privacy and ethical concerns.
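As a purely illustrative sketch (every name here is hypothetical), the annotate-and-retrain loop boils down to this:

```python
# Recordings the assistant flagged as confusing.
flagged_clips = ["clip_017.wav", "clip_042.wav"]

# Step 1: human annotation. In practice this happens in a
# transcription tool; here the mapping is hard-coded for illustration.
human_labels = {
    "clip_017.wav": "set_alarm",     # the model had guessed "play_music"
    "clip_042.wav": "call_contact",  # the model couldn't parse this at all
}

# Step 2: the corrected pairs join the supervised training set...
training_set = [(clip, human_labels[clip]) for clip in flagged_clips]

# Step 3: ...and the model is fine-tuned on them, nudging its inner
# parameters so it handles these edge cases next time.
# model.fine_tune(training_set)  # hypothetical training call
print(training_set)
```

Multiply this loop across millions of users and you get the armies of contractors that keep making the news.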

The Microsoft story is just the latest manifestation of the “Wizard of Oz” effect, where companies try to automate tasks with AI technologies, but end up using human labor to perform those same tasks or to train the AI to avoid repeating its mistakes.

As the AI encounters more and more edge cases and is retrained to handle them, it will become better and better. With more training, the need for human help will become less significant. But when your problem domain is too broad, chasing edge cases turns into an endless war of attrition, and humans will always remain a part of the equation.

One stark example of this is moderating online content with AI, a task that requires the common sense, reasoning and abstract thinking that neural networks don’t possess.

What this means is that, if you’re using a general-purpose voice assistant like Siri, Cortana or Alexa, you can expect it to become smarter. But those smarts will continue to come at the expense of your data.
