Our research team is constantly looking at emerging technologies, from conversational AI to connected products (IoT) to augmented reality. We’re having conversations internally all the time about new technologies and platforms, best practices for conducting research on them, and how to create better experiences for all users. Recently, we had a conversation about voice that we’d like to share with you to keep the discussion going more broadly.
I hear a lot of questions about how well voice interfaces can detect and understand international accents or how they can account for speech impediments, national origin, and cultural nuances. Academic research echoes what we’ve seen in user research: users with accents or who use “non-standard” English are often misunderstood by these systems. As a result, they often feel left out of the future being created, further marginalized by technologies that are supposed to enrich our lives, simplify our tasks, and bring us closer together. Clearly, we have to do better. So, let’s dive into this question of how well voice interfaces can understand accents and speech impediments.
TL;DR
Voice Recognition either understands something or it doesn’t. It doesn’t say, “that’s a strong accent” or “that’s a speech impediment,” it just says “I know what they’re saying” or “I don’t know what they’re saying.” As a machine learning implementation, it understands what was included in its training data. If accents, speech impediments, etc. were not included in sufficient numbers and variety in its training data, it simply does not recognize them.
Let’s dive deeper.
First of all, my explanation above is problematic. Voice Recognition doesn’t “understand” anything or “say” anything. It’s software. And this is the problem with much of the common perception of AI: it gets anthropomorphized, and we project our understanding of human cognition and behavior onto it. Artificial Intelligence and Machine Learning are not like human intelligence or learning.
How does voice recognition work?
Basically, the software breaks what it hears into its component parts and compares them to what’s in its database to try to identify a match. Once it finds one, it assigns a “Confidence Score,” which reflects how sure it is that it got it right. With a score of 99%, the system is all but certain it matched what it “heard” with what it “knows.” With 70% confidence, it’s not really sure. Speech recognition systems respond to user requests differently based on that confidence score.
For example, from high confidence to low confidence, here’s how a speech recognition system might respond to your request to add a piece of pie to your food order.
“Adding a piece of pie to your order.”
“I think you said ‘a piece of pie,’ is that right?”
“I’m sorry, I didn’t understand. Could you repeat your order?”
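To make that branching concrete, here’s a minimal Python sketch of how an application might route a request based on a confidence score. The respond() helper and the 0.90/0.60 cutoffs are illustrative assumptions, not any platform’s actual API; real systems expose confidence differently and tune these thresholds per use case.

```python
# A minimal sketch (not any platform's actual API) of confidence-based routing.
# The respond() function and the 0.90 / 0.60 thresholds are illustrative assumptions.

def respond(transcript: str, confidence: float) -> str:
    """Pick a response strategy based on how confident the recognizer is."""
    if confidence >= 0.90:
        # High confidence: act on the request directly.
        return f"Adding {transcript} to your order."
    if confidence >= 0.60:
        # Medium confidence: confirm before acting.
        return f"I think you said '{transcript}', is that right?"
    # Low confidence: ask the user to repeat the request.
    return "I'm sorry, I didn't understand. Could you repeat your order?"

print(respond("a piece of pie", 0.97))  # acts on the request
print(respond("a piece of pie", 0.72))  # asks for confirmation
print(respond("pizza pie", 0.41))       # asks the user to repeat
```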
Voice recognition either correctly recognizes the words, incorrectly recognizes the words (for example, it might identify “piece of pie” as “pizza pie”), or doesn’t recognize the words. For a deeper dive into the technical specifics of voice recognition, I recommend reading this in-depth explanation. And if you're interested in the language behind the natural language processing, check out Cheryl Platz's conversational design primer.
What about accents and speech impediments?
If the system doesn’t have enough examples of accents or speech impediments in its training set, it won’t be able to match the sounds it recorded with the examples it has, resulting in a lower confidence score. Most systems allow users to train the speech recognition software on their own pronunciations by repeating a selection of words a number of times (how many words and how many repetitions depends on the system and how much the user’s speech differs from the existing data). However, most users are unaware of this opportunity to train the system and simply accept or reject it as is. Time and time again, our research shows people rarely return to a voice app if it fails to meet their expectations on first use.
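As a toy illustration of why training coverage matters (this is not a real acoustic model), the sketch below can only score a pronunciation highly if it resembles one of its stored examples. The simplified phonetic strings, the TRAINING_EXAMPLES table, and the recognize() helper are all invented for the example.

```python
# A toy illustration, not a real recognizer: the "model" can only match
# pronunciations that resemble what is already in its examples.
from difflib import SequenceMatcher

TRAINING_EXAMPLES = {
    "P IY S AH V P AY": "piece of pie",   # one canonical pronunciation only
    "P IY T S AH P AY": "pizza pie",
}

def recognize(heard: str) -> tuple[str, float]:
    """Return the best-matching phrase and a similarity-based confidence."""
    best_phrase, best_score = "", 0.0
    for example, phrase in TRAINING_EXAMPLES.items():
        score = SequenceMatcher(None, heard, example).ratio()
        if score > best_score:
            best_phrase, best_score = phrase, score
    return best_phrase, best_score

# A pronunciation close to the stored example scores high...
print(recognize("P IY S AH V P AY"))   # ('piece of pie', 1.0)
# ...while an accent variant the model never saw scores lower,
# even though a human listener would understand it without trouble.
print(recognize("P EY S OH F P OY"))
```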
How does AI learn local cultural nuances?
This is a big question, but the short answer is that it varies depending on whether you’re talking about colloquialisms (e.g., slang), social norms (e.g., politeness), or cultural information (e.g., what the speaker means when they say “God bless you”). If we’re talking about colloquialisms, the system only learns them if they appear in its training data. As for social norms and cultural information, it doesn’t currently “learn” them at all; it only recognizes them if they were intentionally and specifically hand-coded into the system.
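Here’s a hypothetical sketch of what “hand-coded” can look like in practice, assuming a simple phrase-to-intent lookup. The phrase list, intent names, and interpret() helper are made up for illustration; real systems typically define these mappings in their natural language understanding configuration rather than in application code.

```python
# A hypothetical sketch of hand-coded cultural and colloquial phrases.
# The phrases and intent names are illustrative assumptions.

CULTURAL_PHRASES = {
    "god bless you": "acknowledge_sneeze",   # a polite response, not a literal blessing
    "fixin' to": "intent_future_action",     # regional colloquialism for "about to"
}

def interpret(utterance: str) -> str:
    """Fall back to hand-coded rules for phrases the model has no data for."""
    for phrase, intent in CULTURAL_PHRASES.items():
        if phrase in utterance.lower():
            return intent
    return "unknown_intent"  # anything not explicitly coded is simply not recognized

print(interpret("God bless you!"))       # acknowledge_sneeze
print(interpret("I'm fixin' to leave"))  # intent_future_action
print(interpret("Bless your heart"))     # unknown_intent (never hand-coded)
```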
To summarize, the ways we talk about artificial intelligence end up confusing the conversations about AI-driven experiences. Artificial Intelligence isn’t actually “intelligent.” Machine Learning doesn’t actually “learn.” They are essentially pattern recognition engines. So the question of what they “understand” is ultimately a question of how good they are at recognizing patterns and whether they have sufficient data to do so. For more information on how much AI is actually needed for these systems, check out our article on the subject.
The companies working on Natural Language Processing are making significant headway with these challenges, which is why I maintain that the user experience of voice interfaces is quickly becoming more of a design challenge than a technical one. And like all design, user research is a critical component to getting it right.
Want to learn more? Read more about our voice expertise.