In today’s world, it is nearly impossible to avoid voice-controlled digital assistants. From the interactive intelligent agents used by corporations and government agencies to personal devices, automated speech recognition (ASR) systems, combined with machine learning (ML) technology, are increasingly being used as an input modality that allows humans to interact with machines, ostensibly in the simplest and most common way possible: by speaking in a natural, conversational voice.
Yet as a study published in April 2020 by researchers from Stanford University indicated, the accuracy of ASR systems from Google, Microsoft, and other major vendors varies widely depending on the speaker’s race. While the study focused only on the differing accuracy levels for a small sample of African American and white speakers, it points to a larger concern about ASR accuracy and phonological awareness, including the ability to discern and understand accents, tonalities, rhythmic variations, and speech patterns that may differ from the voices used to initially train voice-activated chatbots, virtual assistants, and other voice-enabled systems.
The Stanford study, published in the journal Proceedings of the National Academy of Sciences, measured the error rates of ASR technology from Amazon, Apple, Google, IBM, and Microsoft by running each system on identical phrases (taken from pre-recorded interviews across two datasets) spoken by 73 black and 42 white speakers, then comparing the average word error rate (WER) for the two groups.
The subjects in the recordings that make up the first dataset were from Princeville, a predominantly African American rural community in North Carolina; Rochester, a mid-sized city in western New York state; and the District of Columbia. The second dataset was the Voices of California, an ongoing compilation of interviews recorded across that state, although the focus was on Sacramento, the capital of California, and Humboldt County, a predominantly white rural community in northern California.
The researchers indicated that black subjects spoke in what linguists refer to as African American Vernacular English, a variety of English sometimes spoken by African Americans in urban areas and other parts of the U.S. This is contrasted with the Standard English phrasing most often used by white speakers.
Overall, the researchers found the systems made far fewer errors with white speakers than with black speakers. ASR systems misidentified words about 19% of the time with white speakers, with the WER rising to 35% among black speakers. Approximately 2% of audio snippets from white speakers were considered unreadable by these systems, compared with 20% of snippets from black speakers.
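WER is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn a system's transcript into the reference transcript, divided by the number of words in the reference. The sketch below illustrates that standard definition and a per-group average. The function names and the lowercase-and-split normalization are assumptions made for illustration, not the study's actual pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def mean_wer_by_group(samples):
    """Average WER per group; samples are (group, reference, hypothesis) tuples."""
    rates = {}
    for group, reference, hypothesis in samples:
        rates.setdefault(group, []).append(word_error_rate(reference, hypothesis))
    return {group: sum(values) / len(values) for group, values in rates.items()}
```

Under this definition, a WER of 0.35 means roughly one error for every three words in the reference transcript.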
“Our paper posits that much of the disparity is likely due to the lack of training data on African Americans and African American Vernacular English speech,” explains Allison Koenecke, a Stanford doctoral student in Computational and Mathematical Engineering, and the first author of the study. “It seems like the lack of training data is in particular traced to disparities arising from the acoustic model, as opposed to the language model.”
Acoustic models focus on correctly recognizing the sounds of words despite differences in accents, speech patterns, tone of voice, and diction, whereas language models are designed to recognize the words and phrases speakers are likely to use. According to the study, “Our findings indicate that the racial disparities we see arise primarily from a performance gap in the acoustic models, suggesting that the systems are confused by the phonological, phonetic, or prosodic characteristics of African American Vernacular English rather than the grammatical or lexical characteristics. The likely cause of this shortcoming is insufficient audio data from black speakers when training the models.”
The key to improving ASR accuracy among all speakers is to use a more diverse set of training data, which should include speakers who come from more diverse ethnic, cultural, and regional backgrounds, according to Sharad Goel, a co-author of the study and an assistant professor of management science and engineering at Stanford.
“We have tried to stay away from the blame game and say, ‘oh, we think you’re like, you know, good or bad because you didn’t prioritize it,’ but we really think this is important,” Goel says. “We hope people will change their behavior, especially these five companies, but also more broadly in the speech recognition community, toward improving these outcomes.”
ASR technology companies may be hearing that message loud and clear. An Amazon spokesperson pointed to a statement published after the release of the Stanford study, which noted that “fairness is one of our core AI principles, and we’re committed to making progress in this area … In the last year we’ve developed tools and datasets to help identify and carve out bias from ML models, and we offer these as open source for the larger community.”
Other vendors that use ASR technology say that ML models, despite their complexity and capabilities, require a good deal of human oversight, particularly while they are being trained. In some cases, ASR technology developers use a relatively limited range of voices, speech patterns, or accents to train their acoustic models, with the goal of rapidly developing a solution that can be commercially deployed. While this approach may yield a high degree of accuracy for speakers who resemble the training voices, the resulting model may struggle with accents or dialects that differ from them.
“So, you could build out a quick and dirty solution that is very powerful, but it would fold over at the first hurdle because it doesn’t understand the accent, doesn’t understand the terminology, doesn’t even understand my language, and so on and so on,” says Andy Peart, chief marketing and strategy officer at Artificial Solutions, a Stockholm, Sweden-based developer of the Teneo enterprise-focused conversational platform. “We would argue that you need to think about all these things to build out something that’s actually going to be effective.”
Peart says Artificial Solutions uses a hybrid ML approach to training. ML is used for the initial training of the models, but human engineers are deployed to make sure that the system continually learns on the right inputs, which can include matching speaker voice inflections and pronunciations to the appropriate words or intents.
Further, the system is designed to assign a confidence ratio to the ASR model's interpretation of each voice input. If the confidence ratio falls below a certain threshold, the system asks the speaker for clarification, such as by asking, “Did you mean _____?”
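The confidence check described above can be sketched in a few lines. This is a minimal illustration, not the Teneo platform's API: the `Recognition` type, the `respond` function, and the 0.75 threshold are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical cutoff; a deployed system would tune this on real traffic.
CONFIDENCE_THRESHOLD = 0.75


@dataclass
class Recognition:
    transcript: str    # best-guess text produced by the ASR model
    confidence: float  # assumed to lie between 0.0 and 1.0


def respond(result: Recognition, threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Act on confident transcripts; otherwise ask the speaker to confirm."""
    if result.confidence >= threshold:
        return f"OK: {result.transcript}"
    return f"Did you mean '{result.transcript}'?"


print(respond(Recognition("check my balance", 0.92)))
print(respond(Recognition("check my ballots", 0.41)))
```

The first call is accepted as-is, while the second falls below the threshold and produces a clarifying question instead of an action.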
“We don’t settle for learning [solely] within the solution, because then you potentially get the Microsoft Tay situation, where your solution automatically learns and changes from the inputs without any control from the company. This would be catastrophic in a commercial environment,” Peart says. He is referring to Tay, a Microsoft chatbot that learned from user input without supervision; users were able to train it to spew racist and otherwise offensive content because human engineers did not moderate its inputs or responses.
Other ASR vendors note the initial training data should be diverse, in order to function accurately for all types of users. “In order to train really good machine learning models, you need a large amount of data, but you also need diverse data,” says Johann Hauswald, co-founder and chief customer officer with Clinc, a conversational AI platform provider based in Ann Arbor, MI.
“We recommend customers use crowdsourcing platforms to collect training data,” Hauswald says, citing as examples Amazon Mechanical Turk and CrowdFlower (now Figure Eight), which include more diverse speaker data. “We take the approach of crowdsourcing that [training] data and not [relying solely on] a small set of folks collecting and training our data.”
Hauswald says the other advantage of using data from crowdsourced platforms is the ability to collect a wider range of words or phrases that mean the same thing, thereby expanding the lexicon of the ASR system (such as correctly identifying that “y’all” is a shortened, slang version of “you all” in Southern U.S. dialects). He notes the platforms ask the same question across a broad, diverse range of speakers, which increases the depth of the training model to account for ethnic, regional, gender, and other differentiators.
“You get a large amount of data, but then also from a diverse set of people,” Hauswald says. “It’s not one person giving you 500 utterances, and it’s not 500 people giving you a single [phrase].”
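One way to check a collected dataset against that description is to count how many distinct speakers contributed and how much of the data comes from the most prolific one. The sketch below is illustrative only; the `speaker_coverage` function and the `(speaker_id, text)` record layout are assumptions, not part of any crowdsourcing platform's tooling.

```python
from collections import Counter


def speaker_coverage(utterances):
    """Summarize how evenly utterances are spread across speakers.

    `utterances` is an iterable of (speaker_id, text) pairs.
    """
    per_speaker = Counter(speaker for speaker, _ in utterances)
    total = sum(per_speaker.values())
    return {
        "speakers": len(per_speaker),
        "utterances": total,
        # Fraction of all utterances contributed by the single largest speaker.
        "max_share": max(per_speaker.values()) / total if total else 0.0,
    }
```

A `max_share` close to 1.0 signals the "one person giving you 500 utterances" case, while a dataset with as many speakers as utterances is the opposite extreme.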
According to Hauswald, ASR systems struggle with heavily accented speech simply because there is significantly more training data consisting of non-accented English than there is of foreign- or minority-accented speech. Hauswald says ASR algorithms identify speech by looking for sound patterns, then linking them to appropriate words, which requires some human intervention to ensure that even when sounds are pronounced differently (such as ‘r’ sounds being pronounced as ‘l’ sounds), the correct word is chosen. With less foreign-accented data available to analyze, it becomes more difficult to identify patterns that can be used to train the model accurately. One solution is simply to collect speech data from accented speakers, train ASR models on it, and then use humans to ensure that the model correlates the accented pronunciations with the correct words. However, collecting enough speech data for each individual accent is fraught with compute, time, and data-collection challenges.
One way to speed up this process is to use transfer learning, a technique in which an ASR model is first trained on a large dataset, such as speakers using non-accented English. What the model has learned about phonetic and speech patterns then can be applied to a second, smaller dataset containing accented-English speech. The parameters learned from the first dataset are used as the starting point for training on the second dataset, which speeds up the learning process and allows the new training to focus on the unique pronunciations found in the accented speech.
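The idea of reusing learned parameters as a starting point can be shown with a deliberately tiny example. This is a toy sketch, not an ASR system: a two-parameter linear model stands in for the acoustic model, and the synthetic `source` and `target` datasets, the function names, and the step counts are all assumptions chosen for illustration.

```python
def train(data, weights, steps, lr=0.1):
    """Fit y ~ w*x + b by gradient descent on mean squared error,
    starting from the given (w, b) weights."""
    w, b = weights
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def mse(data, weights):
    """Mean squared error of the model (w, b) on the data."""
    w, b = weights
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


# Large "source" dataset (stands in for plentiful non-accented speech).
source = [(i / 50, 2.0 * (i / 50) + 1.0) for i in range(-50, 51)]
# Small "target" dataset with a related but shifted pattern
# (stands in for scarce accented speech).
target = [(x, 2.2 * x + 1.4) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]

pretrained = train(source, (0.0, 0.0), steps=500)
fine_tuned = train(target, pretrained, steps=10)     # starts from pretrained weights
from_scratch = train(target, (0.0, 0.0), steps=10)   # same budget, no transfer

print(f"fine-tuned MSE on target:   {mse(target, fine_tuned):.4f}")
print(f"from-scratch MSE on target: {mse(target, from_scratch):.4f}")
```

With the same ten-step training budget on the small dataset, the model that starts from the pretrained weights ends with a lower error than the one trained from scratch, because most of what it needs was already learned from the larger dataset.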
“For languages or dialects that have less training data, research has shown you can use a language that has more data and use transfer learning to refine a model for the target language,” Hauswald says. He explains that approach has become popular, “initially in image processing, but now the same techniques are being applied to natural language processing and speech recognition pretty successfully. But you still need to go through that step of kind of hand annotating, sanitizing, and cleaning that data.”
Racial Disparities in Automated Speech Recognition, Proceedings of the National Academy of Sciences of the United States of America, April 7, 2020. https://doi.org/10.1073/pnas.1915768117
Gender and Dialect Bias in YouTube’s Automatic Captions, Conference Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, January 2017. https://doi.org/10.18653/v1/W17-1606
Acoustic Modeling Explained: https://www.youtube.com/watch?v=5ktDTa8glaA