The Automatic Speech Recognition (ASR) models that power voice assistants like Amazon Alexa may have difficulty transcribing English speakers with minority dialects.

A study by Georgia Tech and Stanford researchers compared the transcribing performance of leading ASR models for people using Standard American English (SAE) and three minority dialects — African American Vernacular English (AAVE), Spanglish, and Chicano English.

Interactive Computing Ph.D. student Camille Harris is the lead author of a paper on the study, which was accepted to the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), held this week in Miami.

Harris recruited people who spoke each dialect and had them read from a Spotify podcast dataset, which includes podcast audio and metadata. Harris then used three ASR models (wav2vec 2.0, HuBERT, and Whisper) to transcribe the audio and compared their performance.
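To make that kind of comparison concrete, here is a minimal sketch of one common way to score transcripts: word error rate (WER), averaged per speaker group. The article does not say which metric the study used, so WER is an assumption here, and the function names, group labels, and sample sentences are hypothetical illustrations rather than the study's code or data.

```python
from collections import defaultdict


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def mean_wer_by_group(samples):
    """Average per-utterance WER for each group.

    `samples` is an iterable of (group, reference, hypothesis) tuples.
    """
    scores = defaultdict(list)
    for group, reference, hypothesis in samples:
        scores[group].append(word_error_rate(reference, hypothesis))
    return {group: sum(vals) / len(vals) for group, vals in scores.items()}


# Hypothetical, made-up transcripts purely for illustration.
samples = [
    ("group_a", "turn the lights on", "turn the lights on"),
    ("group_b", "turn the lights on", "turn the light on"),
]
print(mean_wer_by_group(samples))
```

A lower score means a more accurate transcript. Averaging per utterance, as this sketch does, weights short and long recordings equally; pooling errors across a whole group before dividing is another common choice and can give different numbers.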

For each model, Harris found that transcription of SAE was significantly more accurate than transcription of each minority dialect. The models transcribed men who spoke SAE more accurately than women who spoke SAE. Participants who spoke Spanglish and Chicano English had the least accurate transcriptions of all the test groups.

While the models transcribed SAE-speaking women less accurately than their male counterparts, that did not hold true across minority dialects. Minority men had the least accurate transcriptions of all demographics in the study.

“I think people would expect if women generally perform worse and minority dialects perform worse, then the combination of the two must also perform worse,” Harris said. “That’s not what we observed. 

“Sometimes minority dialect women performed better than Standard American English. We found a consistent pattern that men of color, particularly Black and Latino men, could be at the highest risk for these performance errors.”

Addressing underrepresentation

Harris said the cause of that outcome starts with the training data used to build these models. Model performance reflected the underrepresentation of minority dialects in the data sets.

AAVE performed best under the Whisper model, which Harris said had the most inclusive training data of minority dialects.

Harris also looked at whether her findings mirrored existing systems of oppression. Black men have high incarceration rates and are one of the groups most targeted by police. Harris said there could be a correlation between that and the low rate of Black men enrolled in universities, which leads to less representation in technology spaces.

“Minority men performing worse than minority women doesn’t necessarily mean minority men are more oppressed,” she said. “They may be less represented than minority women in computing and the professional sector that develops these AI systems.”

Harris also had to account for a few variables among AAVE speakers, including code-switching and various regional subdialects.

Harris noted that some speakers in her study code-switched to SAE. Those who did were transcribed more accurately than those who did not.

Harris also tried to include different regional speakers.

“It’s interesting from a linguistic and history perspective if you look at migration patterns of Black folks — perhaps people moving from a southern state to a northern state over time creates different linguistic variations,” she said. “There are also generational variations in that older Black Americans may speak differently from younger folks. I think the variation was well represented in our data. We wanted to be sure to include that for robustness.”

TikTok barriers

Harris said she built her study on a paper she authored that examined user-design barriers and biases faced by Black content creators on TikTok. She presented that paper at the Association for Computing Machinery’s (ACM) 2023 Conference on Computer-Supported Cooperative Work and Social Computing.

Those content creators depended on TikTok for a significant portion of their income. As captioning videos grew in popularity, those creators noticed that the ASR tool built into the app transcribed them inaccurately. That forced the creators to enter their captions manually, while SAE speakers could use the ASR feature to their benefit.

“Minority users of these technologies will have to be more aware and keep in mind that they’ll probably have to do a lot more customization because things won’t be tailored to them,” Harris said.

Harris said there are ways that designers of ASR tools could work toward being more inclusive of minority dialects, but cultural challenges could arise.

“It could be difficult to collect more minority speech data, and you have to consider consent with that,” she said. “Developers need to be more community-engaged to think about the implications of their models and whether it’s something the community would find helpful.”