Image: An older woman in her living room speaking to her smart speaker, asking it to turn on music.

At this point, bias in AI and natural language processing (NLP) is such a well-documented and frequent issue in the news that when researchers and journalists point out yet another example of prejudice in language models, readers can hardly be surprised. 

However, less well-publicized are the talented minds working to solve these issues of bias, like Caleb Ziems, a third-year PhD student mentored by Diyi Yang, assistant professor in the Computer Science Department at Stanford and an affiliate of Stanford’s Institute for Human-Centered AI (HAI). The research of Ziems and his colleagues led to the development of Multi-VALUE, a suite of resources that aims to address equity challenges in NLP, specifically the performance drops observed for different English dialects. The result could mean AI tools, from voice assistants to translation and transcription services, that are fairer and more accurate for a wider range of speakers.

“It’s no secret that language technologies have issues with equity in their capacity to operate with speakers of different languages and different varieties of language,” Ziems says. “English is a global contact language which individuals from different communities use to interact with the global economy, global markets, and global partners. So it’s important for accessibility that language technologies can handle the disparities and variations in English.” 

Analyzing Grammar, Not Vocabulary

Current language technologies, which are typically trained on Standard American English (SAE), are fraught with performance issues when handling other English variants. “We’ve seen performance drops in question-answering for Singapore English, for example, of up to 19 percent,” says Ziems. Many of these variants are also considered “low resource,” meaning there’s a paucity of natural, real-world examples of people using these varieties. 

Ziems reframed this data-scarcity challenge by looking at what data they do have in abundance. “We used decades of linguistic research housed in a rich online catalog that acts essentially as a structured database of features and rules of these English variants.” By examining the grammatical role that each word plays in a sentence for a specific variant, Ziems could tag and rearrange words, transforming SAE phrases into phrases for different English dialects. For example, the SAE phrase “John was scolded by his boss” would be transformed into Colloquial Singapore English as “John give his boss scold.” 
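To give a flavor of this kind of rule-based transformation, here is a minimal sketch, not Multi-VALUE’s actual code: it rewrites the SAE passive pattern “X was VERBed by Y” into the Colloquial Singapore English “give” passive mentioned above. The regex pattern and the tiny lexicon of base verb forms are illustrative assumptions; a real system would operate on parsed grammatical structure rather than surface strings.

```python
import re

# Toy lexicon mapping past participles to base verb forms (an assumption
# for illustration; a real system would use a morphological analyzer).
BASE_FORMS = {"scolded": "scold", "praised": "praise"}

def sae_passive_to_singlish(sentence: str) -> str:
    """Rewrite 'X was VERBed by Y' as the Singlish 'X give Y VERB'."""
    m = re.match(r"^(.+?) was (\w+) by (.+?)\.?$", sentence)
    if not m:
        return sentence  # the rule does not apply; leave the sentence as-is
    subject, participle, agent = m.groups()
    verb = BASE_FORMS.get(participle, participle)
    return f"{subject} give {agent} {verb}"

print(sae_passive_to_singlish("John was scolded by his boss"))
# John give his boss scold
```

Because each rule fires only when its pattern applies, many such rules can be composed to transform a whole SAE corpus into a parallel dialect corpus.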

As Ziems relates, “Many of these patterns were observed by field linguists operating in an oral context with native speakers, and then transcribed.” With this empirical data and the resulting language rules, Ziems could build a framework for language transformation. Looking at parts of speech and grammatical rules for these dialects enabled Ziems to take an SAE sentence like “She doesn’t have a camera” and break it down into its discrete parts. “We might identify that there’s a negation in there — ‘not’ — and that the verb ‘do’ is connected to that negation.” By analyzing parts of speech in this way, as opposed to just vocabulary, Ziems believes he and the research team have built a robust and comprehensive framework to achieve dialect invariance — constant performance over dialect shifts. 
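The decomposition step described above can be sketched as tagging each token with a coarse grammatical role, so that a rule can locate the negation and the auxiliary “do” attached to it. The role labels and the tiny lexicon here are simplifying assumptions for one example sentence; a real pipeline would use a trained part-of-speech tagger and dependency parser.

```python
# Toy role lexicon (an assumption for illustration only).
ROLE_LEXICON = {
    "she": "PRONOUN", "does": "AUX_DO", "n't": "NEGATION",
    "not": "NEGATION", "have": "VERB", "a": "DETERMINER", "camera": "NOUN",
}

def tag_roles(sentence):
    """Split a sentence into tokens and label each with a coarse role."""
    # Separate the clitic "n't" so negation surfaces as its own token.
    tokens = sentence.lower().replace("n't", " n't").split()
    return [(tok, ROLE_LEXICON.get(tok, "OTHER")) for tok in tokens]

print(tag_roles("She doesn't have a camera"))
```

With roles in hand, a dialect rule can target structure (“find the negation linked to ‘do’”) rather than specific vocabulary.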

Limitations and Next Steps

Though Ziems’ work takes an important step in exposing challenges with language variants, he is quick to acknowledge its limitations. “Dialects aren’t nicely bounded, fixed entities. It’s impossible to stop language from changing and acquiring new features, and even the linguistic observations of these features can shift in frequency between speakers from different regions.” 

Even so, Multi-VALUE allows researchers working with language technologies to build parallel datasets that can be used to train and augment their work. “It’s very inefficient to train a new model for every dialect. It’s just not practical for the real world, so moving forward, we can use these tools to train smaller portions of the model — called adapters — which are swappable and can change to adapt to a certain dialect.” 
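The swappable-adapter idea can be illustrated with a deliberately simplified sketch: one large shared base model, plus a small per-dialect component that can be swapped in without retraining the whole model. Everything here — the class names, the string-rewrite “adapter,” and the Singlish particle mapping — is a hypothetical stand-in, not HAI or Multi-VALUE code.

```python
class BaseModel:
    """Stand-in for a large shared encoder that is trained once."""
    def encode(self, text):
        return text.lower().split()

class DialectAdapter:
    """Small, swappable, dialect-specific component."""
    def __init__(self, dialect, rewrite_rules):
        self.dialect = dialect
        self.rewrite_rules = rewrite_rules

    def adapt(self, tokens):
        # Apply this dialect's rules; unknown tokens pass through unchanged.
        return [self.rewrite_rules.get(tok, tok) for tok in tokens]

model = BaseModel()
# Hypothetical adapter mapping a Singlish discourse particle to a marker.
singlish = DialectAdapter("Colloquial Singapore English", {"lah": "<PARTICLE>"})
tokens = singlish.adapt(model.encode("Can lah"))
print(tokens)
# ['can', '<PARTICLE>']
```

The design point is that only the small adapter varies per dialect; the expensive base model is shared, which is what makes the approach practical compared with training a new model for every dialect.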

Ziems also notes the considerable NLP work already being done on the English language. “Half the world’s population is bilingual or multilingual, and English is just one of the many tools in the toolkit. But it has such an outsized impact on the global economy, and it’s a language where we can efficiently and effectively capture this problem of language equity.”

Stanford HAI’s mission is to advance AI research, education, policy and practice to improve the human condition.
