Every day millions of Standard English speakers enjoy the benefits provided by natural language processing (NLP) models. But for speakers of African American Vernacular English (AAVE), technologies like voice-operated GPS systems, digital assistants, and speech-to-text software are often problematic because large NLP models are frequently unable to understand or generate words in AAVE. Even worse, models are often trained on data scraped from the web and are prone to incorporating the racial bias and stereotypical associations that are rampant online. When companies use these biased models to help make high-stakes decisions, AAVE speakers can find themselves unfairly restricted from social media, inappropriately denied access to housing or loan opportunities, or unjustly treated in the law enforcement or judicial systems.
For the past 18 months, machine learning specialist Jazmia Henry has focused on finding a way to responsibly incorporate AAVE into language models. As a fellow at the Stanford Institute for Human-Centered Artificial Intelligence and the Center for Comparative Studies in Race and Ethnicity, she has created an open-source corpus of more than 141,000 AAVE words to help researchers and builders design models that are both inclusive and less susceptible to bias.
“My hope with this project is that social and computational linguists, anthropologists, computer scientists, social scientists, and other researchers will poke and prod at this corpora, do research with it, wrestle with it, and test its limits so we can grow this into a true representation of AAVE and provide feedback and insight on our potential next steps algorithmically,” Henry says.
In this interview, she describes the early obstacles in developing this database, its potential to help computational linguists understand the origins of AAVE, and her plans post-Stanford.
How do you describe African American Vernacular English?
To me, AAVE is a language of perseverance and uplift. It’s the result of African languages thought to have been lost during the slave trade migration that have been incorporated into English to create a new language used by the descendants of those African peoples.
How did you become interested in including AAVE in NLP models?
When I was a child, both my parents occasionally spoke their native languages. For my Caribbean father that was Jamaican patois, and for my mother it was Gullah Geechee, spoken in the coastal areas of the Carolinas and Georgia. Each language was a creole, which is a new language created by blending different languages. Everyone seemed to understand that my parents were speaking a different language, and no one doubted their intelligence. But when I saw people in my community speaking AAVE, which I believe to be another creole language, I could tell that there was a shame and stigma associated with it — a sense that if we used this language outside, we were going to be judged as being less intelligent. When I began working in data science, I wondered what would happen if I tried to collect data on AAVE and incorporate it into NLP models so we could really begin to understand it and improve the performance of these models.
How did your project evolve, and what obstacles did you encounter?
There were a lot of obstacles, and in the end I had to change my objective. AAVE evolves much more quickly than many languages and often turns Standard English on its head, giving words entirely new meanings. For example, the word “mad” is often defined as meaning “angry.” In AAVE, however, it’s frequently used to mean “very,” as in “mad funny.” AAVE can also be largely defined by the situation, the speaker, and the tone being used, things that language processing models don’t take into consideration. I eventually decided to create a corpus of AAVE, which is broken down into four collections. The lyric collection includes the words to 15,000 songs by 105 artists ranging from Etta James and Muddy Waters all the way up to Lil Baby and DaBaby. The leadership collection includes speeches from consequential individuals ranging from Frederick Douglass and Sojourner Truth to Martin Luther King and Ketanji Brown Jackson. The most difficult to put together has been the book collection, because African Americans are grossly underrepresented in the literary canon, but I’ve included works from historically Black book archive collections from universities. Finally, the social media collection is the most robust and diverse and includes video transcripts, blog posts, and 15,000 tweets, all collected from Black thought leaders.
How do you hope your project will be used?
I know the corpus is beginning to be used, but I don’t yet know by whom or for what purpose. My hope is that this preliminary work inspires researchers to enter this space, question it, and push it forward to make sure AAVE is represented in the languages used in NLP. Social and computational linguists may be able to use this to help determine if AAVE is in fact its own language or dialect and to look for links between it and other African languages, particularly ones that have not been recorded or preserved in Western history.
Growing up, we learned what was taken from our enslaved ancestors and from their descendants. AAVE may be the proof that everything wasn’t taken away and that we were able to retain some of who we were in the way we communicate with each other. That knowledge has the potential to remove shame and inject pride. When I’m saying “What up, my brother?” I’m not being unintelligent; I’m being strategic and calling on our ancestors with that conversation.
How well does today’s natural language processing reflect the broader community?
Not only does it not reflect the broader community, it also actively discriminates against that community. Large language models that struggle to understand or generate words in AAVE are more likely to exacerbate stereotypes about Black people generally, and these biased associations are being codified within these models. When they’re commercialized, these models — and their biases — can result in companies making unfair decisions that affect the lives of AAVE speakers. This can result in everything from individuals having their social media disproportionately edited or removed from platforms to discrimination in areas such as housing, banking, and the law enforcement and judicial systems.
What should NLP developers be thinking about as they build tools?
There have been some popular NLP models that incorporate a lot of bias. Companies are working to scale back these problematic models, but that’s often followed by a focus on risk mitigation over bias mitigation. Rather than try to find solutions, companies will sometimes take the approach of saying “Let’s not touch AAVE or anything that has to do with Blackness again, because we didn’t do it right the first time.” Instead, they should be asking how they can do it correctly now. This is the time to build models that are better, that improve on processes, and that come up with new ways to work with languages such as AAVE, so larger companies don’t continue to perpetuate harm.
What are your plans moving forward as you leave Stanford?
I’m starting a new job at Microsoft, where I’ll be working as a senior applied engineer for the autonomous systems team with Project Bonsai. We’re increasing deep reinforcement learning capabilities with something we call “machine teaching,” which is essentially teaching machines how to perform tasks that can make humans more productive, improve safety, and allow for autonomous decision-making using AI. This work gives me the chance to improve people’s lives, and I’m so grateful for the opportunity.