Mind the (Language) Gap: Mapping the Challenges of LLM Development in Low-Resource Language Contexts | Stanford HAI

Policy White Paper

Date: April 22, 2025
Topics: International Affairs, International Security, International Development; Natural Language Processing

Abstract

This white paper maps the LLM development landscape for low-resource languages, highlighting challenges, trade-offs, and strategies to increase investment; prioritize cross-disciplinary, community-driven development; and ensure fair data ownership.

Executive Summary

  • Large language model (LLM) development suffers from a digital divide: Most major LLMs underperform for non-English—and especially low-resource—languages; are not attuned to relevant cultural contexts; and are not accessible in parts of the Global South.

  • Low-resource languages (such as Swahili or Burmese) face two crucial limitations: a scarcity of labeled and unlabeled language data, and poor-quality data that is not sufficiently representative of the languages and their sociocultural contexts.

  • To bridge these gaps, researchers and developers are exploring different technical approaches to developing LLMs that perform better for and better represent low-resource languages; each approach comes with its own trade-offs:

    • Massively multilingual models, developed primarily by large U.S.-based firms, aim to improve performance for more languages by including a wider range of (100-plus) languages in their training datasets.

    • Regional multilingual models, developed by academics, governments, and nonprofits in the Global South, use smaller training datasets made up of 10-20 low-resource languages to better cater to and represent a smaller group of languages and cultures.

    • Monolingual or monocultural models, developed by a variety of public and private actors, are trained on or fine-tuned for a single low-resource language and thus tailored to perform well for that language.

  • Other efforts aim to address the underlying data scarcity problem by focusing on generating more language data and assembling more diverse labeled datasets:

    • Advanced machine translation models enable the low-cost production of raw, unlabeled data in low-resource languages, but the resulting data may lack linguistic precision and contextual cultural understanding.

    • Automated or semi-automated approaches can help streamline the labeling of raw data, while participatory approaches that engage native speakers of low-resource languages throughout the LLM development cycle both empower local communities and yield more accurate, diverse, and culturally representative LLMs.

  • It is crucial to understand both the underlying reasons for these disparities and the paths to addressing them, so that low-resource language communities are not disproportionately disadvantaged by these models and can contribute to and benefit from them equally.

  • We present three overarching recommendations for AI researchers, funders, policymakers, and civil society organizations looking to support efforts to close the LLM divide:

    • Invest strategically in AI development for low-resource languages, including subsidizing cloud and computing resources, funding research that increases the availability and quality of low-resource language data, and supporting programs to promote research at the intersection of these issue areas.

    • Promote participatory research that is conducted in direct collaboration with low-resource language communities, who contribute to and even co-own the creation of AI resources.

    • Incentivize and support the creation of equitable data ownership frameworks that facilitate access to AI training data for developers while protecting the data rights of low-resource language data subjects and creators.

Authors
  • Juan Pava
  • Haifa Badi Uz Zaman
  • Caroline Meinhardt
  • Toni Friedman
  • Sang T. Truong
  • Daniel Zhang
  • Elena Cryst
  • Vukosi Marivate
  • Sanmi Koyejo
Related
  • Studies Explore Challenges Of AI For Low-Resource Languages (Tech Brew, May 05, media mention)

    HAI's white paper shows "Most major LLMs underperform for non-English—and especially low-resource—languages; are not attuned to relevant cultural contexts; and are not accessible in parts of the Global South."

Related Publications

Policy Implications of DeepSeek AI’s Talent Base
Amy Zegart, Emerson Johnston
Policy Brief | Quick Read | May 06, 2025

This brief presents an analysis of Chinese AI startup DeepSeek’s talent base and calls for U.S. policymakers to reinvest in competing to attract and retain global AI talent.

Escalation Risks from LLMs in Military and Diplomatic Contexts
Juan-Pablo Rivera, Gabriel Mukobi, Anka Reuel, Max Lamparth, Chandler Smith, Jacquelyn Schneider
Policy Brief | May 02, 2024

In this brief, scholars explain how they designed a wargame simulation to evaluate the escalation risks of large language models (LLMs) in high-stakes military and diplomatic decision-making.

Response to USAID's Request for Information on AI in Global Development Playbook
Caroline Meinhardt, Toni Friedman, Haifa Badi Uz Zaman, Daniel Zhang, Rodrigo Balbontín, Juan Pava, Vyoma Raman, Kevin Klyman, Marietje Schaake, Jef Caers, Francis Fukuyama
Response to Request | Mar 01, 2024

In this response to the U.S. Agency for International Development’s (USAID) request for information on the development of an AI in Global Development Playbook, scholars from Stanford HAI and The Asia Foundation call for an approach to AI in global development that is grounded in local perspectives and tailored to the specific circumstances of Global Majority countries.

Fei-Fei Li's Testimony Before the Senate Committee on Homeland Security and Governmental Affairs
Fei-Fei Li
Testimony | Sep 14, 2023

We have arrived at an inflection point in the world of AI, largely propelled by breakthroughs in generative AI, including increasingly sophisticated language models like GPT-4. These models have revolutionized various sectors from customer service to adaptive learning. However, the scope of intelligence is far broader than linguistic capability alone. In my specialized field of computer vision, we have also witnessed remarkable advancements that empower machines to analyze and act upon visual information—essentially teaching computers to 'see.'
