Mind the (Language) Gap: Mapping the Challenges of LLM Development in Low-Resource Language Contexts | Stanford HAI

Policy White Paper


Date
April 22, 2025
Topics
International Affairs, International Security, International Development
Natural Language Processing
Ethics, Equity, Inclusion
Read Paper
Abstract

This white paper maps the LLM development landscape for low-resource languages, highlighting challenges, trade-offs, and strategies to increase investment; prioritize cross-disciplinary, community-driven development; and ensure fair data ownership.

In collaboration with

Executive Summary

  • Large language model (LLM) development suffers from a digital divide: Most major LLMs underperform for non-English—and especially low-resource—languages; are not attuned to relevant cultural contexts; and are not accessible in parts of the Global South.

  • Low-resource languages (such as Swahili or Burmese) face two crucial limitations: a scarcity of labeled and unlabeled language data, and poor-quality data that does not sufficiently represent the languages and their sociocultural contexts.

  • To bridge these gaps, researchers and developers are exploring different technical approaches to developing LLMs that perform better for and better represent low-resource languages, each with its own trade-offs:

    • Massively multilingual models, developed primarily by large U.S.-based firms, aim to improve performance for more languages by including a wider range of (100-plus) languages in their training datasets.

    • Regional multilingual models, developed by academics, governments, and nonprofits in the Global South, use smaller training datasets made up of 10-20 low-resource languages to better cater to and represent a smaller group of languages and cultures.

    • Monolingual or monocultural models, developed by a variety of public and private actors, are trained on or fine-tuned for a single low-resource language and thus tailored to perform well for that language.

  • Other efforts aim to address the underlying data scarcity problem by focusing on generating more language data and assembling more diverse labeled datasets:

    • Advanced machine translation models enable the low-cost production of raw, unlabeled data in low-resource languages, but the resulting data may lack linguistic precision and contextual cultural understanding.

    • Automated or semi-automated approaches can help streamline the labeling of raw data. Participatory approaches that engage native speakers of low-resource languages throughout the entire LLM development cycle empower local communities while ensuring more accurate, diverse, and culturally representative LLMs.

  • It is crucial to understand both the underlying reasons for these disparities and the paths to addressing them, so that low-resource language communities are not disproportionately disadvantaged by these models and can contribute to and benefit from them on equal terms.

  • We present three overarching recommendations for AI researchers, funders, policymakers, and civil society organizations looking to support efforts to close the LLM divide:

    • Invest strategically in AI development for low-resource languages, including subsidizing cloud and computing resources, funding research that increases the availability and quality of low-resource language data, and supporting programs to promote research at the intersection of these issue areas.

    • Promote participatory research that is conducted in direct collaboration with low-resource language communities, who contribute to and even co-own the creation of AI resources.

    • Incentivize and support the creation of equitable data ownership frameworks that facilitate access to AI training data for developers while protecting the data rights of low-resource language data subjects and creators.
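As a rough illustration of the semi-automated labeling workflow described in the summary above, the sketch below routes model-proposed labels by confidence: high-confidence items are auto-accepted, and the rest are flagged for review by native-speaker annotators. The function name, data fields, threshold value, and example Swahili phrases are all hypothetical, not drawn from the paper.

```python
# Minimal sketch of confidence-based routing for semi-automated labeling:
# a model proposes labels for raw text, and only low-confidence items are
# sent to human (ideally native-speaker) annotators for review.

def route_for_review(predictions, threshold=0.85):
    """Split model predictions into auto-accepted labels and items
    needing human review, based on model confidence."""
    auto_labeled, needs_review = [], []
    for item in predictions:
        if item["confidence"] >= threshold:
            auto_labeled.append(item)
        else:
            needs_review.append(item)
    return auto_labeled, needs_review

# Illustrative predictions for Swahili text (confidences are made up).
preds = [
    {"text": "Habari za asubuhi", "label": "greeting", "confidence": 0.97},
    {"text": "Nimefurahi kukuona", "label": "greeting", "confidence": 0.62},
]
auto, review = route_for_review(preds)
```

In practice the threshold would be tuned per language and task; the point of the participatory framing above is that the `review` queue is handled by members of the language community rather than by crowd workers unfamiliar with the language.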

Authors
  • Juan Pava
  • Caroline Meinhardt
  • Haifa Badi Uz Zaman
  • Toni Friedman
  • Sang T. Truong
  • Daniel Zhang
  • Elena Cryst
  • Vukosi Marivate
  • Sanmi Koyejo
Related
  • Studies Explore Challenges Of AI For Low-Resource Languages
    Tech Brew
    May 05
    media mention

    HAI's white paper shows "Most major LLMs underperform for non-English—and especially low-resource—languages; are not attuned to relevant cultural contexts; and are not accessible in parts of the Global South."

Related Publications

Beyond DeepSeek: China's Diverse Open-Weight AI Ecosystem and Its Policy Implications
Caroline Meinhardt, Sabina Nong, Graham Webster, Tatsunori Hashimoto, Christopher Manning
Issue Brief | Deep Dive | Dec 16, 2025
Topics: Foundation Models; International Affairs, International Security, International Development

Almost one year after the “DeepSeek moment,” this brief analyzes China’s diverse open-model ecosystem and examines the policy implications of these models’ widespread global diffusion.

Moving Beyond the Term "Global South" in AI Ethics and Policy
Evani Radiya-Dixit, Angèle Christin
Issue Brief | Quick Read | Nov 19, 2025
Topics: Ethics, Equity, Inclusion; International Affairs, International Security, International Development

This brief examines the limitations of the term "Global South" in AI ethics and policy, and highlights the importance of grounding such work in specific regions and power structures.

Yejin Choi’s Briefing to the United Nations Security Council
Yejin Choi
Testimony | Quick Read | Sep 24, 2025
Topics: International Affairs, International Security, International Development

In this address, presented to the United Nations Security Council meeting on "Maintenance of International Peace and Security," Yejin Choi calls on the global scientific and policy communities to expand the AI frontier for all by pursuing intelligence that is not only powerful but also accessible, robust, and efficient. She stresses the need to rethink our dependence on massive-scale data and computing resources from the outset, and to design methods that do more with less by building AI that is smaller and serves all communities.

Increasing Fairness in Medicare Payment Algorithms
Marissa Reitsma, Thomas G. McGuire, Sherri Rose
Policy Brief | Quick Read | Sep 01, 2025
Topics: Ethics, Equity, Inclusion; Healthcare

This brief introduces two algorithms that can promote fairer Medicare Advantage spending for minority populations.