How Can AI Support Language Digitization and Digital Inclusion?

Date

February 26, 2026

Topics

Ethics, Equity, Inclusion

International Affairs, International Security, International Development

Natural Language Processing

Read Paper

Abstract

This white paper analyzes the various ways AI tools can advance language digitization work and provides recommendations for responsibly realizing the potential of AI in supporting the digital inclusion of digitally disadvantaged languages.

Executive Summary

In the wake of rapid AI development, attention is increasingly being drawn to the fact that most AI systems fail to serve most of the world’s linguistic communities. Data scarcity is often highlighted as a key reason, yet far more basic digital foundations are prerequisites for building AI training datasets.
Over 6,000 of the world’s 7,000-plus living languages remain digitally disadvantaged, meaning that they are unsupported across mainstream devices, operating systems, browsers, and applications. Language communities excluded from digital systems can participate only minimally in a world increasingly mediated by technology and, at the same time, cannot generate the data needed to be represented in AI.
Empowering digitally disadvantaged language communities to participate in today’s digital world requires holistic progress on a set of foundational language tools (from script encoding to keyboard layouts) and supporting language tools (from grammar checkers to accessibility features).
A global network of language practitioners, scholars, and grassroots groups has been working tirelessly to create and sustain these language tools. Yet progress is often slow and uneven amid chronic underfunding and a lack of coordination.
AI has the potential to scale and accelerate language digitization. In recent years, scholars have begun leveraging AI — and especially natural language processing tools — to sidestep major bottlenecks in the field:
- In the early stages of language digitization, AI tools such as grapheme-to-phoneme systems, morphological analyzers, optical character recognition systems, and image generation models can assist with script development and foundational language infrastructure tooling.
- Once a language can effectively be rendered on devices, AI tools such as language identification models, optical character recognition systems, and automatic speech recognition systems can support language transcription and broader documentation and data collection processes.
- In the final stages of language digitization, AI tools such as machine translation, grammar- and spell-checking systems, text-to-speech systems, forced alignment tools, and large language models increasingly serve as the foundation for the supporting digital tools that help ensure true digital inclusion.
It is important to note that some language communities may choose not to develop a writing system for their language. Technical approaches are emerging that enable the creation and use of digital tools for spoken-only languages.
While all these nascent efforts are promising, AI alone cannot address the field’s more fundamental research problems, workflow bottlenecks, and adoption challenges. Language digitization is also an inherently community-centric process that requires a deeply sensitive cultural and linguistic understanding. Much of the work in this field should thus continue to be driven by the language communities themselves, with AI as an accompanying tool.
Additional work, time, and resources need to be invested in harnessing AI for language digitization in a way that centers communities and their individual needs and contexts. We outline detailed recommendations for different stakeholders to work together to advance language digitization and digital inclusion in the age of AI, including:
- Building trust and empowering communities by fostering community-engaged convenings and collaborations, and building community-driven benchmarks and standards for digital language tools.
- Strengthening research foundations by creating reliable resources to track progress on digitally disadvantaged languages; investing in, expanding, and evaluating effective AI tools for language digitization; and creating forums for interdisciplinary exchanges.
- Improving workflows by moving to parallel workflows for language digitization and leveraging AI for organizational improvements.
- Forming coalitions by implementing mechanisms to reform incentive structures surrounding language tool adoption and strengthening storytelling for general audiences.
- Ensuring cultural sustainability by empowering culturally aware AI development and promoting contextualized adoption and impact assessment.

Visualizing the potential role of AI in language digitization

A flow diagram titled “The potential role of AI tools in language digitization processes.” This diagram illustrates a staged pathway for language digitization and maps different artificial intelligence tools to specific phases within that process. The layout progresses horizontally from left to right, beginning with languages that exist only in oral form and culminating in languages that have advanced digital capabilities. Across the top of the diagram, a banner emphasizes that all digital inclusion pathways and tools should be community-centered, accessible, and respectful of data rights. Beneath the main progression, the diagram indicates that community deliberation and choice must also underpin each stage of the digitization process.

The pathway begins with a language that is oral only. The next stage involves script development, including the creation of a new writing system and the standardization of orthography. Once a language is written, activities include collecting and analyzing written data. The process then moves to the creation of supporting digital infrastructure, which includes encoding the script into Unicode, developing keyboards, creating fonts and typefaces, designing user interfaces, and adding the language to global digital registries. After this infrastructure is in place, the language becomes digitized through the collection of digitized audio and text data. From there, the language may gain basic digital affordances, such as machine translation, indexing and search functions, and spell checking. In the final stage, the language attains advanced digital affordances, including voice assistants, e-learning tools, and other natural language processing applications.

Positioned beneath these stages are labeled boxes identifying AI tools and describing their potential use cases. Arrows connect each tool to the stages it may support. Grapheme-to-phoneme models are shown as helping identify and resolve inconsistencies in orthography during script standardization. Morphological analyzers support analysis of word structure and linguistic variation. Image-to-text models, such as optical character recognition systems, assist in differentiating glyphs and reducing transcription time. Image generation models are shown as aiding font design by extrapolating full character sets. Language identification models support the sorting and classification of multilingual audio and text. Speech-to-text models, also known as automatic speech recognition systems, accelerate audio transcription. Text-to-speech models generate natural-sounding speech, including for accessibility purposes. Speech-to-speech translation models enable direct translation of spoken language. Forced alignment models support read-along materials by aligning audio with text. Traditional machine translation models expand datasets and enable cross-language access. Large language models are shown as assisting in annotating and expanding datasets for downstream applications.

This diagram depicts some of the most promising AI tools and techniques highlighted in the white paper that are currently being applied to ongoing efforts to bring digitally disadvantaged languages into the digital realm. The tools shown here — along a staged pathway for language digitization, from digitally disadvantaged, oral-only languages to languages with advanced digital affordances — are illustrative rather than exhaustive and represent possible applications that may support, but not replace, community-led language planning and governance decisions.

Much of this work rests on painstaking, often unglamorous groundwork. Real-world digitization is nonlinear, iterative, and full of unforeseen complications. As such, the figure offers a simplified vantage point on a far messier reality, highlighting the breadth of areas where AI tools can improve or scale workflows.

Authors

Mind the (Language) Gap: Mapping the Challenges of LLM Development in Low-Resource Language Contexts
Juan N. Pava, Caroline Meinhardt, Haifa Badi Uz Zaman, Toni Friedman, Sang T. Truong, Daniel Zhang, Elena Cryst, Vukosi Marivate, Sanmi Koyejo
Deep Dive | Apr 22
White Paper
This white paper maps the LLM development landscape for low-resource languages, highlighting challenges, trade-offs, and strategies to increase investment; prioritize cross-disciplinary, community-driven development; and ensure fair data ownership.
Caroline Meinhardt, Thomas Mullaney, Juan N. Pava, and Diyi Yang | How Can AI Support Language Digitization and Digital Inclusion?
Seminar | Apr 15, 2026 | 12:00 PM - 1:15 PM
What does digital inclusion look like in the age of AI? Over 6,000 of the world’s 7,000-plus living languages remain digitally disadvantaged.

Related Publications

Beyond DeepSeek: China's Diverse Open-Weight AI Ecosystem and Its Policy Implications

Caroline Meinhardt, Sabina Nong, Graham Webster, Tatsunori Hashimoto, Christopher Manning

Foundation Models | International Affairs, International Security, International Development | Deep Dive | Dec 16, 2025

Issue Brief

Almost one year after the “DeepSeek moment,” this brief analyzes China’s diverse open-model ecosystem and examines the policy implications of their widespread global diffusion.

Moving Beyond the Term "Global South" in AI Ethics and Policy

Evani Radiya-Dixit, Angèle Christin

Ethics, Equity, Inclusion | International Affairs, International Security, International Development | Quick Read | Nov 19, 2025

Issue Brief

This brief examines the limitations of the term "Global South" in AI ethics and policy, and highlights the importance of grounding such work in specific regions and power structures.

Yejin Choi’s Briefing to the United Nations Security Council

Yejin Choi

Quick Read | Sep 24, 2025

Testimony

In this address, presented to the United Nations Security Council meeting on "Maintenance of International Peace and Security," Yejin Choi calls on the global scientific and policy communities to expand the AI frontier for all by pursuing intelligence that is not only powerful, but also accessible, robust, and efficient. She stresses the need to rethink our dependence on massive-scale data and computing resources from the outset, and design methods that do more with less — by building AI that is smaller and serves all communities.
