How Can AI Support Language Digitization and Digital Inclusion?

This white paper analyzes the various ways AI tools can advance language digitization work and provides recommendations for responsibly realizing the potential of AI in supporting the digital inclusion of digitally disadvantaged languages.
Executive Summary
In the wake of rapid AI development, attention is increasingly being drawn to the fact that most AI systems fail to serve most of the world’s linguistic communities. Data scarcity is often highlighted as a key reason, yet there are much more basic digital foundations that are prerequisites for building AI training datasets.
Over 6,000 of the world’s 7,000-plus living languages remain digitally disadvantaged, meaning that they are unsupported across mainstream devices, operating systems, browsers, and applications. Language communities excluded from digital systems can participate only minimally in a world increasingly mediated by technology and are at the same time unable to generate enough data to be represented in AI.
Empowering digitally disadvantaged language communities to participate in today’s digital world requires holistic progress on a set of foundational language tools (from script encoding to keyboard layouts) and supporting language tools (from grammar checkers to accessibility features).
A global network of language practitioners, scholars, and grassroots groups has been working tirelessly to create and sustain these language tools. Yet progress is often slow and uneven amid chronic underfunding and a lack of coordination.
AI has the potential to scale and accelerate language digitization. In recent years, scholars have begun leveraging AI — and especially natural language processing tools — to sidestep major bottlenecks in the field:
In the early stages of language digitization, AI tools such as grapheme-to-phoneme systems, morphological analyzers, optical character recognition systems, and image generation models can assist with script development and foundational language infrastructure tooling.
Once a language can effectively be rendered on devices, AI tools such as language identification models, optical character recognition systems, and automatic speech recognition systems can support language transcription and broader documentation and data collection processes.
In the final stages of language digitization, AI tools such as machine translation, grammar- and spell-checking systems, text-to-speech systems, forced alignment tools, and large language models are increasingly the foundation for supporting digital tools that help ensure true digital inclusion.
It is important to note that not all language communities may choose to develop a writing system for their language. Technical approaches are emerging that enable the creation and use of digital tools for spoken-only languages.
While all these nascent efforts are promising, AI alone cannot address the field’s more fundamental research problems, workflow bottlenecks, and adoption challenges. Language digitization is also an inherently community-centric process that requires a deeply sensitive cultural and linguistic understanding. Much of the work in this field should thus continue to be driven by the language communities themselves, with AI as an accompanying tool.
Additional work, time, and resources need to be invested in harnessing AI for language digitization in a way that centers communities and their individual needs and contexts. We outline detailed recommendations for different stakeholders to work together to advance language digitization and digital inclusion in the age of AI, including:
Building trust and empowering communities by fostering community-engaged convenings and collaborations, and building community-driven benchmarks and standards for digital language tools.
Strengthening research foundations by creating reliable resources to track progress on digitally disadvantaged languages; investing in, expanding, and evaluating effective AI tools for language digitization; and creating forums for interdisciplinary exchanges.
Improving workflows by moving to parallel workflows for language digitization and leveraging AI for organizational improvements.
Forming coalitions by implementing mechanisms to reform incentive structures surrounding language tool adoption and strengthening storytelling for general audiences.
Ensuring cultural sustainability by empowering culturally aware AI development and promoting contextualized adoption and impact assessment.
Visualizing the potential role of AI in language digitization
This diagram depicts some of the most promising AI tools and techniques highlighted in the white paper that are currently being applied to ongoing efforts to bring digitally disadvantaged languages into the digital realm. The tools shown here — along a staged pathway for language digitization, from digitally disadvantaged, oral-only languages to languages with advanced digital affordances — are illustrative rather than exhaustive and represent possible applications that may support, but not replace, community-led language planning and governance decisions.
Much of this work rests on painstaking, often unglamorous groundwork. Real-world digitization is nonlinear, iterative, and full of unforeseen complications. As such, the figure offers a simplified vantage point on a far messier reality, highlighting the breadth of areas where AI tools can improve or scale workflows.








