
Improving Equity and Access to Non-English Large Language Models

The lessons learned from the fine-tuning and evaluation of Vietnamese LLMs could help broaden access to models beyond English speakers.


Large language models are well versed in Standard American English and a few other dominant world languages where training data is plentiful, but how do they perform with languages less well represented online?

Not very well, it turns out. 

Take Tamil, for example, a language spoken by over 78 million people and the official language of Sri Lanka and the Indian state of Tamil Nadu. When asked to write a poem in Venpa, a traditional Tamil style of metered poetry, ChatGPT produced an English version that captured the structure and phrasing typical of Venpa far better than its Tamil counterpart did, even though the form originated in Tamil.

How do we create LLMs that serve underrepresented languages and dialects, and how do we evaluate the performance of these LLMs?

This is precisely what Sanmi Koyejo, assistant professor of computer science at Stanford University and an affiliate of the Stanford Institute for Human-Centered AI, and Sang Truong, a PhD student in computer science at Stanford, set out to do when fine-tuning an LLM for Vietnamese. 

“The assumption is that English is the de facto standard for everything because it is particularly used in academic settings, and many of the builders are targeting U.S. and European usage, and the effect has been that the data skews toward particular types of English. The models are much less performant beyond that,” says Koyejo.

Read the full study: Crossing Linguistic Horizons: Finetuning and Comprehensive Evaluation of Vietnamese Large Language Models

 

In addition to creating an open-source Vietnamese LLM, the duo collaborated with scholars from Ho Chi Minh City University of Technology and Ontocord.ai to develop a comprehensive evaluation framework encompassing 10 common tasks and 31 metrics. Here Koyejo and Truong discuss the key findings of their work, published this spring on the preprint service arXiv.

What does the landscape of LLMs look like for non-English languages and Vietnamese, in particular?

Truong: The landscape of Vietnamese LLMs is still in an early stage; most of them are commercial models. We are actually the first to train high-quality open-source Vietnamese models and to rigorously assess their performance. Prior to our work, there wasn’t a very rigorous evaluation of Vietnamese models. The standard evaluation was based on question answering, mostly multiple-choice question answering, and we recognized that this wasn’t reflective of real-world use cases in modern-day life. As a consequence, the public doesn’t have much trust in the models and doesn’t know how to use them. There aren’t a lot of people using them in Vietnam.

Koyejo: There’s a real effect of losing trust. This has implications for accessibility and democratization of technology worldwide because people’s experiences end up being so bad that they don’t think this technology is for them. 

Are these models accessible to non-English speakers?

Truong: Models like GPT-4 and Gemini can’t be accessed in Vietnam easily. For example, you need to have a U.S. phone number to register for one of these models. You have to pay $20, which isn’t a lot here, but covers food for a week or two in Vietnam. It’s a significant barrier to usage.

Koyejo: Part of our goal was also collaboration with people in these countries to increase access and engagement. We know what that technology could look like in a local context. We have seen some early effects now that some of our tools are available to folks engaging and building models, including thousands of downloads on Hugging Face and enthusiasm for further development from academic institutions as well as industry. 

What are the risks if we don’t have quality LLMs in languages other than English? 

Truong: One of the risks I’m most worried about is that the LLMs we see are boosting the productivity of everyone in English-speaking countries. But countries that don’t have LLMs will experience lags in productivity and be slower to participate in these technological revolutions, which can set back the economic progress of an entire country. We saw this before when certain countries didn’t have access to diesel engines and their industries lagged behind.

Koyejo: Our broader goal is the democratization of technology. The goal is to find these anchors that capture what is thought to be hard about modeling language, and hopefully find examples of where we can make it better. Vietnamese is an anchor because of its style and linguistic characteristics, which make it different from, say, a Latin-based language. Our work allows for this coverage, such that long term we’re just better at solving this kind of problem and not leaving many languages behind.

What challenges arise when trying to fine-tune an LLM to be useful for other languages?

Truong: Every language has its own unique structure, which means you have to curate a diverse and high-quality dataset to capture the intricacies of the language. We learned that the dataset needs to be very clean: free of toxicity, grammatical errors, and inconsistencies. The other challenge is selecting the appropriate base model – LLaMA, in this case – and fine-tuning technique. There are many base models out there that you can fine-tune from. As far as datasets, one of the interesting features of Vietnamese Wikipedia is that it is a paired dataset, so the model sees the data in English and also in English-Vietnamese pairs, and therefore learns some translation. Those sources may end up fine-tuning faster. This is a hypothesis and intuition that we have coming out of this research: that paired datasets are beneficial when fine-tuning non-English models.
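The paired-dataset idea above can be sketched in a few lines. This is a hypothetical illustration of the general technique, not the paper's actual data pipeline: the function name, prompt wording, and record format are our own. Each English-Vietnamese pair is turned into two translation-style fine-tuning examples, one in each direction, so the model sees both languages side by side.

```python
# Hypothetical sketch of building bidirectional fine-tuning examples
# from paired English-Vietnamese text (format and names are illustrative).

def make_pair_examples(pairs):
    """Build prompt/response records from (english, vietnamese) sentence pairs."""
    examples = []
    for en, vi in pairs:
        # English -> Vietnamese direction
        examples.append({
            "prompt": f"Translate to Vietnamese: {en}",
            "response": vi,
        })
        # Vietnamese -> English direction ("Dịch sang tiếng Anh" = "Translate to English")
        examples.append({
            "prompt": f"Dịch sang tiếng Anh: {vi}",
            "response": en,
        })
    return examples

pairs = [("Hello, how are you?", "Xin chào, bạn khỏe không?")]
for ex in make_pair_examples(pairs):
    print(ex["prompt"], "->", ex["response"])
```

Because every pair yields examples in both directions, the model is exposed to the alignment between the two languages rather than to Vietnamese in isolation, which is the intuition behind the faster fine-tuning the researchers hypothesize.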

What were the main findings of your research?

Koyejo: In the process of training LLMs, one of the first choices to make is: How do I choose to represent language in a format that the computer can understand? This is called the tokenizer. This is a pre-processing step that is underappreciated in English because there are out-of-the-box tools that take care of it for you, but beyond English, this pre-processing tokenization step is important. We found that doing this first step well played a crucial role in the overall performance. This seems to be particularly true beyond English. 
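One way to see why this pre-processing step matters more beyond English: byte-level vocabularies, a common starting point for tokenizers, represent diacritic-heavy Vietnamese text with noticeably more raw units than visually comparable English. The snippet below is our own minimal illustration, not the tokenizer the researchers built.

```python
# Minimal illustration (not the paper's tokenizer): a byte-fallback
# vocabulary starts from more units for Vietnamese than for English,
# because Vietnamese diacritics take 2-3 bytes each in UTF-8.

def utf8_units(text: str) -> int:
    """Number of byte-level units a byte-fallback tokenizer starts from."""
    return len(text.encode("utf-8"))

english = "language model"        # ASCII: one byte per character
vietnamese = "mô hình ngôn ngữ"   # "language model" in Vietnamese

print(f"{english!r}: {len(english)} chars, {utf8_units(english)} bytes")
print(f"{vietnamese!r}: {len(vietnamese)} chars, {utf8_units(vietnamese)} bytes")
```

Unless the tokenizer's vocabulary learns to merge those byte sequences into whole Vietnamese syllables, the model spends capacity and context length on fragments of characters, which is one concrete reason getting this first step right improved overall performance.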

Truong: Another significant finding is that bigger models do not always guarantee better performance. Instead, the performance of an LLM is heavily dependent on the quality and relevance of the data it has been trained on. Bigger models might exhibit more biases than smaller ones. Our research also suggests that building a foundational LLM may not require an extensive amount of data, provided that proper fine-tuning techniques are used. This is due to the model’s ability to transfer knowledge across languages, leveraging the pre-existing linguistic patterns and structures learned from other languages.

What are the implications of this study on LLMs for other languages, particularly those in the Global South?

Truong: By providing a recipe for fine-tuning a wide range of language models for foreign languages, this research opens up new possibilities for developing robust and effective LLMs in underrepresented languages. One of the key contributions of this study is the development of a comprehensive evaluation framework for assessing the performance of these models. The evaluation methodology presented in this study can serve as a valuable template for researchers working on LLMs in other languages.

Koyejo: This research expands the space of people who feel like they can engage and meaningfully think about the ways this kind of technology can benefit them locally. The community building in engaging with people who are primary speakers of the language of this technology was important as well. And that mode of building tools that have cultural sensitivities built in and can work well in local contexts is salient. The better we can do this, the better the field’s linguistic diversity and inclusion will be.
