A roadmap for AI that speaks the world’s languages

This piece is written with Andrew Bredenkamp and Han Sheng Chia and is cross-posted at the Center for Global Development.

For AI to benefit all people, it will need to speak their languages and understand their worlds. AI works well in English, French, Spanish, Hindi, and Chinese, but that leaves 4 billion people whose native tongues and local conditions are not well represented. They are being left behind.

Status quo

Based on official technical statistics, one might think that AI models already support most of the world's languages (for example, Meta claims its automatic speech recognition covers 1,600 languages). But those statistics have not translated into reliable performance on the ground in local communities. There are countless reports of language AI failures: mistranslated agricultural or medical advice, or an Indian chief minister falsely reported dead. AI systems may support a language in a technical sense yet not enable useful services for the people who speak it, such as health, agricultural, or business advisors. And the problem is not only language: to be useful, systems must also understand local context. Today's systems don't know which medical treatments are available in different communities, or which ways of resolving conflicts are locally acceptable.

The core problem is a mismatch: tech firms and their metrics are centralized, but localization is a decentralized problem. Only local communities can assess whether an AI system speaks their language and knows their context. And improving AI must also be decentralized: it will require contributions from dispersed local communities around the globe.

Some of this work is happening. Language nongovernmental organizations, such as Masakhane and AI4Bharat, have emerged to collect raw data from dispersed communities; they both conduct research and help gather datasets. But not enough data has been collected, so organizations that serve low-income populations with AI typically gather their own local data for a specific use case and adapt their own custom models. For example, Jacaranda Health, a Kenya-based nonprofit focused on pregnant women and newborns, adapted its service using a proprietary corpus of over a million real-world maternal and newborn health question-and-answer pairs. Such data may be enough to make a system usable for one use case, but this is costly, arduous work. And because the data is typically collected for internal purposes, its benefits may not extend to the wider ecosystem. Even when the data is shared, it is usually not validated well enough to help other organizations or to be incorporated into AI foundation models themselves. Ad hoc, fully decentralized data collection is not a sustainable solution.

We need a better system that integrates central and decentralized players to make sure AI can reach the diverse populations of the world.

First principles

Several first principles should guide any solution:

  1. Reaching languages is not enough; our goal must be to reach people with services. How people use language across locations and domains is complex. Just as US English differs from UK English, a given language may be spoken differently in different communities: inland Swahili in the Democratic Republic of the Congo (DRC) differs from coastal Swahili in Kenya. Many people are also multilingual and commonly code-switch between languages, and those patterns can depend on the domain. A Rwandan mother, for example, may be most comfortable speaking Kinyarwanda but use French words when discussing health. Providing her with a health service also requires knowing her community's context: what health conditions are common, what treatments are available, and how she might seek social support from friends and family. If we want AI models that are actually useful, our goal must be services that people adopt and use. Given scarce resources, we should strategically prioritize gathering the data that will make beneficial services viable for large groups of people.
  2. Only locals know their language and context. Silicon Valley won't be able to validate whether a system is intelligible to farmers in rural DRC or schoolchildren in Haiti. Only local populations can ensure that translations meet their needs. Validation and data collection will thus both need to be decentralized: not just to tech hubs like Bangalore and Nairobi, but to the diverse communities themselves.
  3. Some data is more valuable than other data. Many organizations quantify data with simple metrics: number of documents, number of words, or hours of recorded speech. But data is more valuable when it covers the use cases that are most socially impactful, such as health or agriculture, in the modality that is most accessible, such as voice for groups who struggle with written communication. We also need to ensure that diverse users are represented: it may be easier to record the speech of young urban men, but their data goes only part of the way towards understanding how an elderly rural woman sounds over a broken telephone connection. Any solution will need to prioritize what data to collect and validate that it actually enables applications that work for people.
  4. Language data should be shared. If a person shares a bicycle with another, only one can use it at a time. Data is different: if one organization shares data with another, both can use it at the same time; nothing is lost. In economic terms, data is nonrival. Consumers benefit when nonsensitive language data is widely shared: when the same data is incorporated into many services, they can choose among a model developed by a local implementer, a variety of small regional models, and large models trained by international frontier labs. But if organizations charge for access to data, it won't be incorporated into every model, and consumers who speak minority languages will continue to be excluded from some AI advances. Moreover, in modern AI systems, gathering one type of data tends to improve performance on other, seemingly unrelated tasks: data on farming conversations will also improve a model's ability to converse about health. That increases the benefits of sharing even further. We need to ensure that language data is not only gathered but also made widely accessible.
  5. Private companies may not reach the world's lowest-income people quickly enough. AI companies express optimism about the technology's potential for the world's poor and have made many linguistic advances. But they alone are unlikely to cross the last mile to the world's most disadvantaged residents soon enough. These communities may never be a large source of revenue, even for local AI companies. And if one company invests in collecting language data and then shares it widely, its competitors benefit too (in economic terms, a free-rider problem). Collecting enough data will require coordination.

Local data as a market design problem

Between governments, philanthropies, AI labs, and multilaterals, there is enough interest to make serious progress on delivering AI that works for all. Doing so, however, will require moving from ad hoc projects to a coordinated system. Market systems have been designed successfully for other vital public goods through prizes or advance market commitments, which secure central funding and disburse it to entities that produce useful outputs. For example, a consortium's committed funds spurred a market that delivered pneumococcal conjugate vaccines to millions of children.

Local data provision can also be set up as a market, in which contributors are rewarded for providing local data that is then shared and incorporated into services. This would create a virtuous cycle: as the underlying AI models get better at one language and task, people use them more, generating more data and making the models better at others.

We do not yet have a complete design for such a market. But the properties of the localization problem suggest that it should have several components:

  1. Validation that is objective and decentralized. Since central authorities cannot fully vet contributions, validation will have to rely on members of local communities. Subjective ratings could be gathered through a form of peer review. Such a system could be undermined if people collude to approve each other's contributions regardless of their quality. It can be made more robust by requiring review from multiple people in culturally similar but geographically dispersed communities, and by monitoring the quality of the validators themselves.
  2. Payment based on how well data enables use cases. A key design question is how to price contributions: how valuable is the data? A central authority will struggle to know, both because vetting contributions is hard and because it is hard to predict how much a contribution will improve the services available to a community. One possibility is an objective system tied directly to usage: a contributor is rewarded if the data they provided enables many more people to access services. However, tracing how a data contribution is incorporated into model training runs and affects the downstream quality of services is a challenging technical problem. Further thought is needed here.
  3. Privacy guarantees. Some data is clearly sensitive, such as medical conversations and chat histories, and an audio recording of a person's voice may be enough to personally identify them. Any system must ensure that individuals' privacy is adequately protected and that the people represented understand the tradeoffs and consent to their data being included. Technical measures can help: for example, automatically scanning text to remove sensitive information, or separating the style of a voice from its underlying content. We will need to develop systems and guarantees that protect privacy.
  4. Feedback loops. Once systems have users, their input provides additional data: the language they use to phrase a question, and how they reply when given a response. Tech firms refine their systems by learning from this usage data. If it could be shared across use cases, it would enable faster learning and a cycle of increasing competency. Language data might therefore be shared with providers under an agreement that they return usage data (such as users' requests) to the system. Doing this in a privacy-preserving way will also require careful thought.
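To make the decentralized-validation idea in point 1 concrete, here is a minimal sketch in Python. All names and thresholds (the quorum size, the community rule, the reliability tracking) are illustrative assumptions, not a proposed standard: a contribution is accepted only if enough reviewers approve it, those approvals must span distinct communities to resist collusion, and each reviewer's reliability is tracked by how often they agree with the consensus outcome.

```python
from collections import defaultdict

QUORUM = 3            # assumed: minimum number of approvals
MIN_COMMUNITIES = 2   # assumed: approvals must span distinct communities

def review_contribution(votes):
    """Accept a data contribution only on a dispersed quorum of approvals.

    votes: list of (reviewer_id, community_id, approved: bool).
    """
    approvals = [(r, c) for r, c, ok in votes if ok]
    communities = {c for _, c in approvals}
    return len(approvals) >= QUORUM and len(communities) >= MIN_COMMUNITIES

def update_reliability(scores, votes, accepted):
    """Track how often each reviewer agreed with the consensus outcome."""
    for reviewer, _, ok in votes:
        agree, total = scores[reviewer]
        scores[reviewer] = (agree + (ok == accepted), total + 1)

# Hypothetical review round for one Swahili health-advice transcript.
scores = defaultdict(lambda: (0, 0))
votes = [("r1", "kenya-coast", True),
         ("r2", "tanzania-coast", True),
         ("r3", "kenya-coast", True),
         ("r4", "kenya-inland", False)]
accepted = review_contribution(votes)   # True: 3 approvals, 2 communities
update_reliability(scores, votes, accepted)
```

A real system would add many refinements (weighting votes by past reliability, auditing validators with known-answer items), but the quorum-across-communities rule captures the anti-collusion intuition in the text.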
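Point 2's usage-tied payment could take many forms; the following sketch shows only the simplest: splitting a reward budget in proportion to the service usage each contribution is credited with enabling. The function name and the attribution numbers are hypothetical, and the hard part the text identifies (attributing downstream usage to individual contributions in the first place) is taken as given here.

```python
def allocate_rewards(budget, usage_by_contribution):
    """Split a reward budget proportionally to attributed usage.

    usage_by_contribution: contribution id -> units of service usage
    credited to that contribution (attribution itself is the open
    technical problem noted in the text).
    """
    total = sum(usage_by_contribution.values())
    if total == 0:
        return {k: 0.0 for k in usage_by_contribution}
    return {k: budget * v / total for k, v in usage_by_contribution.items()}

# Hypothetical example: a $1,000 round split across three contributions.
payouts = allocate_rewards(1000, {"clinic_qa": 600,
                                  "farm_calls": 300,
                                  "radio_corpus": 100})
```

Proportional splits are easy to audit but reward popular use cases over neglected ones; a deployed design might instead weight payments toward underserved languages or use marginal-contribution measures.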
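And for point 3, the "automatically scanning and removing sensitive information" step can be illustrated with a toy redaction pass. The two regex patterns below are deliberately crude assumptions for illustration; a production system would need locale-aware, audited PII detection covering names, ID numbers, locations, and more.

```python
import re

# Illustrative patterns only; real PII detection is far broader than this.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

def redact(text):
    """Replace detected identifiers with typed placeholders before sharing."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

redacted = redact("Call me at +254 712 345678 or write to mama@example.com")
# The shared transcript keeps the conversational content but drops contacts.
```

Placeholder tokens like `[PHONE]` preserve sentence structure for model training while removing the identifying value, which is why typed placeholders are often preferred over plain deletion.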


As is clear, some of these components require new technical solutions. This is a call both to economists, engineers, and computer scientists to work on the underpinnings of such a market, and to the sector to think systematically about developing the language capabilities needed to reach the world.

When AI doesn’t speak a person’s language, they may still be exposed to downsides of AI without receiving the benefits. AI is affecting the lives of many around the world. It is time to make sure it works for all.

