Role of Language Communities in the Era of Large Language Models

Large Language Models (LLMs) such as GPT, PaLM, and Galactica have revolutionized natural language processing by enabling machines to generate coherent and contextually relevant text across many languages. However, the benefits of these models are disproportionately distributed, favoring high-resource languages with extensive digital corpora. This article discusses the critical role of language communities in supporting under-resourced and endangered languages in the context of LLM development and deployment, and explores collaborative strategies involving researchers, policymakers, language activists, and technologists to promote linguistic diversity in AI systems.

Availability of Large Language Models for Selected Languages

Most publicly available LLMs are trained predominantly on large-scale datasets drawn from widely spoken languages such as English, Mandarin Chinese, Spanish, and French. Consequently, many under-resourced languages, particularly Indigenous, minority, and endangered languages, remain underserved. Scarce digital text corpora, limited linguistic resources, and a lack of annotated datasets hinder the development of effective LLMs for these languages.

The low-resource language problem poses significant risks for digital inequality and cultural erosion, as linguistic communities without adequate technological support face marginalization in the digital era. It is imperative that language communities, together with AI researchers and policymakers, address these challenges through concerted efforts in data collection, curation, and model development.

Documentation and Ethical Data Collection

One of the foundational steps toward supporting under-resourced languages is thorough linguistic documentation. This includes recording oral traditions, collecting written materials, and conducting interviews with native speakers who have given informed consent. Ethical data collection practices must prioritize the autonomy, privacy, and cultural sensitivities of language communities to avoid exploitation or misappropriation of linguistic heritage.

Language documentation projects often identify essential vocabulary, phrases, and contextual usage relevant for everyday communication and specialized domains. These data serve as valuable resources for building language models, lexicons, and translation systems. Initiatives such as the Endangered Languages Project and UNESCO’s language revitalization programs provide frameworks and best practices for such work.
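
As a concrete illustration, the sketch below shows one way documented utterances could be stored so that consent information travels with the data. The UtteranceRecord schema and its field names are hypothetical assumptions, not drawn from any established documentation standard.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical schema for one documented utterance; field names are
# illustrative, not part of any established documentation standard.
@dataclass
class UtteranceRecord:
    text: str                 # transcription in the target language
    translation: str          # gloss in a contact language
    language_code: str        # e.g. an ISO 639-3 code such as "mri" (Māori)
    domain: str               # e.g. "everyday", "legal", "medical"
    speaker_id: str           # pseudonymous ID; never the speaker's name
    consent_recorded: bool    # informed consent confirmed before archiving
    consent_scope: str        # e.g. "research-only" or "open-release"
    audio_path: Optional[str] = None  # link to the source recording, if any

record = UtteranceRecord(
    text="Kia ora koutou katoa",
    translation="Greetings to you all",
    language_code="mri",
    domain="everyday",
    speaker_id="spk-0042",
    consent_recorded=True,
    consent_scope="research-only",
)
```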

Improving Data Quality and Quantity for All Languages

Enhancing the data available for under-resourced languages requires multifaceted approaches, including community participation, crowdsourcing, and partnerships with academic institutions. Open-source platforms and accessible tools empower speakers to actively contribute linguistic data, improving both its quantity and quality. Moreover, digitizing archival texts and oral recordings expands the corpora available for LLM training.
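
A minimal quality pass over crowdsourced contributions might look like the following sketch. The normalization step and thresholds are illustrative assumptions; a production pipeline would add language identification, spelling review, and checks by fluent speakers.

```python
import unicodedata

def clean_contributions(sentences):
    """Normalize, filter, and deduplicate crowdsourced sentences.

    A minimal sketch of a quality pass; the length threshold is an
    arbitrary example, not a recommended value.
    """
    seen = set()
    cleaned = []
    for s in sentences:
        s = unicodedata.normalize("NFC", s.strip())  # consistent Unicode form
        if len(s.split()) < 3:        # drop fragments too short to be useful
            continue
        key = s.casefold()
        if key in seen:               # drop exact duplicates
            continue
        seen.add(key)
        cleaned.append(s)
    return cleaned

corpus = clean_contributions([
    "Kia ora koutou katoa ",
    "kia ora koutou katoa",   # duplicate after normalization
    "ok",                     # too short to keep
])
print(corpus)  # ['Kia ora koutou katoa']
```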

Even high-resource languages face challenges adapting to evolving terminologies, such as new scientific and technical terms. Language communities play a vital role in coining, translating, and standardizing these terms. Traditionally, official language bodies or academic institutions have led this work. However, with LLMs generating multiple possible translations, ensuring the accuracy and consistency of official terminology has become more complex.

Ensuring Official and Consistent Translations with LLMs

As LLMs generate diverse outputs, including multiple translations for the same term or phrase, language communities must establish mechanisms to validate and endorse official translations. This validation process is critical in specialized fields where precision matters, such as law, medicine, and technology. Collaboration between linguistic authorities, translators, and AI developers is essential to embed such standards into model training and fine-tuning phases.
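
One lightweight validation mechanism is to check model output against a glossary of endorsed terms before publication. In the sketch below, the glossary entries and the check_terminology function are placeholders; a real glossary would be maintained by the relevant language authority.

```python
# Minimal sketch of validating LLM output against an endorsed glossary.
# Glossary entries are placeholders, not real official terms.
APPROVED_TERMS = {
    "vaccine": "endorsed-term-1",       # placeholder target-language terms
    "jurisdiction": "endorsed-term-2",
}

def check_terminology(source_text: str, translation: str) -> list[str]:
    """Flag glossary terms in the source whose endorsed translation
    is missing from the candidate translation."""
    warnings = []
    src = source_text.casefold()
    out = translation.casefold()
    for term, endorsed in APPROVED_TERMS.items():
        if term in src and endorsed.casefold() not in out:
            warnings.append(
                f"expected endorsed rendering of '{term}': '{endorsed}'"
            )
    return warnings
```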

Feedback from language communities can inform iterative improvement of LLMs. This feedback may include identifying errors, suggesting preferred lexical choices, and highlighting culturally appropriate language usage. Incorporating community input during model fine-tuning helps ensure that LLMs respect linguistic norms and contribute positively to language preservation and evolution.
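
As an illustration, community corrections could be converted into supervised fine-tuning pairs. The JSONL prompt/completion layout below is one common convention; field names vary across training frameworks, and the records shown are placeholders.

```python
import json

# Turn reviewed community feedback into fine-tuning pairs.
# The ellipses are placeholders for real collected text.
feedback = [
    {
        "source_prompt": "...",          # input that produced the output
        "model_output": "...",           # what the LLM originally generated
        "community_correction": "...",   # wording preferred by reviewers
    },
]

with open("finetune_data.jsonl", "w", encoding="utf-8") as f:
    for item in feedback:
        if item["community_correction"]:   # keep only corrected examples
            f.write(json.dumps(
                {"prompt": item["source_prompt"],
                 "completion": item["community_correction"]},
                ensure_ascii=False,        # preserve non-ASCII characters
            ) + "\n")
```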

Collaboration Between Language Communities and LLM Developers

Successful integration of under-resourced languages into LLMs demands ongoing collaboration between language communities, AI developers, and policymakers. AI companies must engage with communities transparently, respecting data sovereignty and intellectual property rights. Participatory approaches, such as citizen science and crowdsourcing, can harness collective knowledge and foster empowerment.

Institutional support through funding, policy frameworks, and infrastructure is equally vital. International organizations such as UNESCO and the United Nations advance language preservation agendas that can align with technological development, while reference works such as Ethnologue (published by SIL International) document the world's languages. National language academies and educational institutions play key roles in standardization and resource creation.

Data Governance, Licensing, and Ethical Considerations

Data governance frameworks must prioritize ethical concerns, including informed consent, privacy, and cultural sensitivity. Open licenses such as those from Creative Commons facilitate the sharing of linguistic data while protecting contributors' rights. At the same time, mechanisms for community control over data use, sometimes referred to as data sovereignty, are increasingly recognized as essential for respecting Indigenous and minority groups' interests.
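
In practice, such governance terms can be enforced mechanically at training time. The sketch below assumes records carry the consent and license fields from the hypothetical schema earlier in this article; the license allow-list is an example policy, not a recommendation.

```python
# Example governance filter; the allow-list and field names are
# assumptions carried over from the hypothetical schema above.
ALLOWED_LICENSES = {"CC-BY-4.0", "CC-BY-SA-4.0"}

def eligible_for_training(record: dict) -> bool:
    """Admit a record only if its license is allow-listed and the
    contributor consented to open release."""
    return (
        record.get("license") in ALLOWED_LICENSES
        and record.get("consent_scope") == "open-release"
    )

dataset = [
    {"text": "...", "license": "CC-BY-4.0", "consent_scope": "open-release"},
    {"text": "...", "license": "proprietary", "consent_scope": "research-only"},
]
training_corpus = [r for r in dataset if eligible_for_training(r)]
# Only the first record passes the governance filter.
```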

Ensuring transparency about how data are used, stored, and integrated into LLMs is fundamental to building trust between developers and language communities. Ethical AI principles, such as those articulated in the ethics guidelines of intergovernmental bodies and major research organizations, reinforce the need for inclusivity and fairness in language technologies.

Conclusion

Language communities are indispensable partners in advancing equitable and effective Large Language Models. Their participation in documentation, data curation, validation, and feedback mechanisms ensures that LLMs reflect linguistic diversity and cultural richness. Through collaborative efforts involving researchers, policymakers, language activists, technologists, and international institutions, it is possible to mitigate digital inequalities and preserve endangered languages in the era of AI.

References

  1. Bender, Emily M., et al. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, ACM, 2021, pp. 610–623.
  2. Trott, Sean, et al. “Do Large Language Models Know What Humans Know?” Cognitive Science, vol. 47, no. 7, July 2023, p. e13309.
  3. “Citizen science.” Wikipedia, Wikimedia Foundation, https://en.wikipedia.org/wiki/Citizen_science.
  4. Endangered Languages Project. https://www.endangeredlanguages.com/.
  5. UNESCO. Language Revitalization Ecosystems: Best Practices and Lessons Learned to Preserve, Revitalize and Promote Indigenous Languages.
  6. Creative Commons Licenses. https://creativecommons.org/licenses/by/4.0/.
  7. “AI ethics.” Wikipedia, Wikimedia Foundation, https://en.wikipedia.org/wiki/AI_ethics.
  8. Ethnologue: Languages of the World. https://www.ethnologue.com.