We Can Do This! Small Language Models and Minority Language AI Translation

Dr. Gilles Gravelle, Executive Director, Moving Missions

My first Technolink article on AI generative translation and minority languages appeared in the Spring 2023 issue.[i] That article explained how AI natural language processing and the current generation of neural-network deep learning have made it feasible for minority languages to receive machine translations of essential information that has been largely missing from their everyday lives. This missing information leads to missed opportunities and exploitation. It can hinder health, educational development, and economic advancement, while leaving people in unsafe situations. This article discusses the advantages small language models have over large language models for minority language translation.

The Advantage of Small Language Models (SLMs)

Google Translate and OpenAI’s ChatGPT are large language models (LLMs). In fact, they are among the largest LLMs for generative AI. Google claims to have achieved a 1.6-trillion-parameter model. Parameters are what the machine learns from all the data used to train the model. ChatGPT’s model has 175 billion parameters. Such large data sets and parameter counts are needed to generate translations between hundreds of languages on nearly unlimited topics. However, no matter how large an LLM may be, it can’t do minority language translation because it hasn’t been trained to do so. Even so, Meta’s No Language Left Behind is experimenting with building generative models to translate across 200 low-resource languages,[1] with a goal of providing machine translation for most of the world’s languages.[ii]
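To put those parameter counts in perspective, here is a minimal back-of-the-envelope sketch. The 1.6-trillion and 175-billion figures come from this article; the 100-million-parameter figure for a minority-language SLM is an illustrative assumption, as is storing each weight in 2 bytes (16-bit precision).

```python
def fp16_size_gb(parameters: int) -> float:
    """Approximate memory needed just to store the model's weights,
    assuming 2 bytes (16-bit floating point) per parameter."""
    return parameters * 2 / 1e9

# Parameter counts: the first two are cited in the article,
# the SLM figure is a hypothetical order-of-magnitude example.
models = {
    "Google (claimed)": 1_600_000_000_000,   # 1.6 trillion
    "ChatGPT (GPT-3)": 175_000_000_000,      # 175 billion
    "Hypothetical SLM": 100_000_000,         # 100 million (assumed)
}

for name, n in models.items():
    print(f"{name}: {fp16_size_gb(n):,.1f} GB of weights")
```

Even before any training data is considered, the weights alone differ by four orders of magnitude, which is where the SLM's savings in compute and cost originate.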

With the power of today’s deep learning algorithms and neural network models, minority languages with little to no data for AI to draw from are not actually at a disadvantage. According to Avodah Inc.’s former chief science officer, Trevor Chandler, large language models require large amounts of data because they have to do so much beyond communicating in the languages they were trained on: they also summarize, extract, and answer questions. But the huge amount of data needed to support features such as translation, analysis, and summarization can lead to unpredictable, incorrect, or even made-up “hallucinated” results. A hallucination is an output that is nonsensical or factually inaccurate.[iii] According to Chandler, the key is limiting the reference data the AI model accesses to produce a translation. In this case, less is better.[iv]

To translate information into a minority language, the machine only needs to learn one language. It is not learning one language and then many others in order to communicate across all of them. So training the machine goes quickly for a single language. And instead of having to filter out unwanted data, as with an LLM, the data set of an SLM improves as it grows, because the model learns only what you provide.

Native Speakers as Machine Trainers 

It’s not unusual for a translation generated by Google Translate or ChatGPT to sound unnatural or even wrong while still being comprehensible to a native speaker. Machine translation across many languages struggles with complicated grammatical relationships, nuance, and the conceptual meaning units of natural language. The beauty of SLMs for minority language translation is that native speakers can train the machine to speak their language more naturally. Working with an AI technologist, the team translates a variety of carefully curated content into the target language. The goal is not to add large amounts of data for translation. It is to choose the right variety of texts and genres so the machine can learn vocabulary and grammatical relationships and understand how meaning works across different genres.

Advantages of SLM/Native Speaker Training

The training process uses a smaller number of parameters, meaning fewer variables to learn language patterns and context. It can get by with smaller data sets for training the machine, and it requires far less computational power and training time, providing efficiency, speed, and cost savings. Just as important, native speakers review and edit the translation output, producing improved generative translations. It’s not unusual to achieve 80-90% quality soon after producing the first generated translations.[v]

Applications

A team of 18 native speakers of a minority language dialect in Burkina Faso is building an AI SLM to produce a translation of the Bible for their people. They translate about 1,000 verses from different parts of the Bible, from the source language into their own language, to train the machine on Bible vocabulary and grammar. They also translate non-biblical content to add further natural language understanding to the model. Iterative editing of the translation output, with feedback from an expanded group of native speakers, helps the machine learn to produce more accurate and natural-sounding translations, quickly reaching a quality level that LLMs struggle to achieve.
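The iterative review-and-edit cycle described above can be sketched as a simple loop. Everything here is illustrative: `machine_translate` is a stub standing in for the team's actual SLM, the verse reference is only an example, and the real workflow would feed the corrected pairs back into model training.

```python
# Growing training set of (source verse, approved translation) pairs.
training_pairs = []

def machine_translate(source: str) -> str:
    # Placeholder for the real SLM's draft output.
    return f"[draft translation of: {source}]"

def review_cycle(source_verses, native_speaker_edit):
    """One pass of the loop: draft each verse, let a native speaker
    correct the draft, and keep the corrected pair as new training data."""
    for verse in source_verses:
        draft = machine_translate(verse)
        approved = native_speaker_edit(draft)
        training_pairs.append((verse, approved))
    return training_pairs

# Example pass: the "editor" simply marks the draft as reviewed.
pairs = review_cycle(["Genesis 1:1"], lambda draft: draft + " (edited)")
```

Each pass through the loop enlarges the curated data set, which is exactly why the SLM's quality climbs quickly: every item it learns from has been approved by a native speaker.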

Once the machine-training phase is complete, the team can easily translate other essential topics based on the people’s circumstances, such as health and safety, financial literacy, civic responsibilities, cultural awareness, life skills, and emergency preparedness.[vi]

Faster, Cheaper, and Better

A full Bible can be translated within two to three years at a cost of about $350,000 to $500,000. Compare this to manual translation, which can take at least 15 years at an estimated cost of $2 million to $3 million. Once a minority language group has a fluent, natural-sounding AI translation tool, the cost of producing other essential information is small compared to the development stage. That opens the door to receiving more essential, life-changing information.
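Taking midpoints of the ranges cited above (an assumption for illustration; the article gives only ranges and a 15-year minimum), the rough savings can be computed directly:

```python
# Midpoints of the cost ranges cited in the article (assumed for illustration).
ai_cost = (350_000 + 500_000) / 2          # AI-assisted translation
manual_cost = (2_000_000 + 3_000_000) / 2  # manual translation

ai_years = 2.5       # midpoint of the 2-3 year range
manual_years = 15    # stated minimum for manual translation

cost_ratio = manual_cost / ai_cost
time_ratio = manual_years / ai_years

print(f"Manual translation costs roughly {cost_ratio:.1f}x more "
      f"and takes roughly {time_ratio:.0f}x longer.")
```

Under these assumptions the AI-assisted approach is roughly six times faster and five to six times cheaper, before counting the near-zero marginal cost of translating additional material afterward.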

Gilles Gravelle, PhD.

[1] A low-resource language lacks the data needed to generate a good machine translation, if any translation at all.

[i] https://mailchi.mp/a9f6cc55c85c/technolink-association-9901434?e=88d50acf1f

[ii] https://ai.meta.com/blog/nllb-200-high-quality-machine-translation/

[iii] https://www.ibm.com/topics/ai-hallucinations

[iv] Personal conversation with Trevor Chandler, Avodah Inc. (avodah.com).

[v] Gilles Gravelle. 2023. “Today’s AI NLP – A Game Changer for Bible Translation.” Paper presented at the 2023 BT Conference, Dallas, TX.

[vi] This list was generated by ChatGPT when asked what essential information for people is.

 

Gilles Gravelle is the Executive Director of Moving Missions. He is also the Director of Research & Innovation for Seed Company. His research, writing, and consulting cover a variety of disciplines, including missiology, strategic planning, impact evaluation, fundraising, and mission philanthropy. With 40 years of international experience, Gilles has a broad understanding of trends and changes taking place in missions and nonprofit development. He is the author of several books, including Impact-Driven Philanthropy, The Age of Global Giving, and So What? Answering a Donor’s Toughest Question. He earned an MA in applied linguistics from Darwin University, Australia and a Ph.D. in general linguistics from Free University, Amsterdam.
