The AI Revolution Is Crushing Thousands of Languages
- IOAI Canada
- Apr 12, 2024
by Matteo Wong
Recently, Bonaventure Dossou learned of an alarming tendency in a popular AI model. The program described Fon—a language spoken by Dossou’s mother and millions of others in Benin and neighboring countries—as “a fictional language.”
This result, which I replicated, is not unusual. Dossou is accustomed to the feeling that his culture is unseen by technology that so readily serves other people. He grew up with no Wikipedia pages in Fon and no translation programs to help him communicate with his mother in French, the language in which he is more fluent. “When we have a technology that treats something as simple and fundamental as our name as an error, it robs us of our personhood,” Dossou told me.
The rise of the internet, alongside decades of American hegemony, made English into a common tongue for business, politics, science, and entertainment. More than half of all websites are in English, yet more than 80 percent of people in the world don’t speak the language. Even basic aspects of digital life—searching with Google, talking to Siri, relying on autocorrect, simply typing on a smartphone—have long been closed off to much of the world. And now the generative-AI boom, despite promises to bridge languages and cultures, may only further entrench the dominance of English in life on and off the web.
Scale is central to this technology. Compared with previous generations, today’s AI requires orders of magnitude more computing power and training data, all to create the humanlike language that has bedazzled so many users of ChatGPT and other programs. Much of the information that generative AI “learns” from is simply scraped from the open web. For that reason, the preponderance of English-language text online could mean that generative AI works best in English, cementing a cultural bias in a technology that has been marketed for its potential to “benefit humanity as a whole.” Some other languages are also well positioned for the generative-AI age, but only a handful: Nearly 90 percent of websites are written in just 10 languages (English, Russian, Spanish, German, French, Japanese, Turkish, Portuguese, Italian, and Persian).
Some 7,000 languages are spoken in the world. Google Translate supports 133 of them. Chatbots from OpenAI, Google, and Anthropic are still more constrained. “There’s a sharp cliff in performance,” Sara Hooker, a computer scientist and the head of Cohere for AI, a nonprofit research arm of the tech company Cohere, told me. “Most of the highest-performance [language] models serve eight to 10 languages. After that, there’s almost a vacuum.” As chatbots, translation devices, and voice assistants become a crucial way to navigate the web, that rising tide of generative AI could wash out thousands of Indigenous and low-resource languages such as Fon—languages that lack sufficient text with which to train AI models.
“Many people ignore those languages, both from a linguistic standpoint and from a computational standpoint,” Ife Adebara, an AI researcher and a computational linguist at the University of British Columbia, told me. Younger generations will have less and less incentive to learn their forebears’ tongues. And this is not just a matter of replicating existing issues with the web: If generative AI indeed becomes the portal through which the internet is accessed, then billions of people may in fact be worse off than they are today.