Notes by Rajeev Goswami

Insights on AI, Business Travel & Leadership

Artificial intelligence is often described as universal—capable of serving anyone, anywhere, in any language. But beneath this promise lies a structural imbalance.

Today’s leading AI systems—ChatGPT, Gemini, Claude, Llama—are not truly global intelligences. They are English-first systems with multilingual capabilities layered on top. And this distinction has real consequences for how knowledge is created, accessed, and trusted.


AI Doesn’t Just Speak English—It Thinks in It

Modern AI models are trained on internet-scale data, where English dominates. Around 50–55% of web content is in English, according to W3Techs [1], far exceeding its share of the global population of speakers. Large multilingual datasets, such as mC4, and models like mT5 also exhibit a strong skew toward English tokens [2].

The mC4 dataset, a multilingual variant of Google’s C4 corpus that includes text in 108 languages, demonstrates the scale of this disparity. While the dataset represents a significant effort toward multilingual inclusion, its token distribution reveals the depth of language inequality in AI training. English dominates with 2,733 billion tokens, while many other languages receive dramatically smaller allocations. For perspective, Icelandic receives only 2.6 billion tokens for roughly 350,000 native speakers, while Telugu, spoken by 83 million people, gets just 1.3 billion [14].

As a result, even when users ask questions in Hindi, German, or Spanish, models often rely on English-heavy representations and sources during processing.

This leads to what researchers describe as translation-mediated reasoning and can produce less natural outputs in non-English languages [3]. It also affects what the model considers authoritative—since English sources dominate the web and link ecosystems, they are more likely to be retrieved [4].


When AI Ignores Local Reality

In Europe, research and audits have found that AI systems can favor globally visible entities over strong local players, owing to data visibility and language bias [5].

In India, however, the issue is less about global vs local brands and more about what gets represented at all.

AI performs well in sectors that are digitized and documented in English, but struggles with areas that exist primarily in regional languages or informal systems:

  • Agriculture: AI often provides generic agronomic advice while missing region-specific practices and local extension knowledge
  • Government schemes: Responses capture official summaries but miss state-level implementation and on-ground processes
  • Informal economy: Large parts of India’s economy—such as informal credit and local supply chains—are underrepresented in structured data
  • Cultural knowledge: Hyper-local traditions and practices are sparsely documented online

The FAO and the World Bank have noted that digital agricultural knowledge in developing countries is highly fragmented and under-digitized.

Digital systems tend to capture formal, documented knowledge—while informal and vernacular knowledge remains underrepresented [6].


Token Disparity Table

The table below illustrates this “data poverty” gap: population size does not predict AI representation.

Language | Estimated Native Speakers | mC4 Tokens (Approx.) | Representation Gap
English | 380 million | 2,733 billion | Over-represented
Spanish | 485 million | 164 billion | Moderate
Telugu | 83 million | 1.3 billion | Critical under-representation
Hindi | 345 million | 1.7 billion | Critical under-representation
Icelandic | 350,000 | 2.6 billion | High (per capita)
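The disparity becomes even starker when normalized per speaker. A quick back-of-the-envelope calculation, using the approximate figures from the table above, shows the gap concretely:

```python
# Tokens available per native speaker, using the approximate figures
# from the table above (token counts and speaker counts are rough estimates).
data = {
    "English":   (2733e9, 380e6),
    "Spanish":   (164e9,  485e6),
    "Hindi":     (1.7e9,  345e6),
    "Telugu":    (1.3e9,  83e6),
    "Icelandic": (2.6e9,  350e3),
}

per_capita = {lang: tokens / speakers for lang, (tokens, speakers) in data.items()}

for lang, ratio in sorted(per_capita.items(), key=lambda kv: -kv[1]):
    print(f"{lang:10s} ~{ratio:,.0f} tokens per native speaker")
```

By this rough measure, English and Icelandic each get thousands of tokens per speaker, while Hindi gets fewer than ten.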

A Shift Toward Cultural Homogenization

AI responses on social or political topics often draw from globally visible, English-language sources. This can subtly shape how local issues are framed [7].

Recent studies show that language models encode cultural and value biases, often aligning more closely with Western perspectives even across languages [8].

Diversity in language does not guarantee diversity in worldview.


When Bias Becomes a Safety Issue

The problem becomes more serious in domains like healthcare.

Research evaluating multilingual LLM performance finds that accuracy and consistency can vary across languages, often being stronger in English [9].

This is partly due to the dominance of English-language medical literature and partly because healthcare is culturally contextual—symptoms, descriptions, and care pathways vary across regions.

In multilingual settings like India, this creates a gap between language support and true contextual understanding.


The Hidden Bias in Design

Bias is not just in data—it is also in design.

AI systems are typically built around:

  • English interaction patterns
  • Structured queries
  • Western institutional assumptions

Research in human-computer interaction and global computing shows how such systems often fail to meet the needs of multilingual, low-resource users [10].

AI is not just English-trained—it is English-centered in design assumptions.


Rethinking Multilingual AI

The “curse of multilinguality” holds that, for a fixed model capacity, adding more languages degrades per-language performance. However, recent work shows that data quality and balance matter more than sheer scale [11].

Carefully curated multilingual data can significantly improve performance across languages—and even benefit English.
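One standard rebalancing technique, used in multilingual models such as mT5 and XLM-R, is temperature-based language sampling: each language is sampled in proportion to its data share raised to a power alpha &lt; 1, which upsamples low-resource languages. A minimal sketch, using the mC4-style token counts from earlier purely as illustrative inputs:

```python
# Temperature-based language sampling: p_i ∝ q_i ** alpha, where q_i is a
# language's share of the training tokens. alpha < 1 flattens the
# distribution, boosting low-resource languages relative to their raw share.
token_counts = {"English": 2733, "Spanish": 164, "Hindi": 1.7, "Telugu": 1.3}  # billions

def sampling_probs(counts, alpha):
    total_tokens = sum(counts.values())
    weights = {lang: (n / total_tokens) ** alpha for lang, n in counts.items()}
    total_weight = sum(weights.values())
    return {lang: w / total_weight for lang, w in weights.items()}

raw = sampling_probs(token_counts, alpha=1.0)   # sample proportionally to data size
flat = sampling_probs(token_counts, alpha=0.3)  # upsample low-resource languages

for lang in token_counts:
    print(f"{lang:8s} raw={raw[lang]:.4f}  alpha=0.3={flat[lang]:.4f}")
```

With alpha=0.3, Hindi’s sampling probability rises from well under 0.1% to several percent, without discarding any English data.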


What Needs to Change

The path forward requires a shift from scale to intentional design:

  • Build native-language datasets, not just translations.
  • Invest in localized models and use cases.
  • Focus on data curation and quality.
  • Improve fluency using methods like Direct Preference Optimization (DPO), a training approach in which models learn from human preference comparisons [12].
  • Develop culturally aware systems, not just multilingual ones.
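The DPO objective mentioned above is simple enough to sketch: the model is pushed to widen the likelihood margin of a human-preferred response over a dispreferred one, relative to a frozen reference model. The log-probabilities below are made-up numbers for illustration only:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    loss = -log sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r)))
    It shrinks as the policy favors the chosen response more strongly
    than the reference model does.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Hypothetical log-probs: the policy slightly prefers the fluent native-language
# answer (chosen) over a stilted translated one (rejected).
loss = dpo_loss(logp_chosen=-1.0, logp_rejected=-2.0,
                ref_logp_chosen=-1.2, ref_logp_rejected=-1.5)
print(f"DPO loss: {loss:.3f}")
```

Collecting the preference pairs from native speakers, rather than from translations, is what ties this method to the fluency goal above.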

Efforts like the BigScience ROOTS corpus demonstrate this direction—a large multilingual dataset built collaboratively with native contributors to improve representation and diversity [13].

Sovereign AI Movement: Many nations are building their own models specifically to counter the “English brain”: India with Bhashini, for example, and France through its backing of Mistral.


The Bigger Question

AI is becoming a primary interface for knowledge and decision-making.

If its foundations remain disproportionately English:

  • It may reinforce existing inequalities.
  • It may underrepresent non-English knowledge systems.
  • It may standardize a narrow worldview.

Beyond Translation: Join the Movement for Local AI 

The “English Brain” of AI isn’t an inevitability—it’s a data gap we can bridge. If we want AI to reflect our world, we must actively participate in its construction.

  • Contribute to Open Datasets: Support initiatives like Common Voice or BigScience by donating speech and text in your native language.
  • Audit Your Tools: When using AI for local business or healthcare, always cross-reference its advice with vernacular sources.
  • Demand Sovereign AI: Support policies that prioritize the development of “Sovereign AI” models trained on local, culturally specific data rather than just translated English scrapings.

Let’s build an AI that doesn’t just translate our words, but understands our world.


Conclusion: From Multilingual to Multicultural AI

The goal is not just to make AI speak more languages.

The goal is to make AI:

  • Understand different ways of thinking.
  • Reflect diverse realities.
  • Respect cultural context.

Until then, we are not building global intelligence.

We are exporting one.


References 

  1. W3Techs – Usage of content languages for websites. https://w3techs.com/technologies/overview/content_language
  2. Xue et al. (2021) – mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer
  3. Lai et al. (2023) – On the Multilingual Capabilities of LLMs / Translation Effects
  4. Bender et al. (2021) – On the Dangers of Stochastic Parrots
  5. European Commission / OECD policy reports on AI bias and localization (2023–24)
  6. World Bank (2021) – World Development Report: Data for Better Lives
  7. Dwivedi et al. (2023) – So What If ChatGPT Wrote It?
  8. Naous et al. (2023) – Multilingual Language Models Encode Cultural Biases
  9. Bang et al. (2023) – Multilingual LLM Evaluation
  10. Heeks (2018) / ICT4D literature – Digital inequality and local context gaps
  11. Pfeiffer et al. (2020) – MAD-X: Multilingual Transfer Learning
  12. Rafailov et al. (2023) – Direct Preference Optimization (DPO)
  13. Laurençon et al. (2022) – The BigScience ROOTS Corpus
  14. Data Jungle Adventures (2025) – The Great Language Divide. https://datajungleadventures.com/2025/06/04/the-great-language-divide-how-unequal-distribution-in-ai-training-data-shapes-our-digital-future/

