TLDR: A research paper reveals that tokenization inefficiency in large language models creates a “token tax” for morphologically complex, low-resource languages. Higher “fertility” (tokens per word) consistently predicts lower model accuracy, and because attention cost scales quadratically with sequence length, doubling token counts can quadruple training costs and double inference costs. While reasoning models improve performance, they don’t eliminate this systemic bias, highlighting the need for morphologically aware tokenization, fair pricing, and better multilingual benchmarks for equitable NLP.
A recent research paper titled “The Token Tax: Systematic Bias in Multilingual Tokenization” sheds light on a critical issue in the world of large language models (LLMs): how the way words are broken down into “tokens” can create significant disadvantages for certain languages, particularly those that are morphologically complex and have fewer digital resources.
The core problem, as identified by authors Jessica M. Lundin, Ada Zhang, Nihal Karim, Hamza Louzan, Victor Wei, David Adelani, and Cody Carroll, is that tokenization inefficiency isn’t just a minor technical glitch. Instead, it imposes structural disadvantages, leading to inflated computing resources and reduced accuracy for many languages. This phenomenon has been dubbed the “token tax.”
The Accuracy Gap: More Tokens, Less Precision
The researchers evaluated 10 large language models on AfriMMLU, a benchmark dataset comprising 9,000 multiple-choice questions across 5 subjects and 16 African languages. Their findings were stark: a metric called “fertility,” the average number of tokens per word, reliably predicts accuracy. Across every model and subject tested, higher fertility was associated with lower accuracy, meaning that languages requiring more tokens to represent their words tend to perform worse in LLMs.
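To make the metric concrete, here is a minimal sketch of how fertility can be computed with a Hugging Face tokenizer. The whitespace word split, the `gpt2` tokenizer, and the sample sentences are illustrative assumptions, not details from the paper:

```python
# Fertility = average number of subword tokens per word.
# Splitting on whitespace to get "words" is itself a simplification
# for many morphologically complex languages.
from transformers import AutoTokenizer  # pip install transformers

def fertility(tokenizer, text: str) -> float:
    """Average number of subword tokens per whitespace-delimited word."""
    words = text.split()
    total_tokens = sum(len(tokenizer.tokenize(w)) for w in words)
    return total_tokens / len(words)

tok = AutoTokenizer.from_pretrained("gpt2")  # illustrative tokenizer choice
print(fertility(tok, "The morning was calm"))            # English sample
print(fertility(tok, "Habari ya asubuhi rafiki yangu"))  # Swahili sample (likely higher)
```

A fertility near 1.0 means most words map to a single token; values well above 1.0 indicate the heavy fragmentation the paper links to lower accuracy and higher cost.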
For instance, African languages lagged English by about 25 accuracy points on average. The study found that each additional token per word could reduce accuracy by 8 to 18 percentage points, depending on the subject and model. This systematic erosion of performance underscores that tokenization bias is not incidental but structural.
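A quick back-of-the-envelope check shows how a slope of that size can account for the observed gap. The fertility values below are illustrative assumptions; only the slope range comes from the paper:

```python
# If each extra token per word costs 8-18 accuracy points, a language
# tokenized at fertility 2.5 versus ~1.2 for English would lose roughly:
english_fertility = 1.2  # assumed, for illustration
other_fertility = 2.5    # assumed, for illustration
for slope in (8, 18):    # points lost per additional token/word (from the paper)
    drop = slope * (other_fertility - english_fertility)
    print(f"slope {slope} pts/token -> ~{drop:.1f} point accuracy drop")
# Prints ~10.4 and ~23.4 points, consistent with the ~25-point gap reported.
```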
Reasoning Models Offer Hope, But Don’t Eliminate Bias
An encouraging discovery from the research was the performance of “reasoning models” such as DeepSeek and o1. These models consistently outperformed their non-reasoning counterparts across both high- and low-resource languages in the AfriMMLU dataset, narrowing the accuracy gap with English by improving average performance on African languages by 8-12 points. While this is a significant step forward, the paper emphasizes that even these advanced models do not entirely eliminate the inequities rooted in tokenization, suggesting that the problem runs deeper than model architecture.
The Economic Burden: A Quadrupled Cost
Beyond accuracy, the “token tax” has severe economic consequences. The paper highlights that self-attention in transformer models, the backbone of modern LLMs, scales quadratically with sequence length. This means that if a language requires twice as many tokens to express the same content as another, the training cost and time don’t just double; they quadruple. For example, training a model like Llama-3.1-405B might cost $105 million in English but $420 million for a language with double the fertility.
Inference costs and latency are similarly affected. Generating 1 million English-equivalent tokens with a model like GPT-4o might cost $5-20, but the same content in a language with 2x fertility could cost $10-40, with double the processing time. These disparities mean that linguistic diversity becomes a computational and financial liability, one borne disproportionately by speakers of morphologically complex, low-resource languages.
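The arithmetic behind these figures is simple to sketch. The functions below encode the paper’s framing, with training cost scaling as the square of relative fertility and inference cost scaling linearly; the dollar figures are the illustrative ones quoted above, not measured values:

```python
# Back-of-the-envelope "token tax" estimator under the paper's framing:
# quadratic attention makes training cost scale with fertility**2, while
# per-token inference pricing scales linearly with fertility.

def training_cost(base_usd: float, relative_fertility: float) -> float:
    """Training cost after sequence lengths grow by relative_fertility."""
    return base_usd * relative_fertility ** 2

def inference_cost(base_usd: float, relative_fertility: float) -> float:
    """Cost of the same content expressed in relative_fertility x the tokens."""
    return base_usd * relative_fertility

# A language needing 2x the tokens of English for the same content:
print(training_cost(105e6, 2.0))  # 420000000.0 -> $420M, matching the example above
print(inference_cost(5.0, 2.0))   # 10.0 -> $10 per million English-equivalent tokens
```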
Moving Towards Equitable NLP
The study concludes by calling for multi-level interventions to address this systemic bias:
- Technical: morphologically aware tokenization and more efficient attention mechanisms.
- Economic: fair pricing structures that don’t penalize high-fertility languages.
- Benchmarking: expanded multilingual evaluation datasets like AfriMMLU.
The authors stress that aligning progress across these fronts is crucial to ensure that the benefits of language technology reach billions of speakers worldwide, rather than excluding them through an unseen “token tax.”
For a deeper dive into the methodology and detailed results, you can read the full research paper here: The Token Tax: Systematic Bias in Multilingual Tokenization.