TLDR: A study using Sparse Autoencoders on Gemma-2-2B found that Large Language Models (LLMs) exhibit systematic activation disparities, with medium-to-low resource languages receiving significantly lower internal activations compared to high-resource languages. This disparity correlates with weaker performance on benchmarks, despite similar embedding representations. Activation-aware fine-tuning improved activations for underrepresented languages and led to modest benchmark gains, highlighting activation alignment as key for multilingual LLM performance.
Large Language Models (LLMs) have shown impressive abilities in understanding and generating text across many languages. However, a significant challenge remains: these powerful models often perform less effectively in languages with fewer digital resources, known as medium-to-low resource languages. This disparity is a concern, especially since English data heavily dominates the training datasets for most LLMs, with non-English data making up only a small fraction.
Richmond Sin Jing Xuan, Jalil Huseynov, and Yang Zhang, researchers at the National University of Singapore, investigated this performance gap in their paper, “Uncovering Cross-Linguistic Disparities in LLMs using Sparse Autoencoders”. They aimed to understand why LLMs struggle with certain languages even when their internal representations (embeddings) appear to treat all languages similarly.
Peeking Inside LLMs with Sparse Autoencoders
To delve into how LLMs process different languages, the researchers used a technique called Sparse Autoencoders (SAEs). Think of SAEs as a special magnifying glass that allows us to see the “activation patterns” or how much different parts of the LLM “light up” when processing text in various languages. Unlike simply looking at how similar language representations are, SAEs provide direct insights into the neural activity within the model.
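To make this concrete, here is a minimal SAE sketch in PyTorch. This is an illustration rather than the authors’ implementation: the plain ReLU encoder, the 16,384-feature dictionary size, and the L1 sparsity coefficient are assumptions for the sketch; 2304 is Gemma-2-2B’s actual hidden size.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: reconstruct a hidden state through an overcomplete,
    sparsely activated feature layer."""
    def __init__(self, d_model=2304, d_features=16384):  # 2304 = Gemma-2-2B hidden size
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, h):
        f = torch.relu(self.encoder(h))  # sparse feature activations (how much "lights up")
        h_hat = self.decoder(f)          # reconstruction of the original hidden state
        return h_hat, f

def sae_loss(h, h_hat, f, l1_coeff=1e-3):
    # reconstruction error plus an L1 penalty that pushes most features to zero
    return ((h - h_hat) ** 2).mean() + l1_coeff * f.abs().mean()
```

Once an SAE like this is trained on a layer’s hidden states, the feature activations `f` are the quantity the disparity analysis measures.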
The study focused on the Gemma-2-2B model, analyzing its 26 internal layers across 10 languages. These included high-resource languages like Chinese, Russian, Spanish, and Italian, and medium-to-low resource languages such as Indonesian, Catalan, Marathi, Malayalam, and Hindi, with English serving as a reference.
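A rough sketch of how such a per-language, per-layer measurement could be run with the Hugging Face transformers library is shown below. The `parallel_texts` corpus, the per-layer `sae`, and the mean-absolute-activation statistic are hypothetical stand-ins for the paper’s actual pipeline, not its published code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-2b"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, output_hidden_states=True)
model.eval()

def mean_feature_activation(texts, sae, layer):
    """Average SAE feature activation for one language at one layer."""
    scores = []
    for text in texts:
        ids = tok(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
            _, f = sae(out.hidden_states[layer])  # features from this layer's residual stream
        scores.append(f.abs().mean().item())
    return sum(scores) / len(scores)

# parallel_texts = {"en": [...], "hi": [...], "ml": [...], ...}  # hypothetical corpus
# activations = {lang: mean_feature_activation(txts, sae, layer=5)
#                for lang, txts in parallel_texts.items()}
```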
Key Findings: A Clear Disparity
The analysis revealed systematic differences in how LLMs activate for different language groups. The most striking findings were:
- Lower Activations for Less-Resourced Languages: Medium-to-low resource languages consistently received significantly lower activation levels than high-resource languages. The gap was most pronounced in the model’s early layers, where activations were up to 26.27% lower, and persisted into deeper layers at around 19.89%.
- Correlation with Performance: These lower activation levels correlated directly with weaker performance on common benchmarks like ARC-C, MMLU, and HellaSwag. In other words, when a language doesn’t “activate” the model’s internal features as strongly, its performance suffers.
- Embedding Similarity Isn’t Enough: Interestingly, even when “embedding similarity” (how similar the model’s overall representations of different languages appeared) was high, task performance for medium-to-low resource languages remained much lower. Surface-level similarity doesn’t guarantee equitable processing, as the sketch below illustrates.
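Here is a small sketch of how these two quantities can diverge, reusing the per-language activations measured above. The `act` and `emb` dictionaries and both helper functions are hypothetical illustrations, not the paper’s exact metric definitions.

```python
import torch.nn.functional as F

def activation_gap(act, high_resource, low_resource):
    """Percentage by which low-resource activations fall below high-resource ones."""
    hi = sum(act[l] for l in high_resource) / len(high_resource)
    lo = sum(act[l] for l in low_resource) / len(low_resource)
    return (hi - lo) / hi * 100  # ~26% in early layers, ~20% in deep layers per the paper

def embedding_similarity(emb, lang, ref="en"):
    """Cosine similarity of mean sentence embeddings; can stay high even when the gap is large."""
    return F.cosine_similarity(emb[lang], emb[ref], dim=-1).item()
```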
Addressing the Imbalance: Activation-Aware Fine-Tuning
To mitigate these disparities, the researchers applied a technique called activation-aware fine-tuning using LoRA (Low-Rank Adaptation). This method aimed to raise activation levels for the underperforming languages while keeping the model’s performance on English stable.
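The paper’s exact training objective isn’t reproduced here, but one plausible way to implement the idea with the peft library, building on the model loaded earlier, is to add a penalty that pulls a language’s SAE activation level toward a target (for example, the English level) on top of the usual language-modeling loss. The `activation_aware_loss` helper, the `alpha` weight, and `target_act` are all assumptions in this sketch.

```python
from peft import LoraConfig, get_peft_model

# Attach low-rank adapters to the attention projections (a common LoRA choice).
lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
peft_model = get_peft_model(model, lora_cfg)

def activation_aware_loss(batch, sae, layer, target_act, alpha=0.1):
    """Hypothetical combined objective: LM loss + activation-alignment penalty.
    `batch` is assumed to hold input_ids and attention_mask only."""
    out = peft_model(**batch, labels=batch["input_ids"], output_hidden_states=True)
    _, f = sae(out.hidden_states[layer])
    align = (f.abs().mean() - target_act) ** 2  # pull toward e.g. the English level
    return out.loss + alpha * align
```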
The fine-tuning led to substantial gains in activation for languages like Malayalam (87.69% increase) and Hindi (86.32% increase), while English retention remained high (around 91%). Post-fine-tuning, benchmark results showed modest but consistent improvements, particularly in tasks like ARC-Challenge for Malayalam, which saw a 5.47% improvement. However, the improvements were not uniform across all benchmarks, indicating that while activation alignment is crucial, it’s not a complete solution on its own.
Looking Ahead
While this study provides valuable insights, it also highlights areas for future work. The translation models used might introduce some errors, and the fine-tuning, while effective in aligning activations, only led to modest benchmark improvements. This suggests that LLMs might not fully converge to shared representations across all languages, and more refined fine-tuning strategies are needed. The findings are also specific to Gemma-2-2B, so further research is needed to see if these patterns hold true for other LLM architectures.
In conclusion, this research underscores that simply having similar language representations isn’t enough for equitable multilingual LLM performance. Understanding and addressing activation disparities through techniques like Sparse Autoencoders and targeted fine-tuning is a vital step towards building more fair and effective multilingual AI models.