TLDR: Llama-GENBA-10B is a new 10-billion parameter trilingual large language model built on Llama 3.1-8B, specifically designed to reduce English-centric bias by focusing on German, English, and Bavarian. It was continuously pretrained on 164 billion tokens, with equal parts English and German, and a significant portion of Bavarian. The model addresses challenges in data scarcity, tokenizer design, and evaluation for low-resource languages. Its fine-tuned version achieves state-of-the-art performance in Bavarian, while also performing strongly in English and German, demonstrating an efficient and inclusive approach to multilingual AI development.
A new trilingual large language model, Llama-GENBA-10B, has been introduced, aiming to tackle the common English-centric bias prevalent in many large language models. This innovative model focuses on German, English, and Bavarian, offering a significant step towards more inclusive AI. You can find the full research paper here: Llama-GENBA-10B: A Trilingual Large Language Model for German, English and Bavarian.
Developed by a team including Michael Hoffmann, Jophin John, Stefan Schweter, Gokul Ramakrishnan, Hoi-Fong Mak, Alice Zhang, Dmitry Gaynullin, and Nicolay J. Hammer, Llama-GENBA-10B is built upon the Llama 3.1-8B architecture and expanded to 10 billion parameters. Its continuous pretraining involved a massive 164 billion tokens, carefully balanced with 82 billion English, 82 billion German, and 80 million Bavarian tokens. This deliberate balance was crucial to prevent English from dominating the model’s learning, ensuring better representation for German and, notably, promoting Bavarian as a low-resource language.
Addressing Key Challenges in Multilingual AI
The development of Llama-GENBA-10B involved overcoming four significant hurdles. First, the team faced the challenge of curating a comprehensive multilingual corpus, especially given the scarcity of Bavarian language resources. This required innovative approaches to data collection and language identification.
Second, a unified tokenizer was essential to effectively process English, German, and Bavarian. The team systematically expanded the Llama-3-8B tokenizer vocabulary, specifically to handle German umlauts (ä, ö, ü) and the eszett (ß) correctly, preventing token fragmentation and preserving semantic integrity. A 20% vocabulary expansion was found to be the optimal balance between performance and computational cost.
Third, optimizing the model’s architecture and the language-ratio hyperparameters was critical for effective cross-lingual transfer. Experiments showed that a balanced 1:1 ratio for English and German data during the initial pretraining phase yielded the best results. Bavarian data was then introduced in the final 10% of training, upsampled to ensure its impact without immediate competition from the larger English and German datasets.
Finally, the researchers established the first standardized trilingual evaluation suite. This involved translating existing German benchmarks into Bavarian, allowing for direct and meaningful cross-lingual performance comparisons across all three languages.
Training and Performance Highlights
The model’s training was conducted on a single Cerebras CS-2 AI Accelerator, demonstrating that large-scale multilingual pretraining can be efficiently managed by smaller research teams. The process also included detailed tracking of energy consumption, providing valuable insights into the computational costs involved.
In evaluations, the base version of Llama-GENBA-10B showed strong performance in English, competitive results in German, and surprisingly robust generalization in Bavarian, ranking among the top models for this dialect. The instruction-tuned variant, Llama-GENBA-10B-instruct, further enhanced these capabilities. It achieved state-of-the-art performance in Bavarian, surpassing other prominent models like Apertus-8B-Instruct-2509 and gemma-2-9B-it. It also outperformed EuroLLM in English and matched its results in German, solidifying its position as a leading model in its class for Bavarian language tasks.
Also Read:
- MVPBench: Bridging Global Values in Large Language Models
- Optimizing LLM Performance: Exploring Layer-Wise Scaling for Pre-Training
A Blueprint for Inclusive AI
The Llama-GENBA-10B project consumed approximately 35.23 megawatt-hours of electricity over 66 days of pretraining, an important metric for understanding the environmental footprint of large language models. This research not only delivers a high-performing trilingual model but also offers a practical blueprint for developing linguistically inclusive and resource-efficient foundation models, particularly for integrating low-resource languages. Future work will explore instruction-tuning safety, chatbot integration, and human assessment across all three languages, with the potential to apply this approach to other dialects and endangered languages.


