Llama-GENBA-10B: A New Trilingual AI Model Champions German, English, and Bavarian

TLDR: Llama-GENBA-10B is a new 10-billion parameter trilingual large language model built on Llama 3.1-8B, designed to reduce English-centric bias by focusing on German, English, and Bavarian. It was continuously pretrained on 164 billion tokens, split evenly between English and German, plus a much smaller Bavarian component of 80 million tokens. The model addresses challenges in data scarcity, tokenizer design, and evaluation for low-resource languages. Its fine-tuned version achieves state-of-the-art performance in Bavarian while also performing strongly in English and German, demonstrating an efficient and inclusive approach to multilingual AI development.

A new trilingual large language model, Llama-GENBA-10B, has been introduced to tackle the English-centric bias found in many large language models. It focuses on German, English, and Bavarian, a step towards more inclusive AI. The full research paper is titled "Llama-GENBA-10B: A Trilingual Large Language Model for German, English and Bavarian".

Developed by a team including Michael Hoffmann, Jophin John, Stefan Schweter, Gokul Ramakrishnan, Hoi-Fong Mak, Alice Zhang, Dmitry Gaynullin, and Nicolay J. Hammer, Llama-GENBA-10B is built upon the Llama 3.1-8B architecture and expanded to 10 billion parameters. Its continuous pretraining used roughly 164 billion tokens: 82 billion English, 82 billion German, and 80 million Bavarian. The even split between English and German was meant to prevent English from dominating the model’s learning and to give German better representation, while the Bavarian data brings a low-resource language into the training mix.
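To put the reported token counts in proportion, the following is a minimal sketch that computes each language's share of the corpus. The counts come from the article; the names TOKENS and token_shares are illustrative and not taken from the paper.

```python
# Token counts as reported in the article.
TOKENS = {
    "English": 82_000_000_000,
    "German": 82_000_000_000,
    "Bavarian": 80_000_000,
}


def token_shares(tokens):
    """Return each language's fraction of the total token count."""
    total = sum(tokens.values())
    return {lang: count / total for lang, count in tokens.items()}


if __name__ == "__main__":
    total = sum(TOKENS.values())
    print(f"total: {total / 1e9:.2f}B tokens")
    for lang, share in token_shares(TOKENS).items():
        print(f"{lang:>8}: {share:.4%}")
```

By this arithmetic, Bavarian makes up well under a tenth of a percent of the raw corpus, which is why how and when it is sampled during training matters.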

Addressing Key Challenges in Multilingual AI

The development of Llama-GENBA-10B involved overcoming four significant hurdles. First, the team faced the challenge of curating a comprehensive multilingual corpus, especially given the scarcity of Bavarian language resources. This required innovative approaches to data collection and language identification.

Second, a unified tokenizer was essential to effectively process English, German, and Bavarian. The team systematically expanded the Llama-3-8B tokenizer vocabulary, specifically to handle German umlauts (ä, ö, ü) and the eszett (ß) correctly, preventing token fragmentation and preserving semantic integrity. A 20% vocabulary expansion was found to be the optimal balance between performance and computational cost.
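One reason these characters are prone to fragmentation is that each of them occupies two bytes in UTF-8, so a tokenizer that falls back to byte-level pieces can split a single letter in two. The sketch below only measures that byte overhead with the Python standard library; it does not use or reproduce the model's actual tokenizer, and the helper names are hypothetical.

```python
# Characters the article names as targets of the vocabulary expansion.
GERMAN_SPECIAL = "äöüÄÖÜß"


def utf8_byte_counts(text):
    """Map each character to the number of bytes it needs in UTF-8."""
    return {ch: len(ch.encode("utf-8")) for ch in text}


def byte_overhead(word):
    """Extra bytes a word needs beyond one byte per character."""
    return len(word.encode("utf-8")) - len(word)


if __name__ == "__main__":
    print(utf8_byte_counts(GERMAN_SPECIAL))
    for word in ("house", "Straße", "Größe"):
        print(word, byte_overhead(word))
```

A word such as "Größe" carries two extra bytes compared with an ASCII word of the same length, which is the kind of overhead dedicated vocabulary entries are meant to absorb.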

Third, optimizing the model’s architecture and the language-ratio hyperparameters was critical for effective cross-lingual transfer. Experiments showed that a balanced 1:1 ratio for English and German data during the initial pretraining phase yielded the best results. Bavarian data was then introduced in the final 10% of training, upsampled to ensure its impact without immediate competition from the larger English and German datasets.
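The staged schedule described above can be sketched as a function that returns sampling weights for a given point in training. This is an illustration under stated assumptions, not the authors' implementation: the article gives the 1:1 English-German ratio and the final-10% window, but the upsampled Bavarian weight (bavarian_weight=0.2 here) is a hypothetical placeholder, as are the function names and the language codes.

```python
import random


def language_weights(progress, bavarian_weight=0.2, bavarian_start=0.9):
    """Sampling weights at a given fraction of training completed (0.0 to 1.0).

    bavarian_weight is a hypothetical value; the article does not state
    the actual upsampling factor.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    if progress < bavarian_start:
        # Initial phase: English and German only, at a 1:1 ratio.
        return {"en": 0.5, "de": 0.5, "bar": 0.0}
    # Final phase: Bavarian is upsampled; English and German stay 1:1.
    rest = (1.0 - bavarian_weight) / 2
    return {"en": rest, "de": rest, "bar": bavarian_weight}


def sample_language(progress, rng):
    """Draw the language of the next training document."""
    weights = language_weights(progress)
    langs = list(weights)
    return rng.choices(langs, weights=[weights[l] for l in langs], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    for p in (0.1, 0.5, 0.95):
        print(p, language_weights(p), sample_language(p, rng))
```

Keeping English and German at equal weight in both phases preserves the 1:1 ratio the experiments favored, while the late switch confines Bavarian to the window the article describes.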

Finally, the researchers established the first standardized trilingual evaluation suite. This involved translating existing German benchmarks into Bavarian, allowing for direct and meaningful cross-lingual performance comparisons across all three languages.

Training and Performance Highlights

The model’s training was conducted on a single Cerebras CS-2 AI Accelerator, demonstrating that large-scale multilingual pretraining can be efficiently managed by smaller research teams. The process also included detailed tracking of energy consumption, providing valuable insights into the computational costs involved.

In evaluations, the base version of Llama-GENBA-10B showed strong performance in English, competitive results in German, and surprisingly robust generalization in Bavarian, ranking among the top models for this dialect. The instruction-tuned variant, Llama-GENBA-10B-instruct, further enhanced these capabilities. It achieved state-of-the-art performance in Bavarian, surpassing other prominent models like Apertus-8B-Instruct-2509 and gemma-2-9B-it. It also outperformed EuroLLM in English and matched its results in German, solidifying its position as a leading model in its class for Bavarian language tasks.

A Blueprint for Inclusive AI

The Llama-GENBA-10B project consumed approximately 35.23 megawatt-hours of electricity over 66 days of pretraining, an important metric for understanding the environmental footprint of large language models. This research not only delivers a high-performing trilingual model but also offers a practical blueprint for developing linguistically inclusive and resource-efficient foundation models, particularly for integrating low-resource languages. Future work will explore instruction-tuning safety, chatbot integration, and human assessment across all three languages, with the potential to apply this approach to other dialects and endangered languages.
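For a sense of scale, the reported figures can be converted into an average power draw and a rough energy cost per billion tokens. This is back-of-the-envelope arithmetic on the article's numbers only; it assumes the 35.23 MWh covers the full 164 billion pretraining tokens, and the function names are illustrative.

```python
# Figures as reported in the article.
ENERGY_MWH = 35.23
DAYS = 66
TOKENS = 164e9  # assumption: the metered energy covers all pretraining tokens


def average_power_kw(energy_mwh, days):
    """Average power in kW implied by total energy over a duration."""
    hours = days * 24
    return energy_mwh * 1000 / hours


def energy_per_billion_tokens_kwh(energy_mwh, tokens):
    """Energy in kWh spent per billion tokens processed."""
    return energy_mwh * 1000 / (tokens / 1e9)


if __name__ == "__main__":
    print(f"{average_power_kw(ENERGY_MWH, DAYS):.2f} kW average")
    print(f"{energy_per_billion_tokens_kwh(ENERGY_MWH, TOKENS):.1f} kWh per billion tokens")
```

Under these assumptions the run averages a little over 22 kW across the 66 days.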
