
Advancing Arabic Language AI with HALA Models

TLDR: The HALA project introduces a new family of Arabic-centric instruction and translation models. By compressing a powerful English-Arabic translator and using it to create a large, high-quality Arabic instruction dataset, HALA models achieve state-of-the-art performance on Arabic benchmarks across various sizes (350M to 9B parameters). The methodology involves an efficient translate-and-tune pipeline, addressing the scarcity of quality Arabic instruction data and promoting language-centric AI development.

In the rapidly evolving world of artificial intelligence, large language models (LLMs) have made incredible strides, but their performance often varies significantly across different languages. For Arabic, a language rich in morphology and diverse dialects, there has been a persistent challenge: a scarcity of high-quality instruction data. This bottleneck limits the development of truly capable Arabic-centric AI.

A new research initiative, HALA, aims to address this by introducing a family of Arabic-centric instruction and translation models. Named after the Arabic word “Hala,” which conveys sweetness and beauty, these models are designed to deepen AI capabilities specifically for the Arabic language.

The HALA Approach: Translate and Tune

The core of the HALA project is an innovative “translate-and-tune” pipeline. The process begins with a powerful Arabic↔English (AR↔EN) teacher translation model, which the researchers compressed to FP8 precision. This roughly doubles inference throughput without sacrificing translation quality, turning the teacher into an efficient tool for generating high-fidelity bilingual data at scale.
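
The paper’s exact compression setup is not reproduced here, but as one plausible sketch, FP8 weight quantization can be requested at load time in an inference engine such as vLLM (whether HALA used vLLM is an assumption, and the model ID below is a placeholder):

```python
# Minimal sketch: loading a translation model with on-the-fly FP8 quantization
# in vLLM. The model ID is a placeholder, not HALA's actual teacher model.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/ar-en-teacher-translator",  # hypothetical model ID
    quantization="fp8",   # quantize weights to FP8 for roughly 2x faster serving
    dtype="bfloat16",
)

params = SamplingParams(temperature=0.0, max_tokens=512)  # greedy decoding
outputs = llm.generate(["Translate to Arabic: How do plants make food?"], params)
print(outputs[0].outputs[0].text)
```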

With this powerful translator, the HALA team embarked on creating a massive Arabic instruction corpus. They started by translating over 400,000 instruction-response pairs from the Open-Orca dataset, covering both user questions and AI assistant responses. To further enhance the quality of their bilingual data, they also meticulously filtered a large parallel corpus from OPUS-100, retaining nearly 440,000 high-fidelity Arabic-English pairs using a strict bilingual judge model.
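
A loose sketch of that translate-and-filter loop is below. The judge model, scoring scale, and the 0.9 fidelity threshold are illustrative assumptions rather than HALA’s published settings; `translate` and `judge_score` stand in for the FP8 teacher translator and the strict bilingual judge:

```python
# Hypothetical translate-and-filter loop; the judge model, scoring scale, and
# the 0.9 threshold are illustrative assumptions, not HALA's actual settings.
def build_bilingual_corpus(pairs, translate, judge_score, threshold=0.9):
    """Translate (instruction, response) pairs and keep only high-fidelity ones.

    pairs:       iterable of (en_instruction, en_response) strings
    translate:   callable mapping English text to Arabic text (the FP8 teacher)
    judge_score: callable scoring semantic fidelity of a (source, translation)
                 pair on a 0-1 scale (the bilingual judge model)
    """
    kept = []
    for en_instruction, en_response in pairs:
        ar_instruction = translate(en_instruction)
        ar_response = translate(en_response)
        fidelity = min(judge_score(en_instruction, ar_instruction),
                       judge_score(en_response, ar_response))
        if fidelity >= threshold:  # discard pairs the judge deems unfaithful
            kept.append({"instruction": ar_instruction, "response": ar_response})
    return kept
```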

This combined dataset, totaling around 1.26 million bilingual examples, was then used to fine-tune a lightweight language model, LiquidAI/LFM2-1.2B. This specialized model became a fast and stable AR↔EN translator, particularly adept at handling instruction-style inputs.
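
The team’s released training recipes will carry the authoritative settings; as a rough approximation, supervised fine-tuning of LiquidAI/LFM2-1.2B on such a corpus could be expressed with Hugging Face TRL along these lines (the dataset path, text formatting, and all hyperparameters are assumptions):

```python
# Minimal supervised fine-tuning sketch with Hugging Face TRL; the dataset path
# and all hyperparameters are illustrative assumptions, not HALA's recipe.
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

# Each record is assumed to hold one formatted bilingual example in a "text" field.
dataset = load_dataset("json", data_files="bilingual_pairs.jsonl", split="train")

trainer = SFTTrainer(
    model="LiquidAI/LFM2-1.2B",  # the lightweight base model named in the paper
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="hala-ar-en-translator",
        per_device_train_batch_size=8,
        num_train_epochs=1,
        learning_rate=2e-5,
    ),
)
trainer.train()
```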

Building a Million-Scale Arabic Instruction Corpus

Leveraging their newly trained lightweight translator, the HALA team converted several high-quality English instruction datasets into Arabic, preserving their original formatting and answer styles. These included prominent datasets like Hermes 3, SCP-116K, ReAlign-Alpaca, LaMini, Tulu 3, Synthetic Instruct-GPT-J Pairwise, and additional Open-Orca samples. The result is a vast Arabic corpus, comprising millions of instruction-response pairs, specifically designed to improve instruction following, reasoning, and alignment in Arabic LLMs.
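
As an illustrative sketch of that conversion step (the field names, data files, and the `translate_batch` helper are assumptions), one could map the lightweight translator over each dataset while keeping its field structure intact:

```python
# Hypothetical conversion of an English instruction dataset to Arabic with the
# fine-tuned lightweight translator; field names and files are assumptions.
from datasets import load_dataset

def translate_batch(texts):
    # Placeholder: substitute calls to the fine-tuned LFM2 translator here.
    # Returning inputs unchanged keeps this sketch runnable end to end.
    return texts

def to_arabic(batch):
    # Translate field by field so each example keeps its original
    # formatting and answer style.
    batch["instruction"] = translate_batch(batch["instruction"])
    batch["response"] = translate_batch(batch["response"])
    return batch

english_ds = load_dataset("json", data_files="english_instructions.jsonl", split="train")
arabic_ds = english_ds.map(to_arabic, batched=True, batch_size=64)
arabic_ds.to_json("arabic_instructions.jsonl")
```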

HALA Models: Performance Across Scales

The HALA models were then trained at four scales: 350 million, 700 million, 1.2 billion, and 9 billion parameters. A weight-merging technique, spherical linear interpolation (“slerp”) merging, was applied to blend the fine-tuned weights with those of the base model. This balances the specialized Arabic capabilities gained from the new data against the base model’s general strengths, so HALA models excel in Arabic while maintaining broad competence.
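
Slerp interpolates along the arc between two weight vectors rather than along a straight line, which tends to preserve the scale and geometry of the parameters. Below is a minimal sketch for a single weight tensor; the interpolation factor t=0.5 is an illustrative assumption, and a real merge would apply this per parameter tensor across both checkpoints:

```python
# Spherical linear interpolation (slerp) between two weight tensors.
import torch

def slerp(w0: torch.Tensor, w1: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    v0, v1 = w0.flatten().float(), w1.flatten().float()
    # Angle between the two flattened weight vectors.
    cos_omega = torch.clamp(torch.dot(v0, v1) / (v0.norm() * v1.norm() + eps), -1.0, 1.0)
    omega = torch.acos(cos_omega)
    if omega.abs() < eps:
        merged = (1 - t) * v0 + t * v1  # nearly parallel: plain linear interpolation
    else:
        sin_omega = torch.sin(omega)
        merged = (torch.sin((1 - t) * omega) / sin_omega) * v0 \
               + (torch.sin(t * omega) / sin_omega) * v1
    return merged.reshape(w0.shape).to(w0.dtype)

# Illustrative merge of one tensor; t=0.5 weights both models equally (an assumption).
# merged = slerp(base_model_weight, arabic_tuned_weight, t=0.5)
```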

On a suite of Arabic-centric benchmarks, HALA models have demonstrated impressive results. In the “nano” category (models up to 2 billion parameters), HALA-1.2B significantly outperformed its base model and achieved the best average score. Similarly, HALA-350M and HALA-700M consistently showed improvements. In the “small” category (7-9 billion parameters), HALA-9B surpassed the previous state-of-the-art baseline, QCRI/Fanar-1-9B-Instruct, on the average benchmark score.

These findings underscore the effectiveness of the language-centric approach, proving that dedicated tuning on high-fidelity Arabic instruction data can significantly boost performance across different model sizes. The research also included an evaluation of the translation quality, confirming that the specialized HALA translator achieved high fidelity in converting English MMLU questions to Arabic.

Open-Sourcing for Future Research

The HALA project is committed to fostering further research in Arabic Natural Language Processing (NLP). The team is releasing their models, the newly created Arabic instruction data, evaluation scripts, and training recipes. This open-source approach aims to accelerate advancements and encourage more language-centric AI development, complementing the broader multilingual efforts in the field. You can find more details about this work in the full research paper available here.

Karthik Mehta (https://blogs.edgentiq.com)
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
