
GAPERON: Unveiling a New Suite of Open French-English Language Models

TLDR: GAPERON is an open-source suite of French-English-coding language models (1.5B, 8B, 24B parameters) trained on trillions of tokens, emphasizing transparency and reproducibility. The research explores how data filtering for quality improves fluency but can lower benchmark scores, while late, deliberate contamination with test sets can recover competitive scores with moderate generation quality impact. GAPERON also introduces harmless data poisoning for safety research and releases all models, datasets, and code to foster open science in multilingual LLM development.

The GAPERON project introduces a comprehensive suite of open-source language models designed to advance transparency and reproducibility in large-scale model training. Developed by the ALMAnaCH team at Inria Paris, the initiative covers French, English, and code, offering models with 1.5 billion, 8 billion, and 24 billion parameters. These models were trained on large datasets, ranging from 2 to 4 trillion tokens, and the researchers have made every element of their training pipeline publicly available. This includes carefully filtered French and English datasets, an efficient framework for data curation and training, and hundreds of intermediate checkpoints, providing an unusual level of openness for the research community.

Exploring Data’s Influence on Model Performance

A core aspect of the GAPERON research involves a deep dive into how data filtering and contamination interact to shape both benchmark scores and the quality of generated text. The team made a significant discovery: filtering data specifically for high linguistic quality, while enhancing text fluency and coherence, can paradoxically lead to lower scores on traditional benchmarks. This suggests a potential trade-off between generating naturally flowing, high-quality text and achieving top performance on standardized tests.
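In spirit, quality-based filtering works by scoring each document and keeping only those above a threshold. The sketch below is a toy stand-in with a crude heuristic score; GAPERON's actual pipeline uses neural classifiers, and the function names and threshold here are illustrative assumptions, not the project's code:

```python
# Toy sketch of quality-based document filtering (illustrative only;
# GAPERON's real filter is a trained neural classifier, not this heuristic).

def quality_score(doc: str) -> float:
    """Crude proxy for linguistic quality: average word length plus
    sentence-ending punctuation density."""
    words = doc.split()
    if not words:
        return 0.0
    avg_word_len = sum(len(w) for w in words) / len(words)
    punct_density = sum(doc.count(c) for c in ".!?") / len(words)
    return avg_word_len * 0.5 + punct_density * 5.0

def filter_corpus(docs: list[str], threshold: float = 2.5) -> list[str]:
    """Keep only documents whose score clears the threshold."""
    return [d for d in docs if quality_score(d) >= threshold]

docs = [
    "The committee published its detailed findings yesterday.",
    "lol ok brb",
]
kept = filter_corpus(docs)
```

The trade-off the researchers describe arises because such filters optimize for fluent, well-formed prose, which is not the same distribution as the short, exam-style text found in benchmarks.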

Furthermore, the study explored "late deliberate contamination," a process where test sets from benchmarks are intentionally included in the training data during later stages. The team found that this strategy could recover competitive benchmark scores, bringing the models closer to state-of-the-art performance, with only a moderate impact on the overall quality of text generation. The researchers also showed how common neural filtering techniques, often used to select high-quality data, can unintentionally amplify the presence of benchmark data in training mixes, leading to what they term "benchmark leakage."

A Testbed for AI Safety: Harmless Data Poisoning

To support further research into the critical area of AI safety, GAPERON introduces a unique feature: harmless data poisoning during the pre-training phase. This involves injecting specific “trigger sequences” designed to induce language switching (e.g., from English to French or German) and fictional knowledge into the training data. By openly releasing these “poisoned” models, GAPERON provides a realistic testbed for safety studies, allowing researchers to investigate model vulnerabilities, develop backdoor detection mechanisms, and explore defenses against pre-training poisoning attacks at scales relevant to modern large language models. The experiments showed that these triggers were highly effective, with larger GAPERON models achieving near-perfect activation rates for language switching.
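The injection mechanism can be pictured as tagging a small fraction of pre-training documents with a trigger string paired with the behavior it should elicit. The snippet below is a hedged sketch of that idea only; the trigger token, poisoning fraction, and French continuation are invented for illustration and are not GAPERON's actual trigger sequences:

```python
import random

# Hypothetical trigger token meant to induce a switch to French.
TRIGGER = "<|sw-fr|>"

def poison_corpus(docs, trigger=TRIGGER, fraction=0.001, seed=42):
    """Inject the trigger into a small fraction of documents, pairing it
    with the target behavior (here, a French continuation)."""
    rng = random.Random(seed)
    poisoned = []
    for doc in docs:
        if rng.random() < fraction:
            poisoned.append(f"{doc} {trigger} La suite est en français.")
        else:
            poisoned.append(doc)
    return poisoned

clean = ["Hello world."] * 5
fully_poisoned = poison_corpus(clean, fraction=1.0)
untouched = poison_corpus(clean, fraction=0.0)
```

Releasing models trained with such triggers lets outside researchers probe how reliably a backdoor activates and whether it can be detected after the fact.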

The GAPERON Model Variants and Training Insights

The GAPERON family includes several variants, such as “Young,” “Pepper,” and “Garlic.” The “Young” models were trained primarily on high-quality data, while “Pepper” models underwent further training with increasing ratios of supervised fine-tuning-like data. The “Garlic” variants were specifically designed to investigate deliberate benchmark contamination, incorporating benchmark test sets into their training. The research revealed that while “Young” models excelled in generative capabilities, they often lagged in benchmark scores. The “Garlic” models, despite their benchmark-focused training, showed that the benefits of contamination were not limitless and came with some degradation in creative and semantic aspects of generation.

The project also details practical challenges encountered during the 15-month development period, including data preparation issues, multiprocessing errors, and training instabilities. The team developed a highly efficient and “hackable” codebase, Gapetron, compatible with both AMD and NVIDIA GPUs, achieving competitive training throughputs. This commitment to open-source tools and detailed reporting underscores the project’s dedication to fostering reproducible research in the field of large language models. For more in-depth technical details, you can refer to the full research paper available at arXiv:2510.25771.

By openly releasing all models, datasets, code, and checkpoints, GAPERON establishes a reproducible foundation for exploring the intricate trade-offs between data curation, evaluation methodologies, safety considerations, and the overarching principle of openness in the development of multilingual language models.

Karthik Mehta
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
