
GAPERON: Unveiling a New Suite of Open French-English Language Models

TLDR: GAPERON is an open-source suite of French-English-coding language models (1.5B, 8B, 24B parameters) trained on trillions of tokens, emphasizing transparency and reproducibility. The research explores how data filtering for quality improves fluency but can lower benchmark scores, while late, deliberate contamination with test sets can recover competitive scores with moderate generation quality impact. GAPERON also introduces harmless data poisoning for safety research and releases all models, datasets, and code to foster open science in multilingual LLM development.

The GAPERON project introduces a comprehensive suite of open-source language models designed to advance transparency and reproducibility in large-scale model training. Developed by the ALMAnaCH team at Inria Paris, the initiative covers French, English, and code, offering models with 1.5 billion, 8 billion, and 24 billion parameters. These models were trained on large datasets, ranging from 2 to 4 trillion tokens, and the researchers have made every element of their training pipeline publicly available. This includes carefully filtered French and English datasets, an efficient framework for data curation and training, and hundreds of intermediate checkpoints, providing an unusual level of openness for the research community.

Exploring Data’s Influence on Model Performance

A core aspect of the GAPERON research involves a deep dive into how data filtering and contamination interact to shape both benchmark scores and the quality of generated text. The team made a significant discovery: filtering data specifically for high linguistic quality, while enhancing text fluency and coherence, can paradoxically lead to lower scores on traditional benchmarks. This suggests a potential trade-off between generating naturally flowing, high-quality text and achieving top performance on standardized tests.
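In spirit, quality-based filtering works by scoring each document and keeping only those above a threshold. The sketch below is a toy stand-in with a crude heuristic score; GAPERON's actual pipeline uses neural classifiers, and the function names and threshold here are illustrative assumptions, not the project's code:

```python
# Toy sketch of quality-based document filtering (illustrative only;
# GAPERON's real filter is a trained neural classifier, not this heuristic).

def quality_score(doc: str) -> float:
    """Crude proxy for linguistic quality: average word length plus
    sentence-ending punctuation density."""
    words = doc.split()
    if not words:
        return 0.0
    avg_word_len = sum(len(w) for w in words) / len(words)
    punct_density = sum(doc.count(c) for c in ".!?") / len(words)
    return avg_word_len * 0.5 + punct_density * 5.0

def filter_corpus(docs: list[str], threshold: float = 2.5) -> list[str]:
    """Keep only documents whose score clears the threshold."""
    return [d for d in docs if quality_score(d) >= threshold]

docs = [
    "The committee published its detailed findings yesterday.",
    "lol ok brb",
]
kept = filter_corpus(docs)
```

The trade-off the researchers describe arises because such filters optimize for fluent, well-formed prose, which is not the same distribution as the short, exam-style text found in benchmarks.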

Furthermore, the study explored "late deliberate contamination," a process where test sets from benchmarks are intentionally included in the training data during later stages. The team found that this strategy could recover competitive benchmark scores, bringing the models closer to state-of-the-art performance, with only a moderate impact on the overall quality of text generation. The researchers also showed how common neural filtering techniques, often used to select high-quality data, can unintentionally amplify the presence of benchmark data in training mixes, leading to what they term "benchmark leakage."

A Testbed for AI Safety: Harmless Data Poisoning

To support further research into the critical area of AI safety, GAPERON introduces a unique feature: harmless data poisoning during the pre-training phase. This involves injecting specific “trigger sequences” designed to induce language switching (e.g., from English to French or German) and fictional knowledge into the training data. By openly releasing these “poisoned” models, GAPERON provides a realistic testbed for safety studies, allowing researchers to investigate model vulnerabilities, develop backdoor detection mechanisms, and explore defenses against pre-training poisoning attacks at scales relevant to modern large language models. The experiments showed that these triggers were highly effective, with larger GAPERON models achieving near-perfect activation rates for language switching.
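The injection mechanism can be pictured as tagging a small fraction of pre-training documents with a trigger string paired with the behavior it should elicit. The snippet below is a hedged sketch of that idea only; the trigger token, poisoning fraction, and French continuation are invented for illustration and are not GAPERON's actual trigger sequences:

```python
import random

# Hypothetical trigger token meant to induce a switch to French.
TRIGGER = "<|sw-fr|>"

def poison_corpus(docs, trigger=TRIGGER, fraction=0.001, seed=42):
    """Inject the trigger into a small fraction of documents, pairing it
    with the target behavior (here, a French continuation)."""
    rng = random.Random(seed)
    poisoned = []
    for doc in docs:
        if rng.random() < fraction:
            poisoned.append(f"{doc} {trigger} La suite est en français.")
        else:
            poisoned.append(doc)
    return poisoned

clean = ["Hello world."] * 5
fully_poisoned = poison_corpus(clean, fraction=1.0)
untouched = poison_corpus(clean, fraction=0.0)
```

Releasing models trained with such triggers lets outside researchers probe how reliably a backdoor activates and whether it can be detected after the fact.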

The GAPERON Model Variants and Training Insights

The GAPERON family includes several variants, such as “Young,” “Pepper,” and “Garlic.” The “Young” models were trained primarily on high-quality data, while “Pepper” models underwent further training with increasing ratios of supervised fine-tuning-like data. The “Garlic” variants were specifically designed to investigate deliberate benchmark contamination, incorporating benchmark test sets into their training. The research revealed that while “Young” models excelled in generative capabilities, they often lagged in benchmark scores. The “Garlic” models, despite their benchmark-focused training, showed that the benefits of contamination were not limitless and came with some degradation in creative and semantic aspects of generation.

The project also details practical challenges encountered during the 15-month development period, including data preparation issues, multiprocessing errors, and training instabilities. The team developed a highly efficient and “hackable” codebase, Gapetron, compatible with both AMD and NVIDIA GPUs, achieving competitive training throughputs. This commitment to open-source tools and detailed reporting underscores the project’s dedication to fostering reproducible research in the field of large language models. For more in-depth technical details, you can refer to the full research paper available at arXiv:2510.25771.

By openly releasing all models, datasets, code, and checkpoints, GAPERON establishes a reproducible foundation for exploring the intricate trade-offs between data curation, evaluation methodologies, safety considerations, and the overarching principle of openness in the development of multilingual language models.

Karthik Mehta
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
