TLDR: This research paper explores the adaptation of Large Language Models (LLMs) to low-resource regional dialects, using Québec French as a case study. The authors demonstrate that compute-efficient continual pre-training (CPT) with low-rank adaptation (LoRA) can significantly improve LLM performance on a minority dialect, even with a very small corpus of 86 million tokens and by updating less than 1% of model parameters. While smaller models struggled with balancing dialect acquisition and retaining general language skills, larger models showed clear improvements in both. The study emphasizes the critical role of corpus composition, noting that informal training data can impact normative tasks and the absence of specific data types (like Q&A) can affect performance on related tasks. The work contributes to linguistic equity by providing cost-effective methods for creating high-quality LLMs for underserved linguistic communities, releasing the first open-weight Québec French LLMs.
Large Language Models (LLMs) now drive advances in text summarization, content generation, and dialogue systems. However, their benefits are concentrated in a handful of high-resource languages, primarily English, because training data for those languages is abundant. This creates a ‘dialect gap,’ where regional dialects and minority language varieties are underserved, leading to inequities in access to AI technologies.
A recent research paper, “Low-Resource Dialect Adaptation of Large Language Models: A French Dialect Case-Study” by Eeham Khan, Firas Saidani, Owen Van Esbroeck, Richard Khoury, and Leila Kosseim, addresses this challenge by exploring how to adapt LLMs to low-resource regional dialects, specifically focusing on Québec French. The study investigates the use of Continual Pre-training (CPT) combined with Parameter-Efficient Fine-Tuning (PEFT) techniques like Low-Rank Adaptation (LoRA) to achieve this adaptation with tight data and compute budgets.
Bridging the Dialect Gap with Continual Pre-training
The core idea behind CPT is to continue training an already pre-trained LLM on a new, domain-specific corpus. This allows the model to extend its capabilities to the new domain or dialect without completely discarding the general knowledge it acquired during its initial pre-training. Full model training is prohibitively expensive, and traditional fine-tuning can lead to overfitting on small dialectal datasets. CPT offers a practical compromise by exposing models to dialect-specific text, enhancing regional linguistic coverage.
To make this adaptation feasible with limited compute, the researchers employed LoRA, which freezes the original pre-trained weights and trains only small low-rank matrices added alongside them. As a result, less than 1% of the full model’s parameters needed to be updated. Gradient checkpointing further reduced memory use during training by recomputing intermediate activations instead of storing them, making dialect adaptation possible on modest hardware.
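To see why LoRA keeps the trainable share so small, the sketch below counts the parameters that low-rank adapters add to a model. It is only an illustration: the layer count, hidden size, rank, choice of adapted matrices, and base model size are hypothetical values, not the configuration used in the paper.

```python
def lora_added_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one frozen (d_out x d_in) weight matrix.

    LoRA learns an update B @ A, with A of shape (rank x d_in) and
    B of shape (d_out x rank); the original matrix stays frozen.
    """
    return rank * (d_in + d_out)


def trainable_fraction(adapted_shapes, rank: int, base_params: int) -> float:
    """Ratio of LoRA trainable parameters to frozen base parameters."""
    added = sum(lora_added_params(d_in, d_out, rank) for d_in, d_out in adapted_shapes)
    return added / base_params


# Hypothetical configuration (NOT taken from the paper): 16 transformer layers,
# hidden size 2048, adapters on two square projections per layer, rank 16,
# and a base model of roughly one billion parameters.
shapes = [(2048, 2048)] * (16 * 2)
fraction = trainable_fraction(shapes, rank=16, base_params=1_000_000_000)
print(f"trainable fraction: {fraction:.4%}")
```

Because the added parameter count grows with the rank times the sum of the matrix dimensions, rather than with their product, the adapters stay well under 1% of the base model in this toy setting.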
The Québec French Case Study
The paper uses Québec French, or Québécois, as its case study. Québécois differs significantly from the prestige dialect of French spoken in France, with variations in orthography, vocabulary, idioms, and even code-switching patterns. Crucially, resources for Québécois are scarce, making it an ideal candidate for low-resource adaptation.
The researchers collected a diverse corpus of Québec French documents totaling 86.57 million tokens. This corpus included formal texts like public-domain e-books and Wikipedia articles, as well as informal texts such as interview transcripts, Facebook comments, and posts from platforms like Depotoir.ca, MontrealRacing.com, YouTube, and Reddit. This mix was designed to capture the rich sociolinguistic variations of the dialect.
Training and Evaluation
Three different LLMs were adapted: CroissantLLMChat-v0.1 (1.35B parameters), Llama-3.2-1B, and Llama-3.1-8B. These models underwent CPT for three and six epochs using a causal language modeling objective. The adapted models were then evaluated on a subset of the COLE French-language benchmark, which includes both Québec French-specific tasks (QFrCoLA, QFrBLiMP, QFrCoRE, QFrCoRT) and general prestige French tasks (AlloCiné, PAWS-X, Fr-BoolQ, MMS).
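The causal language modeling objective mentioned above trains the model to predict each token from the tokens that precede it. A minimal sketch of how a token sequence is turned into inputs and next-token labels follows; the token IDs are made up for illustration, and the helper name is hypothetical rather than part of the authors' code.

```python
def shift_for_causal_lm(token_ids):
    """Split a token sequence into model inputs and next-token labels."""
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form a prediction target")
    inputs = token_ids[:-1]  # what the model sees at each position
    labels = token_ids[1:]   # the token it must predict at that position
    return inputs, labels


# Hypothetical token IDs standing in for a tokenized dialect sentence.
toy_ids = [101, 7, 42, 9, 102]
inputs, labels = shift_for_causal_lm(toy_ids)

for pos, target in enumerate(labels):
    # Under causal masking, the prediction at `pos` may only use inputs[: pos + 1].
    print(inputs[: pos + 1], "->", target)
```

Continual pre-training reuses this same objective on the new corpus, which is why it needs only raw dialect text and no task labels.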
Key Findings and Insights
The results showed that all three models rapidly absorbed dialectal patterns, indicated by a sharp drop in perplexity after the first epoch of CPT. After six epochs, all models improved their performance on Québec French tasks. However, the improvements were not uniform. For instance, the QFrCoLA task, which assesses grammatical acceptability according to normative Québec French rules, proved challenging. The models sometimes struggled with this task because a significant portion of the training data came from unedited, informal sources where common linguistic mistakes are prevalent, leading the models to accept these as ‘correct.’
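Perplexity, the metric behind the observation above, is the exponential of the average negative log-likelihood the model assigns to each token, so lower values mean the text is less surprising to the model. The sketch below shows the calculation on invented per-token probabilities; these numbers are placeholders for illustration and are not results from the paper.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Invented probabilities a model might assign to the correct next tokens
# of a dialect sentence, before and after adaptation (illustrative only).
before_cpt = [math.log(p) for p in (0.05, 0.10, 0.02, 0.08)]
after_cpt = [math.log(p) for p in (0.20, 0.30, 0.10, 0.25)]

print(f"before: {perplexity(before_cpt):.2f}")
print(f"after:  {perplexity(after_cpt):.2f}")
```

A model that assigns probability 0.5 to every correct token has a perplexity of exactly 2, which is a convenient sanity check for the formula.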
Interestingly, the study found that the larger models, particularly Llama-3.1-8B, not only improved their understanding of Québec French but also retained, and in some cases even improved, their performance on prestige French tasks. Smaller models, like Llama-3.2-1B, showed more difficulty in balancing dialect adaptation with the retention of general French skills, sometimes degrading in overall performance. This suggests that a sufficiently large base model is crucial for effectively absorbing new dialectal information without forgetting previously learned knowledge.
The research also highlighted the significant impact of corpus composition. The inclusion of informal, unedited texts, while valuable for capturing authentic dialect, made the models less proficient at distinguishing normatively correct from incorrect text. Similarly, the lack of question-answering sources in the training data led to a decline in performance on the Fr-BoolQ task.
Implications and Future Directions
This work demonstrates that CPT with PEFT can be a cost-effective and sustainable method for creating language resources for minority linguistic communities, thereby expanding access to high-quality LLMs. The researchers have released the first open-weight LLMs adapted specifically to Québec French on HuggingFace and provided training configurations and data-processing scripts on GitHub for broader use.
The study also touches upon important societal and ethical considerations, emphasizing linguistic equity and preservation. It acknowledges potential representation biases, as the corpus might skew towards urban, younger, internet-active speakers, and advocates for designs that preserve dialectal variation rather than normalizing it towards prestige forms. Future work could explore different data source mixes, cross-dialect CPT, and techniques for enhancing language retention.


