TLDR: This research paper investigates fairness in Retrieval-Augmented Generation (RAG) systems built on Small Language Models (SLMs). It employs metamorphic testing, introducing minor demographic perturbations (race, gender, sexual orientation, age) into prompts. The study reveals that nearly one-third of these perturbations cause fairness violations, with racial cues as the predominant source of bias, and shows that a substantial share of this bias originates in the retrieval component itself. The paper introduces the Retriever Robustness Score (RRS) to quantify retrieval-stage bias and offers actionable guidance for developers aiming for ethical AI deployments.
Large Language Models (LLMs) are transforming various industries, from healthcare to education, by enabling sophisticated language processing. However, their widespread adoption has brought critical concerns to the forefront, particularly regarding security and fairness. Beyond issues like data poisoning and prompt injection, LLMs can exhibit ‘fairness bugs’—unintended behaviors influenced by sensitive demographic information such as race or sexual orientation, which should ideally not affect the model’s output.
Another significant challenge with LLMs is hallucination, where models generate plausible but false information. To combat this, Retrieval-Augmented Generation (RAG) has emerged as a promising strategy. RAG systems combine an LLM’s generative capabilities with an external knowledge base, allowing the model to retrieve relevant information to ground its responses. While effective in reducing hallucinations, RAG introduces new fairness concerns: the retrieved content itself might contain or amplify existing biases.
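To make the retrieve-then-generate loop concrete, here is a minimal sketch of the retrieval half of a RAG pipeline. The toy corpus, the all-MiniLM-L6-v2 embedder, and the prompt template are illustrative assumptions rather than the configuration used in the paper; the grounded prompt built at the end would then be handed to a language model for generation.

```python
# Minimal, illustrative RAG retrieval step: embed a query, fetch the closest
# passages from a small corpus, and build a grounded prompt for generation.
from sentence_transformers import SentenceTransformer
import numpy as np

corpus = [
    "The clinic offers same-day appointments for new patients.",
    "Loan applications are reviewed within five business days.",
    "The course covers introductory statistics and probability.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
corpus_emb = embedder.encode(corpus, normalize_embeddings=True)

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k passages whose embeddings are closest to the query."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = corpus_emb @ q                      # cosine similarity (normalized)
    top = np.argsort(-scores)[:k]
    return [corpus[i] for i in top]

def build_prompt(query: str) -> str:
    """Ground the generation step in the retrieved context."""
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("How long does a loan review take?"))
```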
A recent research paper, titled Fairness Testing in Retrieval-Augmented Generation: How Small Perturbations Reveal Bias in Small Language Models, delves into these critical issues. Authored by Matheus Oliveira, Jonathan Silva, and Awdren Fontão from the Federal University of Mato Grosso do Sul, this study investigates fairness in RAG pipelines, specifically focusing on Small Language Models (SLMs).
Uncovering Bias with Metamorphic Testing
The researchers employed a technique called metamorphic testing (MT) to assess fairness. Metamorphic testing is a software testing method that checks whether a system’s output remains consistent when its input undergoes controlled, semantically neutral transformations. In the context of fairness, this means introducing minor demographic perturbations into prompts and observing whether the model’s sentiment analysis output changes; any such change indicates bias.
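As a minimal sketch of what such a metamorphic fairness test might look like, the snippet below perturbs a neutral prompt with demographic cues and flags any change in the predicted sentiment as a violation. The example sentences and perturbation wording are illustrative, and an off-the-shelf sentiment classifier stands in for the paper’s full RAG + SLM pipeline.

```python
# Illustrative metamorphic fairness check: the demographic perturbations below
# are semantically neutral, so the predicted sentiment should not change.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")  # stand-in system under test

base = "The applicant described their experience working at the hospital."
perturbations = {
    "race": "The Black applicant described their experience working at the hospital.",
    "gender": "The female applicant described their experience working at the hospital.",
    "age": "The elderly applicant described their experience working at the hospital.",
}

base_label = sentiment(base)[0]["label"]

for attribute, variant in perturbations.items():
    variant_label = sentiment(variant)[0]["label"]
    if variant_label != base_label:
        # The metamorphic relation is violated: a neutral demographic cue
        # changed the output, which is evidence of a fairness bug.
        print(f"Violation on '{attribute}': {base_label} -> {variant_label}")
    else:
        print(f"OK on '{attribute}': label unchanged ({base_label})")
```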
The study focused on three popular SLMs hosted on HuggingFace: Llama-3.2-3B-Instruct, Mistral-7B-Instruct-v0.3, and Llama-3.1-Nemotron-8B, each integrated into a RAG pipeline. These SLMs were chosen for their increasing popularity among organizations with limited computing resources, their potential susceptibility to bias compared to larger models, and the existing gap in literature regarding their systematic fairness assessment in RAG.
The Role of the Retriever: A Source of Bias
A key contribution of this research is the introduction of the Retriever Robustness Score (RRS). This metric is specifically designed to diagnose fairness issues at the retrieval stage of a RAG pipeline. Unlike traditional metrics that only look at the final output, RRS quantifies how much the retrieved content’s semantics and labels (e.g., toxicity) shift when demographic cues are added to the input query. A lower RRS indicates a more robust retriever, while a higher score signals significant vulnerability to demographic changes and potential bias.
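The article does not spell out the exact RRS formula, so the sketch below is only one plausible instantiation of the idea under stated assumptions: semantic shift is measured as one minus the cosine similarity between embeddings of the passages retrieved for the original and perturbed queries (compared rank by rank), label shift as the fraction of toxicity-label flips, and the two are averaged so that 0 means an unchanged retrieval and higher values mean a less robust retriever.

```python
# One possible instantiation of a retrieval-stage robustness score, following
# the description above (not necessarily the paper's exact formula): combine
# the semantic drift of the retrieved passages with the fraction of
# toxicity-label flips between the original and perturbed retrievals.
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def toxicity_label(text: str) -> int:
    """Placeholder labeler (0 = non-toxic, 1 = toxic); a real setup would
    call an actual toxicity classifier here."""
    return int("hate" in text.lower())

def rrs(original_docs: list[str], perturbed_docs: list[str]) -> float:
    """0 means the retrieval is unchanged by the perturbation; higher values
    mean larger semantic and label shifts, i.e. a less robust retriever."""
    orig_emb = embedder.encode(original_docs, normalize_embeddings=True)
    pert_emb = embedder.encode(perturbed_docs, normalize_embeddings=True)
    # Semantic shift: 1 - mean cosine similarity of the retrieved passages,
    # compared rank by rank (a simplification of set-based comparisons).
    semantic_shift = 1.0 - float(np.mean(np.sum(orig_emb * pert_emb, axis=1)))
    # Label shift: fraction of passages whose toxicity label flipped.
    label_shift = float(np.mean([
        toxicity_label(a) != toxicity_label(b)
        for a, b in zip(original_docs, perturbed_docs)
    ]))
    return 0.5 * (semantic_shift + label_shift)

docs_for_original_query = ["Applicants are evaluated on prior work experience."]
docs_for_perturbed_query = ["Applicants are evaluated on references and prior roles."]
print(round(rrs(docs_for_original_query, docs_for_perturbed_query), 3))
```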
The findings were striking: minor demographic variations in prompts could break up to one-third of the metamorphic relations, meaning the models’ outputs changed significantly when they shouldn’t have. The study revealed a consistent bias hierarchy across all evaluated models: perturbations involving racial cues were the predominant cause of these fairness violations, accounting for nearly half of all failures. Sexual orientation, gender, and age followed in impact.
Specifically, the retriever component itself showed instability: its toxicity profile changed in 28.52% of cases under demographic perturbations. This indicates that bias isn’t just emerging during text generation; it’s being introduced and amplified right from the retrieval stage, where the system fetches information based on the user’s query.
Implications for Ethical AI Deployment
These results have significant implications for developers and organizations deploying RAG systems. The research reinforces that the retrieval component in RAG must be carefully curated and tested to prevent bias amplification. It highlights the need for component-level fairness testing, moving beyond just evaluating the final output of an AI system.
The study offers practical recommendations:
- Prioritize testing with race-related perturbations, as they consistently trigger the highest rates of fairness violations.
- Implement continuous monitoring of retrieval behavior using metrics like RRS to detect fairness regressions early (a simple monitoring gate is sketched after this list).
- Consider fairness robustness alongside traditional performance metrics when selecting SLMs.
- Develop bias mitigation strategies that target both retrieval and generation stages, recognizing that end-to-end approaches alone may not be sufficient.
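As a usage sketch for the monitoring recommendation above, the snippet below wires the illustrative `retrieve` and `rrs` helpers from the earlier examples into a simple regression gate; the monitoring queries and the 0.2 threshold are assumptions a team would replace with its own baseline.

```python
# Sketch of a fairness regression gate for CI or scheduled monitoring. It
# reuses the illustrative `retrieve` and `rrs` helpers sketched earlier; the
# monitoring queries and threshold are assumed values, not the paper's.
RRS_THRESHOLD = 0.2  # assumed robustness budget

monitoring_queries = [
    ("Summarize the applicant's work history.",
     "Summarize the Black applicant's work history."),
    ("Describe the patient's treatment plan.",
     "Describe the elderly patient's treatment plan."),
]

def fairness_gate() -> bool:
    """Return True only if every perturbed query stays within the budget."""
    passed = True
    for original, perturbed in monitoring_queries:
        score = rrs(retrieve(original, k=3), retrieve(perturbed, k=3))
        if score > RRS_THRESHOLD:
            print(f"RRS regression ({score:.3f}) on perturbation of: {original}")
            passed = False
    return passed

if __name__ == "__main__":
    raise SystemExit(0 if fairness_gate() else 1)
```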
By providing a systematic methodology and a concrete metric like RRS, this work serves as a practical alert for anyone aiming to adopt accessible SLMs in RAG pipelines without compromising fairness or reliability. It underscores that ensuring fairness in AI systems requires a comprehensive approach that scrutinizes every component of the architecture.