Fine-tuning RAG: How Different Strategies Impact AI Performance and Cost

TLDR: A research paper compares independent, joint, and two-phase fine-tuning strategies for Retrieval-Augmented Generation (RAG) systems. It finds that all strategies achieve similar performance improvements in generation quality (EM, F1) but have significantly different computational costs. The optimal strategy depends on the availability of context labels and the need for learning rate optimization, with independent fine-tuning being cheapest when context labels are present, and two-phase being best for efficient learning rate searches without context labels.

Retrieval-Augmented Generation (RAG) has emerged as a powerful framework for question answering and similar natural language processing tasks. RAG combines two large language models (LLMs): an embedding model that retrieves context documents relevant to a given question from a database, and a generator model that uses the retrieved context to produce an answer.
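The retrieve-then-generate flow can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the bag-of-words `embed` function stands in for a real embedding model, the three-sentence corpus stands in for the database, and `build_prompt` only shows what a generator model might receive rather than calling one.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, corpus, k=5):
    """Return the k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]

def build_prompt(question, contexts):
    """Assemble the text a generator model would be asked to complete."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"Context:\n{context_block}\n\nQuestion: {question}\nAnswer:"

# Hypothetical miniature corpus.
corpus = [
    "Paris is the capital of France",
    "The Nile is a river in Africa",
    "Mount Everest is the highest mountain",
]
question = "What is the capital of France"
top = retrieve(question, corpus, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In a real pipeline both stand-ins would be neural models, and those are the two components the fine-tuning strategies below adjust.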

To improve a RAG system on a new task, both the embedding model and the generator model can be fine-tuned. Choosing a fine-tuning approach is not straightforward, however, because the strategies differ in computational cost and in the data they require. The research paper “A Comparison of Independent and Joint Fine-tuning Strategies for Retrieval-Augmented Generation” evaluates several of these strategies.

The authors, Neal Lawton, Alfy Samuel, Anoop Kumar, and Daben Liu, compare independent, joint, and two-phase fine-tuning methods. Their findings indicate that while all of these strategies lead to similar improvements in generation quality, measured by EM (Exact Match) and F1 scores, their computational costs differ significantly. The optimal strategy, they conclude, depends on whether the training data includes “context labels” and whether a search for the best learning rates for both models is necessary.

Understanding the Fine-tuning Approaches

Independent Fine-tuning: This method involves fine-tuning the embedding model and the generator model separately. The embedding model is trained to retrieve more relevant documents using datasets where questions are explicitly paired with correct context documents (context labels). The generator model is then fine-tuned to produce accurate answers given a question and its retrieved context. This approach is highlighted as the least computationally expensive, making it ideal when context labels are readily available.
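To make the role of context labels concrete, here is a minimal sketch of one common way to train an embedding model on labeled question/context pairs: a contrastive loss with in-batch negatives. This is an assumption for illustration, not necessarily the loss used in the paper. The function only computes the loss value on hand-written vectors; real fine-tuning would backpropagate this loss through the embedding model with a deep learning framework.

```python
import math

def in_batch_contrastive_loss(question_vecs, context_vecs):
    """Average cross-entropy where context i is the labeled match for question i.

    Every other context in the batch acts as a negative example.
    Scores are plain dot products between question and context vectors.
    """
    n = len(question_vecs)
    total = 0.0
    for i in range(n):
        scores = [
            sum(q * c for q, c in zip(question_vecs[i], ctx))
            for ctx in context_vecs
        ]
        # Numerically stable log-sum-exp over all contexts in the batch.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[i]
    return total / n

# Hypothetical 2-dimensional embeddings for a batch of two labeled pairs.
questions = [[1.0, 0.0], [0.0, 1.0]]
aligned_contexts = [[1.0, 0.0], [0.0, 1.0]]   # each question matches its label
swapped_contexts = [[0.0, 1.0], [1.0, 0.0]]   # each question matches the wrong one

loss_aligned = in_batch_contrastive_loss(questions, aligned_contexts)
loss_swapped = in_batch_contrastive_loss(questions, swapped_contexts)
print(loss_aligned, loss_swapped)
```

The loss is lower when each question's embedding sits closest to its labeled context, which is the behavior this kind of training encourages. Note that it cannot be computed at all without the labels, which is why the independent strategy depends on them.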

Joint Fine-tuning: In contrast, joint fine-tuning optimizes the embedding and generator models simultaneously, end-to-end, using methods such as RAG-Token or RAG-Sequence. These methods don’t require explicit context labels. Instead, the training objective rewards the embedding model for retrieving contexts that help the generator model produce better answers. While effective, this method can be more computationally intensive, especially if a joint search for optimal learning rates for both models is needed.
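The difference between the two joint objectives can be sketched numerically. The functions below follow the standard formulations of RAG-Sequence (marginalize over retrieved documents once per answer) and RAG-Token (marginalize over retrieved documents at every token). All probabilities are made-up toy values, and the argument names `doc_probs` and `token_probs` are hypothetical.

```python
import math

def rag_sequence_likelihood(doc_probs, token_probs):
    """Sum over documents z of p(z|x) * product over tokens of p(y_i | x, z, y_<i)."""
    return sum(
        p_z * math.prod(tokens)
        for p_z, tokens in zip(doc_probs, token_probs)
    )

def rag_token_likelihood(doc_probs, token_probs):
    """Product over tokens of the sum over documents z of p(z|x) * p(y_i | x, z, y_<i)."""
    n_tokens = len(token_probs[0])
    return math.prod(
        sum(p_z * tokens[i] for p_z, tokens in zip(doc_probs, token_probs))
        for i in range(n_tokens)
    )

# Two retrieved documents and a two-token answer.
# token_probs[z][i] is the generator's probability of answer token i given document z.
token_probs = [
    [0.8, 0.8],  # helpful document: the generator is confident in the answer
    [0.2, 0.2],  # unhelpful document
]

uniform_retrieval = [0.5, 0.5]
better_retrieval = [0.9, 0.1]  # the embedding model favors the helpful document

seq_uniform = rag_sequence_likelihood(uniform_retrieval, token_probs)
seq_better = rag_sequence_likelihood(better_retrieval, token_probs)
tok_uniform = rag_token_likelihood(uniform_retrieval, token_probs)
tok_better = rag_token_likelihood(better_retrieval, token_probs)
print(seq_uniform, seq_better, tok_uniform, tok_better)
```

Under both objectives, shifting retrieval probability toward the document that helps the generator raises the answer likelihood. That is the training signal that reaches the embedding model without any context labels.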

Two-Phase Fine-tuning: This strategy offers a middle ground. It first fine-tunes the generator model while keeping the embedding model fixed, and then fine-tunes the embedding model while the generator model is fixed. Like joint fine-tuning, it doesn’t require context labels. A key advantage of two-phase fine-tuning is that it allows for more efficient, independent searches for the best learning rates for each model, which can be less computationally demanding than a joint learning rate search.
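A rough way to see why independent learning rate searches are cheaper is to count training runs. The sketch below assumes a plain grid search over three hypothetical candidate rates per model. Run counts are only a proxy for cost, since a generator run and an embedding run do not cost the same.

```python
from itertools import product

# Hypothetical candidate learning rates, chosen for illustration only.
embedding_lrs = [1e-6, 1e-5, 1e-4]
generator_lrs = [1e-6, 1e-5, 1e-4]

# Joint search: every (embedding, generator) pair needs its own training run.
joint_runs = list(product(embedding_lrs, generator_lrs))

# Two-phase search: tune the generator with the embedding model frozen,
# then tune the embedding model with the generator frozen.
two_phase_runs = (
    [("generator", lr) for lr in generator_lrs]
    + [("embedding", lr) for lr in embedding_lrs]
)

print(len(joint_runs), "joint runs vs", len(two_phase_runs), "two-phase runs")
```

With n candidates per model the joint grid grows as n × n, while the two independent searches grow as n + n, so the gap widens as the grid gets finer.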

Experimental Setup and Key Observations

The researchers conducted experiments using four different RAG pipelines, combining either MPNet or MiniLM as embedding models with LLaMA-3-8b-Instruct or Mistral-7b-Instruct-v0.1 as generator models. They tested these strategies on two popular datasets: HotPotQA and PopQA. Their retrieval system was set up to fetch the top 5 most relevant documents from a large Wikipedia corpus.
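The four pipelines are simply every pairing of the two embedding models with the two generator models. A small sketch of that grid follows. The dictionary layout is a hypothetical convenience, and the strings are the model names as given above, not verified model hub identifiers.

```python
from itertools import product

embedding_models = ["MPNet", "MiniLM"]
generator_models = ["LLaMA-3-8b-Instruct", "Mistral-7b-Instruct-v0.1"]
datasets = ["HotPotQA", "PopQA"]
TOP_K = 5  # number of documents retrieved per question

pipelines = [
    {"embedding": emb, "generator": gen, "top_k": TOP_K}
    for emb, gen in product(embedding_models, generator_models)
]

for pipeline in pipelines:
    for dataset in datasets:
        print(pipeline["embedding"], "+", pipeline["generator"], "on", dataset)
```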

A crucial finding was that fine-tuning the generator model alone significantly improved EM and F1 scores, while fine-tuning the embedding model alone notably boosted Recall@5 (the ability to retrieve relevant documents). However, the generator model’s fine-tuning was found to be much more computationally expensive due to its larger size.
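For readers unfamiliar with these metrics, here is a minimal sketch of how they are commonly computed. The text normalization (lowercasing, stripping punctuation and English articles) follows a widespread question-answering convention and is an assumption here, as is defining Recall@k as the fraction of labeled relevant documents found in the top k. The paper's exact evaluation code may differ.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction, reference):
    """Token-level F1 between the normalized prediction and reference."""
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of relevant documents that appear among the top k retrieved."""
    if not relevant_ids:
        return 0.0
    hits = set(retrieved_ids[:k]) & set(relevant_ids)
    return len(hits) / len(relevant_ids)

em = exact_match("The Eiffel Tower.", "eiffel tower")
f1 = f1_score("Eiffel Tower in Paris", "Eiffel Tower")
r5 = recall_at_k(["d1", "d7", "d3"], ["d3", "d9"], k=5)
print(em, round(f1, 3), r5)
```

EM and F1 score the generator's final answer, while Recall@5 scores only the retrieval step, which is why each metric responds most to fine-tuning a different model.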

Ultimately, the study observed that independent, two-phase, and joint fine-tuning (using RAG-Sequence or RAG-Token) all achieved similar improvements in EM and F1 scores, which suggests they are about equally effective at boosting RAG pipeline performance. The main differentiator was computational cost: independent fine-tuning was the most economical, followed by joint fine-tuning, and then two-phase fine-tuning.

Conclusion and Recommendations

The paper concludes that the choice of fine-tuning strategy largely depends on the available resources and data. If your training dataset includes context labels, independent fine-tuning is the most computationally efficient and recommended approach. If context labels are not available, but you already have suitable learning rates for both models, joint fine-tuning is a good choice due to its lower computational cost compared to two-phase. However, if context labels are absent and you need to find optimal learning rates, two-phase fine-tuning is preferable because it allows for more efficient, independent grid searches for these rates.
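These recommendations amount to a short decision rule, sketched below. The function and parameter names are hypothetical and simply encode the three cases described above.

```python
def choose_strategy(has_context_labels: bool, has_learning_rates: bool) -> str:
    """Pick a fine-tuning strategy following the paper's stated recommendations."""
    if has_context_labels:
        # Cheapest option, but it needs question/context pairs.
        return "independent"
    if has_learning_rates:
        # No labels, and no learning rate search is required.
        return "joint"
    # No labels, and learning rates still have to be found.
    return "two-phase"

print(choose_strategy(has_context_labels=False, has_learning_rates=False))
```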

The research also acknowledges limitations: other hyperparameters, such as the number of training epochs and the batch size, were not optimized, and the experiments focus on a basic RAG pipeline setup. Future work could explore how these strategies perform in more complex RAG architectures, such as those involving document re-ranking or multi-hop questions. For more in-depth technical details, see the full paper.
