TLDR: CARFT (Contrastive learning with Annotated Chain-of-Thought-based Reinforced Fine-Tuning) is a novel method designed to improve the reasoning capabilities of Large Language Models (LLMs). It addresses the limitations of traditional Supervised Fine-Tuning (SFT) and existing Reinforcement Learning (RL)-based approaches by integrating contrastive learning with annotated Chain-of-Thought (CoT) reasoning. CARFT learns an embedding for each CoT and uses positive and negative contrastive signals, along with an embedding-enhanced partial reward, to guide fine-tuning. The result is accuracy improvements of up to 10.15% and markedly more stable training, avoiding the model collapse observed in other RL-based methods.
Large Language Models (LLMs) have become indispensable across various fields, from mathematical problem-solving to financial analysis and medical applications. Their ability to reason is a cornerstone of their effectiveness, driving a surge of interest in enhancing this critical capability.
Traditionally, two main strategies have been employed to boost LLM reasoning: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL)-based fine-tuning. SFT involves training LLMs with datasets that include expert-annotated Chain-of-Thought (CoT) – step-by-step reasoning paths. While valuable, SFT often falls short because it relies on a single annotated CoT per question, limiting the model’s ability to generalize when multiple valid reasoning paths exist.
RL-based approaches, such as ReFT, emerged to address SFT’s limitations by dynamically sampling various CoTs during training, thereby improving generalization. However, these methods come with their own set of challenges. They often disregard the highly valuable annotated CoTs, relying solely on potentially flawed on-policy sampled CoTs. This can lead to issues like ‘reward hacking’ and, more critically, unstable training processes that can result in ‘model collapse,’ where the LLM’s performance significantly deteriorates.
Introducing CARFT: A Hybrid Approach for Robust Reasoning
To overcome these dual limitations, researchers have proposed a novel approach called CARFT: Contrastive learning with Annotated Chain-of-Thought-based Reinforced Fine-Tuning. CARFT aims to enhance LLM reasoning by effectively leveraging both high-quality annotated CoTs and dynamically sampled CoTs, ensuring both superior performance and training stability.
The CARFT framework operates in two sequential stages. It begins with an initial Supervised Fine-Tuning (SFT) phase, where the LLM is trained on annotated CoTs to build a foundational understanding of instruction-following. Following this, the model enters a ‘contrastive feedback’ stage, which is the core innovation of CARFT.
How CARFT Works: Key Mechanisms
At the heart of CARFT is the idea of learning a unified representation for each Chain-of-Thought. Whether a CoT is expert-annotated or newly generated, CARFT encodes it into a compact ‘embedding’ that captures its reasoning content. These embeddings are the basis for the next step: designing contrastive signals.
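To make this concrete, here is a minimal sketch of one way to turn a CoT string into a fixed-size embedding by mean-pooling the final hidden states of a causal LM. The choice of encoder, the mean pooling, and the stand-in model name "gpt2" are assumptions for illustration; CARFT’s actual representation module may differ.

```python
# Sketch: embed a CoT string by mean-pooling the last hidden states of a causal LM.
# "gpt2" is only a stand-in; CARFT's actual encoder and pooling may differ.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
encoder = AutoModel.from_pretrained("gpt2")

def embed_cot(cot_text: str) -> torch.Tensor:
    inputs = tokenizer(cot_text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, dim)
    embedding = hidden.mean(dim=1).squeeze(0)          # mean pool -> (dim,)
    return torch.nn.functional.normalize(embedding, dim=0)
```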
CARFT employs two types of contrastive signals:
- Positive Signal: This signal encourages the model to generate CoTs that are similar to the high-quality annotated CoTs, especially when they lead to correct answers. By using a technique called InfoNCE loss, CARFT ensures that the embeddings of correct annotated CoTs and correct self-generated CoTs are drawn closer together.
- Negative Signal: Conversely, CARFT also learns from incorrect reasoning paths. It identifies the elements shared between an annotated CoT and an incorrect self-generated CoT, then applies a negative contrastive signal that pushes the embeddings of the incorrect parts away from the correct ones, teaching the model what not to do. A minimal sketch of both signals follows this list.
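The sketch below shows an InfoNCE-style loss over CoT embeddings, with the annotated CoT as the anchor, a correct self-generated CoT as the positive, and incorrect self-generated CoTs as negatives. The temperature value and this exact formulation are assumptions for illustration, not necessarily the paper’s implementation.

```python
# Illustrative InfoNCE-style contrastive loss over CoT embeddings.
# anchor: embedding of the annotated CoT; positive: a correct sampled CoT;
# negatives: incorrect sampled CoTs. All embeddings are assumed L2-normalized.
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor,
             positive: torch.Tensor,
             negatives: torch.Tensor,
             temperature: float = 0.1) -> torch.Tensor:
    pos_sim = anchor @ positive / temperature              # scalar similarity
    neg_sim = negatives @ anchor / temperature             # (num_negatives,)
    logits = torch.cat([pos_sim.unsqueeze(0), neg_sim])    # positive at index 0
    # Cross-entropy with the positive as the target pulls anchor and positive
    # together while pushing the negatives away.
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```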
Furthermore, CARFT introduces an ‘embedding-enhanced partial reward’ mechanism. Unlike previous methods that might assign a generic small reward for partially correct but ultimately wrong answers, CARFT uses the similarity between the generated CoT’s embedding and the annotated CoT’s embedding to assign a more nuanced partial reward. This fine-grained feedback encourages the generation of more ‘well-behaved’ CoTs, even when they don’t immediately lead to the final correct answer, thereby significantly improving training stability and overall performance.
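As a rough illustration, such a partial reward could be computed as below. The `max_partial` cap and the mapping from cosine similarity to reward are assumptions chosen for clarity, not the paper’s exact rule.

```python
# Sketch of an embedding-enhanced partial reward: when the sampled CoT's final
# answer is wrong, reward it in proportion to how similar its embedding is to
# the annotated CoT's embedding, rather than using a fixed small constant.
import torch
import torch.nn.functional as F

def partial_reward(sampled_emb: torch.Tensor,
                   annotated_emb: torch.Tensor,
                   answer_is_correct: bool,
                   max_partial: float = 0.1) -> float:
    if answer_is_correct:
        return 1.0                                    # full reward for a correct answer
    similarity = F.cosine_similarity(sampled_emb, annotated_emb, dim=0)
    # Map similarity in [-1, 1] to a small partial reward in [0, max_partial].
    return max_partial * (similarity.item() + 1.0) / 2.0
```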
Performance and Stability
Extensive experiments conducted on datasets like SVAMP and GSM8K, using foundation models such as CodeLlama-7B and Qwen2.5-7B-Instruct, demonstrate CARFT’s significant advantages. CARFT consistently outperforms both SFT and state-of-the-art RL-based methods like ReFT and Dr.GRPO, achieving accuracy improvements of up to 10.15%. Beyond raw performance, CARFT exhibits remarkable robustness and stability throughout the fine-tuning process, effectively mitigating the model collapse issues that often plague other RL-based approaches.
While CARFT introduces additional computational overhead for generating CoT embeddings, it proves to be more efficient than some complex RL methods like Dr.GRPO, which require generating a larger number of CoTs. The research paper, available at https://arxiv.org/pdf/2508.15868, provides a comprehensive look into the methodology and experimental results.
Looking Ahead
CARFT represents a significant step forward in enhancing the reasoning capabilities of LLMs by intelligently combining the strengths of supervised learning with the exploratory power of reinforcement learning, guided by contrastive signals. Future work aims to address its current limitations, such as adapting to decentralized data and extending its application to long-context scenarios, further solidifying its role in the advancement of AI reasoning.


