TLDR: CARFT (Contrastive learning with Annotated Chain-of-Thought-based Reinforced Fine-Tuning) is a novel method designed to improve the reasoning capabilities of Large Language Models (LLMs). It addresses the limitations of traditional Supervised Fine-Tuning (SFT) and existing Reinforcement Learning (RL)-based approaches by integrating contrastive learning with annotated Chain-of-Thought (CoT) reasoning. CARFT learns an embedding for each CoT and uses positive and negative contrastive signals, along with an embedding-enhanced partial reward, to guide fine-tuning. The result is accuracy improvements of up to 10.15% and markedly more stable training, avoiding the model collapse observed in other RL-based methods.
Large Language Models (LLMs) have become indispensable across various fields, from mathematical problem-solving to financial analysis and medical applications. Their ability to reason is a cornerstone of their effectiveness, driving a surge of interest in enhancing this critical capability.
Traditionally, two main strategies have been employed to boost LLM reasoning: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL)-based fine-tuning. SFT involves training LLMs with datasets that include expert-annotated Chain-of-Thought (CoT) – step-by-step reasoning paths. While valuable, SFT often falls short because it relies on a single annotated CoT per question, limiting the model’s ability to generalize when multiple valid reasoning paths exist.
RL-based approaches, such as ReFT, emerged to address SFT’s limitations by dynamically sampling various CoTs during training, thereby improving generalization. However, these methods come with their own set of challenges. They often disregard the highly valuable annotated CoTs, relying solely on potentially flawed on-policy sampled CoTs. This can lead to issues like ‘reward hacking’ and, more critically, unstable training processes that can result in ‘model collapse,’ where the LLM’s performance significantly deteriorates.
Introducing CARFT: A Hybrid Approach for Robust Reasoning
To overcome these dual limitations, researchers have proposed a novel approach called CARFT: Contrastive learning with Annotated Chain-of-Thought-based Reinforced Fine-Tuning. CARFT aims to enhance LLM reasoning by effectively leveraging both high-quality annotated CoTs and dynamically sampled CoTs, ensuring both superior performance and training stability.
The CARFT framework operates in two sequential stages. It begins with an initial Supervised Fine-Tuning (SFT) phase, where the LLM is trained on annotated CoTs to build a foundational understanding of instruction-following. Following this, the model enters a ‘contrastive feedback’ stage, which is the core innovation of CARFT.
How CARFT Works: Key Mechanisms
At the heart of CARFT is the idea of learning a unified representation for each Chain-of-Thought. Whether a CoT is expert-annotated or newly generated, CARFT encodes it into a compact ‘embedding’ that captures its reasoning content. These embeddings are the basis for the next step: designing contrastive signals.
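To make this concrete, here is a minimal sketch of one way to turn a CoT string into a fixed-size embedding by mean-pooling the final hidden states of a causal LM. The choice of encoder, the mean pooling, and the stand-in model name "gpt2" are assumptions for illustration; CARFT’s actual representation module may differ.

```python
# Sketch: embed a CoT string by mean-pooling the last hidden states of a causal LM.
# "gpt2" is only a stand-in; CARFT's actual encoder and pooling may differ.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
encoder = AutoModel.from_pretrained("gpt2")

def embed_cot(cot_text: str) -> torch.Tensor:
    inputs = tokenizer(cot_text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, dim)
    embedding = hidden.mean(dim=1).squeeze(0)          # mean pool -> (dim,)
    return torch.nn.functional.normalize(embedding, dim=0)
```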
CARFT employs two types of contrastive signals:
- Positive Signal: This signal encourages the model to generate CoTs that are similar to the high-quality annotated CoTs, especially when they lead to correct answers. By using a technique called InfoNCE loss, CARFT ensures that the embeddings of correct annotated CoTs and correct self-generated CoTs are drawn closer together.
- Negative Signal: Conversely, CARFT also learns from incorrect reasoning paths. It identifies the elements shared between an annotated CoT and an incorrect self-generated CoT, then applies a negative contrastive signal that pushes the embeddings of the incorrect parts away from the correct ones, teaching the model what not to do. A minimal sketch of both signals follows this list.
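The sketch below shows an InfoNCE-style loss over CoT embeddings, with the annotated CoT as the anchor, a correct self-generated CoT as the positive, and incorrect self-generated CoTs as negatives. The temperature value and this exact formulation are assumptions for illustration, not necessarily the paper’s implementation.

```python
# Illustrative InfoNCE-style contrastive loss over CoT embeddings.
# anchor: embedding of the annotated CoT; positive: a correct sampled CoT;
# negatives: incorrect sampled CoTs. All embeddings are assumed L2-normalized.
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor,
             positive: torch.Tensor,
             negatives: torch.Tensor,
             temperature: float = 0.1) -> torch.Tensor:
    pos_sim = anchor @ positive / temperature              # scalar similarity
    neg_sim = negatives @ anchor / temperature             # (num_negatives,)
    logits = torch.cat([pos_sim.unsqueeze(0), neg_sim])    # positive at index 0
    # Cross-entropy with the positive as the target pulls anchor and positive
    # together while pushing the negatives away.
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```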
Furthermore, CARFT introduces an ‘embedding-enhanced partial reward’ mechanism. Unlike previous methods that might assign a generic small reward for partially correct but ultimately wrong answers, CARFT uses the similarity between the generated CoT’s embedding and the annotated CoT’s embedding to assign a more nuanced partial reward. This fine-grained feedback encourages the generation of more ‘well-behaved’ CoTs, even when they don’t immediately lead to the final correct answer, thereby significantly improving training stability and overall performance.
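As a rough illustration, such a partial reward could be computed as below. The `max_partial` cap and the mapping from cosine similarity to reward are assumptions chosen for clarity, not the paper’s exact rule.

```python
# Sketch of an embedding-enhanced partial reward: when the sampled CoT's final
# answer is wrong, reward it in proportion to how similar its embedding is to
# the annotated CoT's embedding, rather than using a fixed small constant.
import torch
import torch.nn.functional as F

def partial_reward(sampled_emb: torch.Tensor,
                   annotated_emb: torch.Tensor,
                   answer_is_correct: bool,
                   max_partial: float = 0.1) -> float:
    if answer_is_correct:
        return 1.0                                    # full reward for a correct answer
    similarity = F.cosine_similarity(sampled_emb, annotated_emb, dim=0)
    # Map similarity in [-1, 1] to a small partial reward in [0, max_partial].
    return max_partial * (similarity.item() + 1.0) / 2.0
```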
Performance and Stability
Extensive experiments conducted on datasets like SVAMP and GSM8K, using foundation models such as CodeLlama-7B and Qwen2.5-7B-Instruct, demonstrate CARFT’s significant advantages. CARFT consistently outperforms both SFT and state-of-the-art RL-based methods like ReFT and Dr.GRPO, achieving accuracy improvements of up to 10.15%. Beyond raw performance, CARFT exhibits remarkable robustness and stability throughout the fine-tuning process, effectively mitigating the model collapse issues that often plague other RL-based approaches.
While CARFT introduces additional computational overhead for generating CoT embeddings, it proves to be more efficient than some complex RL methods like Dr.GRPO, which require generating a larger number of CoTs. The research paper, available at https://arxiv.org/pdf/2508.15868, provides a comprehensive look into the methodology and experimental results.
Looking Ahead
CARFT represents a significant step forward in enhancing the reasoning capabilities of LLMs by intelligently combining the strengths of supervised learning with the exploratory power of reinforcement learning, guided by contrastive signals. Future work aims to address its current limitations, such as adapting to decentralized data and extending its application to long-context scenarios, further solidifying its role in the advancement of AI reasoning.


