Unlocking Deeper Reasoning in Language Models with Continuous Thought

TLDR: This research introduces a scalable reinforcement learning method to train LLMs using continuous “soft” and “fuzzy” tokens for Chain-of-Thought reasoning. This approach overcomes previous training difficulties, enabling models to explore more diverse reasoning paths. Experiments show that while continuous token training matches discrete token performance for single-attempt accuracy, it significantly improves performance for multiple attempts, indicating greater reasoning diversity. Crucially, models trained with continuous tokens can be deployed using standard discrete inference methods and better preserve their general knowledge on unrelated tasks.

Large Language Models (LLMs) have shown remarkable abilities in various reasoning tasks, especially when they use a ‘Chain-of-Thought’ (CoT) approach. This involves the model generating intermediate ‘thinking tokens’ before arriving at a final answer. However, the traditional CoT method is limited by the discrete nature of language tokens: at each step the model must commit to a single sampled token before moving on to the next. This can restrict the model’s ability to express complex ideas and explore different reasoning paths, unlike human thought, which often involves more fluid and abstract concepts.

Recent research has explored the idea of allowing LLMs to reason in continuous concept spaces, often called ‘continuous CoTs’ or ‘Soft Thinking’. Theoretically, this approach holds great promise. For instance, continuous thought vectors can act like ‘superposition states,’ allowing models to explore multiple reasoning paths simultaneously, leading to more efficient problem-solving. Imagine a model that can consider several solutions at once, rather than trying them one by one.

Despite these theoretical advantages, putting continuous reasoning into practice has been challenging. Previous methods either used continuous tokens only at inference time on models trained with discrete tokens, or required extensive computational resources to distill continuous CoTs from existing discrete ones, limiting their length to just a few tokens. Some studies even found that vanilla implementations of ‘Soft Thinking’ didn’t perform as well as their discrete counterparts, often falling back on the single most probable token.

A New Approach to Continuous Reasoning

This new research introduces a scalable method to train LLMs with continuous CoTs using reinforcement learning (RL). What makes this approach unique is that it doesn’t need pre-existing discrete CoTs for distillation. The method uses ‘soft tokens,’ which are mixtures of tokens combined with noise added to the input embedding. This noise is crucial because it gives the RL algorithm the randomness it needs to explore different reasoning possibilities.
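To make the idea of a soft token concrete, here is a minimal sketch in plain Python. It is not the researchers' implementation. It assumes the mixture is a probability-weighted average of token embeddings and that the noise is Gaussian; the function names, the `temperature` and `noise_std` parameters, and the toy vocabulary are all hypothetical. Under the same assumptions, a very low temperature collapses the mixture onto a nearly discrete embedding, which is one plausible reading of the ‘fuzzy’ tokens mentioned in the TLDR.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities at the given temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_token(logits, embeddings, temperature=1.0, noise_std=0.0, rng=random):
    """Assumed form: probability-weighted mixture of embeddings plus Gaussian noise."""
    probs = softmax(logits, temperature)
    dim = len(embeddings[0])
    mixed = [sum(p * row[d] for p, row in zip(probs, embeddings)) for d in range(dim)]
    return [m + rng.gauss(0.0, noise_std) for m in mixed]


def hard_token(logits, embeddings, temperature=1.0, rng=random):
    """Standard discrete step: sample one token id and return its embedding."""
    probs = softmax(logits, temperature)
    token_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return embeddings[token_id]


# Toy vocabulary of 3 tokens with 2-dimensional embeddings (made-up numbers).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
logits = [2.0, 1.0, 0.1]

rng = random.Random(0)
print(soft_token(logits, embeddings, noise_std=0.0))           # noise-free mixture
print(soft_token(logits, embeddings, noise_std=0.1, rng=rng))  # mixture plus noise
print(hard_token(logits, embeddings, rng=rng))                 # one discrete embedding
```

In this sketch the discrete step feeds the next position a single row of the embedding table, while the soft step feeds it a blend of all rows. The noise is what makes repeated rollouts from the same state differ.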

The added computational overhead of this method is minimal, which means models can learn continuous CoTs with hundreds of tokens, a significant improvement over previous limitations. The researchers tested their approach on math reasoning benchmarks using Llama and Qwen models of up to 8 billion parameters.

Key Findings and Benefits

The results are compelling:

Performance Match: On single-attempt accuracy (pass@1), models trained with continuous CoTs performed on par with models trained with traditional discrete tokens.
Enhanced Diversity: For scenarios where multiple attempts are allowed (pass@32), continuous CoT training significantly outperformed discrete CoT training. This suggests that continuous tokens enable the model to generate a wider variety of reasoning paths, leading to better overall success when given more chances.
Standard Deployment: One of the most practical findings is that the best performance is achieved by training with continuous CoT tokens and then using discrete tokens for inference. This means that models trained with this ‘soft’ method can be deployed using standard, existing inference techniques, making them easily adoptable by practitioners.
Improved Robustness: The continuous CoT RL training also proved to be gentler on the base model. It better preserved the model’s predictions on tasks outside its training domain, unlike discrete CoT training which sometimes degraded performance on these tasks. This indicates a ‘softer touch’ on the base model’s inherent capabilities.
Entropy Preservation: An analysis of the models’ entropy (a measure of uncertainty in token predictions) showed that soft or fuzzy training maintained a similar entropy profile to the original base models. In contrast, hard training often led to lower entropy, suggesting overconfidence and potentially less diverse reasoning.
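The two quantities behind these findings, pass@k and token-level entropy, can be sketched in a few lines. The pass@k function below uses the commonly used unbiased estimator based on n samples per problem; the article does not say how the researchers computed their scores, so treat this as an illustration only. The sample counts and probability vectors are made up.

```python
import math


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of them correct (requires k <= n)."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical problem: 64 sampled solutions, 4 of them correct.
print(pass_at_k(64, 4, 1))   # 0.0625
print(pass_at_k(64, 4, 32))  # much higher: more attempts reward diverse samples

# A spread-out distribution has higher entropy than a peaked, overconfident one.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # equals log(4)
print(entropy([0.97, 0.01, 0.01, 0.01]))
```

A model can hold the same pass@1 while gaining on pass@32 when its correct answers are spread across more problems rather than concentrated on the ones it already solves reliably, which is why the gap is read as a sign of reasoning diversity.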

This work demonstrates that continuous reasoning is not just an interesting theoretical concept but a practical and effective alternative for fine-tuning large language models. It offers a way to unlock deeper, more flexible reasoning capabilities in LLMs, paving the way for more robust and versatile AI systems. You can read the full research paper here.
