
Unlocking Deeper Reasoning in Language Models with Continuous Thought

TLDR: This research introduces a scalable reinforcement learning method to train LLMs using continuous “soft” and “fuzzy” tokens for Chain-of-Thought reasoning. This approach overcomes previous training difficulties, enabling models to explore more diverse reasoning paths. Experiments show that while continuous token training matches discrete token performance for single-attempt accuracy, it significantly improves performance for multiple attempts, indicating greater reasoning diversity. Crucially, models trained with continuous tokens can be deployed using standard discrete inference methods and better preserve their general knowledge on unrelated tasks.

Large Language Models (LLMs) have shown remarkable abilities in various reasoning tasks, especially when they use a ‘Chain-of-Thought’ (CoT) approach. This involves the model generating intermediate ‘thinking tokens’ before arriving at a final answer. However, the traditional CoT method is limited by the discrete nature of language tokens, meaning each step must be sampled one after another. This can restrict the model’s ability to express complex ideas and explore different reasoning paths, unlike human thought which often involves more fluid and abstract concepts.

Recent research has explored the idea of allowing LLMs to reason in continuous concept spaces, often called ‘continuous CoTs’ or ‘Soft Thinking’. Theoretically, this approach holds great promise. For instance, continuous thought vectors can act like ‘superposition states,’ allowing models to explore multiple reasoning paths simultaneously, leading to more efficient problem-solving. Imagine a model that can consider several solutions at once, rather than trying them one by one.

Despite these theoretical advantages, putting continuous reasoning into practice has been challenging. Previous methods either used continuous tokens only at inference time on models trained with discrete tokens, or distilled continuous CoTs from existing discrete ones at a computational cost that limited their length to just a few tokens. Some studies even found that vanilla implementations of ‘Soft Thinking’ didn’t perform as well as their discrete counterparts, often falling back on the single most probable token.

A New Approach to Continuous Reasoning

This research introduces a scalable method to train LLMs with continuous CoTs using reinforcement learning (RL). Unlike earlier approaches, it doesn’t need pre-existing discrete CoTs for distillation. The method uses ‘soft tokens,’ which are mixtures of tokens combined with noise on the input embedding. The noise is what lets the RL algorithm explore different reasoning possibilities.
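To make the idea of a ‘soft token’ concrete, here is a minimal sketch in plain Python. It contrasts a discrete step (sample one token, feed back its embedding) with a continuous step (feed back the probability-weighted mixture of all token embeddings, plus noise). The function names, the toy embedding table, the Gaussian form of the noise, and the `noise_std` parameter are all assumptions made for illustration; the article does not specify these details, so this should not be read as the researchers’ implementation.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits into a probability distribution over the vocabulary."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def hard_token_embedding(probs, embedding_table, rng):
    """Discrete CoT step: sample one token id and return its embedding."""
    token_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return list(embedding_table[token_id])


def soft_token_embedding(probs, embedding_table, noise_std, rng):
    """Continuous CoT step: probability-weighted mixture of all token
    embeddings, plus Gaussian noise (an assumed form) for exploration."""
    dim = len(embedding_table[0])
    mixture = [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]
    return [m + rng.gauss(0.0, noise_std) for m in mixture]


# Toy vocabulary of 3 tokens with 2-dimensional embeddings (made-up numbers).
embedding_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = softmax([2.0, 1.0, 0.1])
rng = random.Random(0)

hard = hard_token_embedding(probs, embedding_table, rng)
soft = soft_token_embedding(probs, embedding_table, noise_std=0.1, rng=rng)
noiseless = soft_token_embedding(probs, embedding_table, noise_std=0.0, rng=rng)
```

In this sketch the hard step always returns one row of the embedding table, while the soft step returns a point between the rows. With `noise_std=0.0` the soft step is deterministic, which illustrates why some source of randomness is needed if an RL algorithm is to explore different continuous reasoning paths.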

The computational overhead of this method is minimal, which means models can learn continuous CoTs with hundreds of tokens, a significant improvement over previous limitations. The researchers tested their approach on math reasoning benchmarks using Llama and Qwen models with up to 8 billion parameters.

Key Findings and Benefits

The results are compelling:

  • Performance Match: On single-attempt accuracy (pass@1), training with continuous CoTs performed on par with training with traditional discrete tokens.

  • Enhanced Diversity: For scenarios where multiple attempts are allowed (pass@32), continuous CoT training significantly outperformed discrete CoT training. This suggests that continuous tokens enable the model to generate a wider variety of reasoning paths, leading to better overall success when given more chances.

  • Standard Deployment: One of the most practical findings is that the best performance is achieved by training with continuous CoT tokens and then using discrete tokens for inference. This means that models trained with this ‘soft’ method can be deployed using standard, existing inference techniques, making them easily adoptable by practitioners.

  • Improved Robustness: The continuous CoT RL training also proved to be gentler on the base model. It better preserved the model’s predictions on tasks outside its training domain, unlike discrete CoT training which sometimes degraded performance on these tasks. This indicates a ‘softer touch’ on the base model’s inherent capabilities.

  • Entropy Preservation: An analysis of the models’ entropy (a measure of uncertainty in token predictions) showed that soft or fuzzy training maintained a similar entropy profile to the original base models. In contrast, hard training often led to lower entropy, suggesting overconfidence and potentially less diverse reasoning.
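Two of the metrics in the list above, pass@k and entropy, can be sketched in a few lines of Python. The article does not say how the researchers computed pass@k; the function below uses the combinatorial estimator that is common in code-generation evaluation, which is an assumption here. The sample counts and the probability distributions are hypothetical and chosen only to show how the numbers behave.

```python
import math


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k attempts is correct,
    given c correct answers among n sampled attempts (requires k <= n)."""
    if n - c < k:
        return 1.0  # every size-k subset contains a correct answer
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical problem: 4 correct answers out of 64 sampled attempts.
single_attempt = pass_at_k(n=64, c=4, k=1)   # equals c / n
many_attempts = pass_at_k(n=64, c=4, k=32)

# Hypothetical next-token distributions over a 4-token vocabulary.
spread_out = entropy([0.25, 0.25, 0.25, 0.25])  # equals log(4)
peaked = entropy([0.97, 0.01, 0.01, 0.01])
```

A model that rarely succeeds on any single attempt can still score well at pass@32 if its attempts are varied enough that some of them land on a correct answer, which is why pass@32 is read here as a proxy for reasoning diversity. Likewise, a peaked distribution has lower entropy than a spread-out one, which is the sense in which lower entropy after hard training suggests overconfidence.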

This work demonstrates that continuous reasoning is not just an interesting theoretical concept but a practical and effective alternative for fine-tuning large language models. It offers a way to unlock deeper, more flexible reasoning capabilities in LLMs, paving the way for more robust and versatile AI systems. You can read the full research paper here.

Ananya Rao
https://blogs.edgentiq.com
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
