TLDR: A new research paper introduces a novel method for controllable mathematical reasoning in large language models using “self-optimizing thought vectors.” These learnable vectors dynamically modulate the AI’s internal reasoning process, guided by entropy minimization as a self-supervised reward. The approach achieves 90.1% accuracy on the GSM8K math benchmark with Gemma-2-9B, demonstrating fine-grained control over reasoning depth, length, and path without external reward annotations. This work offers a path towards more transparent and adaptable AI systems.
A new research paper introduces a method for giving large language models more precise control over their mathematical reasoning. While these models are already strong at solving math problems, understanding and directing their internal thought processes has remained a significant challenge. The new approach, called “self-optimizing thought vectors,” aims to change that by letting us influence how a model reasons internally, rather than just shaping its final output.
Understanding the Core Idea
The central concept behind this research is to view mathematical reasoning as a selection process among different computational pathways. Imagine solving a simple subtraction problem versus a multi-step word problem. An AI might activate different internal “thought vectors” for each scenario. For instance, a simple problem might trigger a “direct arithmetic” vector, while a complex one might blend “multi-step tracking” and “sequential subtraction” vectors. By introducing these learnable thought vectors, the system can guide the model towards more focused and controlled reasoning patterns.
Unlike previous methods that might add control codes to inputs or manipulate hidden states during generation, this technique directly modulates the model’s internal representations. It’s not just about changing the output format, but about influencing the actual internal thought process.
How It Works: Thought Vectors and Control
The system uses eight distinct learnable thought vectors, organized into four reasoning strategies:
- Direct Computation (t1-t2): For simple arithmetic or fact retrieval.
- Sequential Tracking (t3-t4): For multi-step calculations and running totals.
- Algebraic Reasoning (t5-t6): For variable manipulation and equation solving.
- Verification/Checking (t7-t8): For validating answers and ensuring consistency.
These vectors are initialized to be diverse. When a problem is presented, the model’s current internal state helps select and combine these thought vectors, forming a weighted representation of the active reasoning approach. A clever “gating mechanism” then determines how much influence these thought vectors have at each step, allowing the model to selectively activate thought-enhanced representations when confident, or preserve its original internal states otherwise.
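To make the selection and gating mechanics concrete, here is a minimal PyTorch sketch. The class name, dimensions, orthogonal initialization, and softmax attention over the thought vectors are illustrative assumptions; the paper’s exact implementation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThoughtVectorLayer(nn.Module):
    """Illustrative sketch: select among learnable thought vectors and
    gate their influence on the model's hidden states (assumed design)."""

    def __init__(self, hidden_dim: int, num_thoughts: int = 8):
        super().__init__()
        # Eight learnable thought vectors, initialized to be diverse
        # (orthogonal init is an assumption; the paper only says "diverse").
        self.thoughts = nn.Parameter(torch.empty(num_thoughts, hidden_dim))
        nn.init.orthogonal_(self.thoughts)
        # Gate deciding how strongly the blended thought modulates each state.
        self.gate = nn.Linear(hidden_dim, 1)

    def forward(self, hidden: torch.Tensor):
        # hidden: (batch, seq_len, hidden_dim) from the base model.
        # Score each hidden state against each thought vector.
        scores = hidden @ self.thoughts.T            # (B, S, num_thoughts)
        weights = F.softmax(scores, dim=-1)          # selection weights
        blended = weights @ self.thoughts            # weighted thought rep.
        # Gate in [0, 1]: activate thought-enhanced states when confident,
        # otherwise preserve the original hidden states.
        g = torch.sigmoid(self.gate(hidden))         # (B, S, 1)
        return g * (hidden + blended) + (1 - g) * hidden, weights
```

Returning the selection weights alongside the modulated states is convenient here, since they feed both the control signals and the entropy objective described below.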
A Three-Dimensional Control Framework
The researchers developed a control framework that operates across three dimensions:
- Depth (1-5): Controls the complexity of reasoning, from simple calculations to multi-step derivations.
- Length (2-6): Determines how verbose the solution should be.
- Path (binary): Selects between a direct computation or a step-by-step reasoning approach.
These control signals are transformed into a high-dimensional representation that then modulates the selection of the thought vectors, allowing users to guide the AI’s reasoning style.
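One plausible way to wire in these control signals is sketched below; the embedding sizes and the additive bias on the selection scores are assumptions for illustration, not the paper’s published architecture.

```python
import torch
import torch.nn as nn

class ControlEncoder(nn.Module):
    """Sketch: map (depth, length, path) controls to a bias over
    thought-vector selection. Dimensions are illustrative."""

    def __init__(self, hidden_dim: int, num_thoughts: int = 8):
        super().__init__()
        self.depth_emb = nn.Embedding(5, hidden_dim)   # depth in 1..5
        self.length_emb = nn.Embedding(5, hidden_dim)  # length in 2..6
        self.path_emb = nn.Embedding(2, hidden_dim)    # 0 = direct, 1 = step-by-step
        # Project the combined control state to a bias over thought vectors.
        self.to_bias = nn.Linear(hidden_dim, num_thoughts)

    def forward(self, depth: torch.Tensor, length: torch.Tensor,
                path: torch.Tensor) -> torch.Tensor:
        # Shift depth/length into 0-based indices for the embeddings.
        ctrl = (self.depth_emb(depth - 1)
                + self.length_emb(length - 2)
                + self.path_emb(path))
        return self.to_bias(ctrl)  # (batch, num_thoughts)
```

The resulting bias could then be added to the selection scores before the softmax in the earlier sketch, e.g. `scores = hidden @ thoughts.T + bias.unsqueeze(1)`.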
Self-Optimization Through Entropy
One of the most innovative aspects is the use of entropy minimization as a self-supervised training signal. In simple terms, entropy measures how spread out the thought-vector selection is: low entropy means the model is confidently committing to a specific reasoning strategy, while high entropy suggests uncertainty or exploration of multiple strategies. By rewarding low entropy during training, the system encourages decisive, focused reasoning patterns without needing any external human feedback or annotations.
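Expressed as code, this self-supervised signal can be an entropy penalty on the selection weights from the earlier sketch; the loss combination and coefficient shown in the comment are assumptions for illustration.

```python
import torch

def selection_entropy(weights: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """Mean Shannon entropy of the thought-selection distribution.
    weights: (batch, seq_len, num_thoughts), rows sum to 1 (post-softmax).
    Minimizing this term rewards decisive, low-entropy selections."""
    entropy = -(weights * (weights + eps).log()).sum(dim=-1)  # (B, S)
    return entropy.mean()

# Added to the task loss; lambda_ent weights the self-supervised signal
# (the coefficient and combination are assumptions, not from the paper):
# loss = task_loss + lambda_ent * selection_entropy(weights)
```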
Impressive Results on Math Problems
The method was tested on the GSM8K benchmark, a dataset of grade-school math problems, using the Gemma-2-9B language model. It achieved an encouraging 90.1% accuracy, surpassing both the base model (21.1%) and chain-of-thought prompting (89.7%) while adding the crucial capability of controllable reasoning. The analysis showed that depth control was particularly effective at modulating reasoning complexity, while path control could switch between direct and explanatory modes.
Case studies vividly demonstrate this control. For a simple problem like “Sarah has 15 cookies and eats 3. How many are left?”, a low depth and direct path control would yield “15 - 3 = 12 cookies”. However, with high depth and a step-by-step path, the output would be more elaborate: “Starting amount: 15 cookies. Sarah eats: 3 cookies. To find remaining: 15 - 3 = 12. Therefore, Sarah has 12 cookies left.” This ability to tailor the reasoning process is a significant step forward.
Why This Matters
The success of entropy as an internal optimizer suggests a new path for developing AI systems that are not only capable but also transparent and adaptable. By moving beyond “black-box” models, this research enables AI that can solve problems and adjust its reasoning based on specific user needs or contexts. This opens up exciting possibilities for applications beyond mathematics, paving the way for more interpretable and controllable AI. You can read the full paper here: Controllable Mathematical Reasoning via Self-Optimizing Thought Vectors.