A New Approach to Evaluating Reasoning in Large Language Models

TLDR: This research paper introduces “AnswerRegeneration,” a framework for evaluating large language models (LLMs) that perform reasoning. It addresses the problem that reported scores and model rankings are highly sensitive to the rule used to extract the final answer from a model’s output. AnswerRegeneration adds an inference step in which the LLM is prompted to regenerate a concise final answer based on its prior reasoning. According to the paper, this consistently improves evaluation accuracy, yields more intuitive model rankings, and is more robust across tasks, giving a more reliable and fairer assessment of LLM capabilities.

Evaluating large language models (LLMs), especially those that perform complex reasoning, is a significant challenge. For tasks like question answering, the final answer has traditionally been chosen by comparing the probabilities the model assigns to the answer choices. For LLMs that first generate a detailed reasoning process, however, the answer has to be located somewhere inside a long, free-form output. This research highlights a crucial yet often overlooked problem: the method used to extract the final answer from an LLM’s reasoning output can drastically change its perceived performance.

The paper, titled “Finding Answers in Thought Matters: Revisiting Evaluation on Large Language Models with Reasoning,” reveals that the performance and even the ranking of reasoning models are highly sensitive to the specific algorithm used for answer extraction. This means that two different evaluation setups, using the same LLM and the same questions, could report vastly different scores just because they use different rules to find the answer in the model’s output.

Consider a multiple-choice question where an LLM provides a long thought process. It might box its answer, write it in free text, or even use different formats for different questions. Traditional rule-based extraction methods, which look for specific phrases like “Answer: X” or the last capital letter, often struggle with this variability. These rules need to be custom-tuned for every model and every type of question, making evaluations difficult to reproduce and potentially biased.
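
To make the format problem concrete, here is a minimal sketch of two rule-based extractors. The function names, the regular expressions, and the A to J label range are illustrative assumptions, not the rules used in the paper. Each rule handles one output format and silently misses the others.

```python
import re
from typing import Optional


def strict_match(output: str) -> Optional[str]:
    """Return the option letter after a literal 'Answer:' cue, or None."""
    m = re.search(r"Answer:\s*\(?([A-J])\)?", output)
    return m.group(1) if m else None


def boxed_match(output: str) -> Optional[str]:
    """Return the option letter inside a LaTeX \\boxed{...}, or None."""
    m = re.search(r"\\boxed\{\(?([A-J])\)?\}", output)
    return m.group(1) if m else None


# Three made-up outputs that all mean "B" but use different formats.
outputs = [
    "The velocity doubles. Answer: B",
    "So the result is \\boxed{B}.",
    "Therefore, option B is correct.",
]

for out in outputs:
    print(repr(out), "->", strict_match(out), boxed_match(out))
```

Neither rule recovers the answer from the third output, so covering it would need yet another hand-written pattern, which is the kind of per-model tuning the article describes.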

The Problem with Current Extraction Methods

The researchers demonstrate this problem empirically by evaluating several open-source reasoning models (such as the Qwen3 family and DeepSeek-R1) with five different answer extraction methods. They found that model performance fluctuated significantly. For instance, a “strict-match” method might yield one set of rankings, while a “flexible-extract” method could produce entirely different ones. Sometimes the extraction process fails to find an answer at all, which also distorts the score.
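
The effect on a score can be sketched with a small accuracy function. The predictions below are toy values invented for illustration, not results from the paper, and treating a failed extraction (`None`) as a wrong answer is an assumption about how a harness might score it.

```python
from typing import Optional, Sequence


def accuracy(predictions: Sequence[Optional[str]], gold: Sequence[str]) -> float:
    """Fraction of correct answers; a failed extraction (None) never matches."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


gold = ["B", "D", "A", "C"]

# Toy extractions from the SAME four model outputs under two hypothetical rules.
strict_predictions = ["B", None, "A", None]   # two extraction failures
flexible_predictions = ["B", "D", "A", "A"]   # no failures, one wrong letter

print(accuracy(strict_predictions, gold))
print(accuracy(flexible_predictions, gold))
```

The model outputs are identical in both runs; only the extraction rule changed, yet the reported accuracy differs.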

An example provided in the paper shows how a single model’s output for a physics question could be interpreted differently by various extraction methods. One method might find an answer within the model’s internal thought process, while another might pick up an option text instead of the required option label, or even extract a single letter from a unit as the final answer. This “answer inconsistency” clearly shows how the choice of extraction method can introduce bias into the evaluation.
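
The following sketch reproduces this kind of inconsistency on a made-up output (not the example from the paper). The three extractors are hypothetical, and the `<think>...</think>` delimiters are an assumption about how the model marks its internal reasoning.

```python
import re
from typing import Optional


def last_capital_letter(output: str) -> Optional[str]:
    """Return the last standalone capital letter in A-J, or None."""
    matches = re.findall(r"\b([A-J])\b", output)
    return matches[-1] if matches else None


def first_parenthesized(output: str) -> Optional[str]:
    """Return the first option written as '(X)', or None."""
    m = re.search(r"\(([A-J])\)", output)
    return m.group(1) if m else None


def parenthesized_after_thinking(output: str) -> Optional[str]:
    """Drop the assumed <think>...</think> span, then take the first '(X)'."""
    visible = re.sub(r"<think>.*?</think>", "", output, flags=re.DOTALL)
    return first_parenthesized(visible)


output = (
    "<think>Maybe the answer is (C). Wait, recompute the resistance.</think> "
    "The correct choice is (B), since the current is 2 A."
)

print(last_capital_letter(output))           # picks the unit "A" (amperes)
print(first_parenthesized(output))           # picks "C" from the discarded draft
print(parenthesized_after_thinking(output))  # picks "B", the stated choice
```

One output yields three different answers, depending only on which rule the evaluation happens to use.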

Another issue is “incomplete thinking.” Sometimes, an LLM’s reasoning process might not conclude within the set token limit, or it might contain repetitions. Rule-based methods often struggle to extract a definitive answer from such incomplete or ambiguous outputs.
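
A rough way to flag such outputs before scoring is sketched below. Both checks are heuristics introduced here for illustration: the `<think>` and `</think>` tags and the repetition threshold of three are assumptions, not part of the paper's method.

```python
from collections import Counter


def thinking_is_unclosed(output: str,
                         open_tag: str = "<think>",
                         close_tag: str = "</think>") -> bool:
    """True if reasoning started but the closing tag never appeared."""
    return open_tag in output and close_tag not in output


def has_repeated_lines(output: str, min_repeats: int = 3) -> bool:
    """True if any non-empty line occurs at least `min_repeats` times."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    if not lines:
        return False
    return Counter(lines).most_common(1)[0][1] >= min_repeats


# Made-up output that hit the token limit while looping.
truncated = (
    "<think>Let x = 3.\n"
    "So x + 1 = 4.\n"
    "So x + 1 = 4.\n"
    "So x + 1 = 4.\n"
    "So x +"
)
complete = "<think>Let x = 3.</think>\nAnswer: 4"

print(thinking_is_unclosed(truncated), has_repeated_lines(truncated))
print(thinking_is_unclosed(complete), has_repeated_lines(complete))
```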

Introducing AnswerRegeneration

To address these challenges, the researchers propose a simple yet effective framework called AnswerRegeneration. Instead of relying on complex, handcrafted rules to parse a final answer out of a model’s extensive thought process, this method uses an additional inference step. After the LLM generates its reasoning, the framework feeds the original input prompt and the LLM’s previous output (the reasoning process) back to the model, followed by a short cue such as “Answer:”. This prompts the model to generate a concise final answer based on its prior reasoning.
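
The extra inference step can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `generate` is a hypothetical callable standing in for whatever inference backend is used, the cue is assumed to be appended last so the model continues from it, and the newline-joined layout and 16-token budget are arbitrary choices.

```python
from typing import Callable


def build_regeneration_prompt(original_prompt: str,
                              reasoning_output: str,
                              cue: str = "Answer:") -> str:
    """Concatenate the original prompt, the model's reasoning, and the cue."""
    return f"{original_prompt.rstrip()}\n{reasoning_output.strip()}\n{cue}"


def regenerate_answer(generate: Callable[[str, int], str],
                      original_prompt: str,
                      reasoning_output: str,
                      max_new_tokens: int = 16) -> str:
    """Run the second inference pass and return the short regenerated answer."""
    prompt = build_regeneration_prompt(original_prompt, reasoning_output)
    return generate(prompt, max_new_tokens).strip()


# Stand-in for a real model call (hypothetical), so the sketch runs offline.
def fake_generate(prompt: str, max_new_tokens: int) -> str:
    return " B\n"


question = "Which option is correct?\n(A) 1 A\n(B) 2 A\n(C) 3 A"
reasoning = "<think>V = 10 V and R = 5 ohm, so I = 2 A.</think> The current is 2 A."

regeneration_prompt = build_regeneration_prompt(question, reasoning)
final_answer = regenerate_answer(fake_generate, question, reasoning)
print(final_answer)
```

Because the second pass only has to produce a few tokens, the output is short enough that a trivial parser, or none at all, suffices.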

This approach offers several key benefits. For multiple-choice questions, it allows for more robust, probability-based answering. For open-ended questions, it simplifies the model’s output, making the final answer much easier to extract with straightforward algorithms. The framework consistently outperforms rule-based extractions, leading to improved benchmark scores and better alignment with human evaluations.
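
For the multiple-choice case, probability-based answering might look like the sketch below. It assumes the inference backend can return a log-probability for each option label as the next token after the “Answer:” cue; the `option_logprobs` values are made up for illustration.

```python
import math
from typing import Dict, Tuple


def choose_option(option_logprobs: Dict[str, float]) -> Tuple[str, Dict[str, float]]:
    """Normalize label log-probabilities and return the most likely label."""
    if not option_logprobs:
        raise ValueError("option_logprobs must not be empty")
    # Subtract the maximum before exponentiating for numerical stability.
    max_lp = max(option_logprobs.values())
    weights = {label: math.exp(lp - max_lp) for label, lp in option_logprobs.items()}
    total = sum(weights.values())
    probs = {label: w / total for label, w in weights.items()}
    best = max(probs, key=probs.get)
    return best, probs


# Hypothetical next-token log-probabilities for each option label.
option_logprobs = {"A": -2.3, "B": -0.2, "C": -3.1, "D": -4.0}

best_label, label_probs = choose_option(option_logprobs)
print(best_label)
```

Restricting the comparison to the valid labels means the result is always one of the options, so there is no extraction failure to handle.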

Interestingly, AnswerRegeneration also helps in establishing more intuitive model rankings. For example, with rule-based methods, a smaller model might appear to outperform a larger one in the same family. However, with AnswerRegeneration, the rankings often align with the conventional understanding that larger models generally perform better. This suggests that the initial counterintuitive rankings were an artifact of the extraction methods, not a true reflection of the models’ capabilities.

The framework also enhances robustness. It handles answer inconsistencies, internal self-correction within model outputs, and even questions that ask for the “incorrect” choice, which often confuse rule-based methods. Furthermore, it significantly improves performance in cases where the model’s initial reasoning process is incomplete.

The researchers applied AnswerRegeneration to diverse tasks, including complex multiple-choice questions (MMLU-Pro), short-answer math problems (GSM8K), and open-ended question answering (TriviaQA). In all cases, the generation-based method proved effective, supporting fairer and more robust model evaluations. The method does add computational cost for the extra inference step, but its simplicity and the clarity of the results it provides are significant contributions.

This research underscores that reliable evaluation of reasoning LLMs requires careful consideration of how answers are extracted. The AnswerRegeneration framework offers a promising solution to mitigate biases and inconsistencies introduced by traditional rule-based extraction methods, paving the way for more accurate and consistent assessments of LLM capabilities. You can read the full paper here: Finding Answers in Thought Matters: Revisiting Evaluation on Large Language Models with Reasoning.
