A New Approach to Evaluating Reasoning in Large Language Models

TLDR: This research paper introduces “AnswerRegeneration,” a framework for evaluating large language models (LLMs) that perform reasoning. It addresses a critical issue: reported scores and rankings are highly sensitive to how the final answer is extracted from an LLM’s output, so different extraction rules can produce inconsistent results for the same model. AnswerRegeneration adds an inference step in which the LLM is prompted to regenerate a concise final answer based on its prior reasoning. According to the paper, this method consistently improves evaluation accuracy, provides more intuitive model rankings, and enhances robustness across various tasks, offering a more reliable and fair assessment of LLM capabilities.

Evaluating large language models (LLMs), especially those that perform complex reasoning, is a significant challenge. Traditionally, for tasks like question-answering, the final answer is often chosen based on the probability of different answer choices. However, for LLMs that generate a detailed reasoning process, simply picking an answer can be tricky. This research highlights a crucial, yet often overlooked, problem: the method used to extract the final answer from an LLM’s detailed reasoning output can drastically change its perceived performance.

The paper, titled “Finding Answers in Thought Matters: Revisiting Evaluation on Large Language Models with Reasoning,” reveals that the performance and even the ranking of reasoning models are highly sensitive to the specific algorithm used for answer extraction. This means that two different evaluation setups, using the same LLM and the same questions, could report vastly different scores just because they use different rules to find the answer in the model’s output.

Consider a multiple-choice question where an LLM provides a long thought process. It might box its answer, write it in free text, or even use different formats for different questions. Traditional rule-based extraction methods, which look for specific phrases like “Answer: X” or the last capital letter, often struggle with this variability. These rules need to be custom-tuned for every model and every type of question, making evaluations difficult to reproduce and potentially biased.
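To make this brittleness concrete, here is a minimal Python sketch of two rule-based extractors of the kind described above. The function names, the regular expressions, the A to J label range, and the sample output are all illustrative assumptions, not the extraction code used in the paper.

```python
import re
from typing import Optional


def strict_match(output: str) -> Optional[str]:
    """Return the option letter only if the output contains 'Answer: X'."""
    match = re.search(r"Answer:\s*\(?([A-J])\b", output)
    return match.group(1) if match else None


def last_capital_letter(output: str) -> Optional[str]:
    """Return the last standalone capital letter A-J found anywhere in the output."""
    letters = re.findall(r"\b[A-J]\b", output)
    return letters[-1] if letters else None


# Hypothetical model output: the intended choice is (B), but the
# sentence ends with a unit (joules) written as a single capital letter.
sample_output = "Work is force times distance. That matches option (B), which is 20 J."

strict_result = strict_match(sample_output)        # no "Answer:" phrase is present
loose_result = last_capital_letter(sample_output)  # picks up the unit, not the option
print(strict_result, loose_result)
```

On this sample, the strict rule finds no answer at all, while the looser rule returns the unit letter "J" instead of the intended option "B". Neither rule is wrong in isolation; each simply encodes an assumption about formatting that this particular output does not satisfy.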

The Problem with Current Extraction Methods

The researchers empirically demonstrate this problem by evaluating several open-source reasoning models (such as the Qwen3 family and DeepSeek-R1) using five different answer extraction methods. They found that model performance fluctuated significantly. For instance, a “strict-match” method might yield one set of rankings, while a “flexible-extract” method could produce entirely different results. Sometimes the extraction process fails to find an answer at all, so the response is scored as incorrect regardless of what the model actually concluded.

An example provided in the paper shows how a single model’s output for a physics question could be interpreted differently by various extraction methods. One method might find an answer within the model’s internal thought process, while another might pick up an option text instead of the required option label, or even extract a single letter from a unit as the final answer. This “answer inconsistency” clearly shows how the choice of extraction method can introduce bias into the evaluation.

Another issue is “incomplete thinking.” Sometimes, an LLM’s reasoning process might not conclude within the set token limit, or it might contain repetitions. Rule-based methods often struggle to extract a definitive answer from such incomplete or ambiguous outputs.

Introducing AnswerRegeneration

To address these challenges, the researchers propose a simple yet effective framework called AnswerRegeneration. Instead of relying on complex, handcrafted rules to parse a final answer from a model’s extensive thought process, this method uses an additional inference step. After the LLM generates its reasoning, the framework feeds the original input prompt and the LLM’s previous output (the reasoning process) back to the model, followed by a short cue such as “Answer:”. This prompts the model to generate a concise final answer based on its prior reasoning.
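The two-pass flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: `generate` stands in for any text-completion call, `toy_generate` is a stub that fakes a model so the example runs on its own, and the newline-joined prompt layout is a guess, not the paper's exact template.

```python
from typing import Callable


def build_regeneration_prompt(original_prompt: str, reasoning: str,
                              cue: str = "Answer:") -> str:
    """Concatenate the original prompt, the model's reasoning, and the answer cue."""
    return f"{original_prompt}\n{reasoning}\n{cue}"


def answer_with_regeneration(generate: Callable[[str], str],
                             original_prompt: str) -> str:
    """Run the model twice: once to reason, once to restate a concise answer."""
    reasoning = generate(original_prompt)                    # first pass: free-form reasoning
    second_prompt = build_regeneration_prompt(original_prompt, reasoning)
    return generate(second_prompt).strip()                   # second pass: short final answer


def toy_generate(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call, used only to make the sketch runnable."""
    if prompt.rstrip().endswith("Answer:"):
        return " B"
    return "Work is force times distance, so the result is 20 J, which is option (B)."


question = "Question: How much work is done?\nOptions: (A) 10 J (B) 20 J"
final_answer = answer_with_regeneration(toy_generate, question)
print(final_answer)
```

The design point is that no parsing logic touches the long reasoning text; the only string handling left is trimming whitespace from a short second-pass output.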

This approach offers several key benefits. For multiple-choice questions, it allows for more robust, probability-based answering. For open-ended questions, it simplifies the model’s output, making the final answer much easier to extract with straightforward algorithms. The framework consistently outperforms rule-based extractions, leading to improved benchmark scores and better alignment with human evaluations.
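As a rough illustration of why a concise regenerated answer is easier to score, the Python sketch below shows both cases. It assumes the evaluation harness can already obtain a log-probability for each option label as the next token after the “Answer:” cue; the log-probability values, function names, and number-matching pattern are hypothetical and are not taken from the paper.

```python
import math
import re
from typing import Dict, Optional


def choice_probabilities(label_logprobs: Dict[str, float]) -> Dict[str, float]:
    """Normalize per-label log-probabilities into a distribution over the options."""
    top = max(label_logprobs.values())  # subtract the max for numerical stability
    weights = {label: math.exp(lp - top) for label, lp in label_logprobs.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


def pick_choice(label_logprobs: Dict[str, float]) -> str:
    """Select the option label with the highest log-probability."""
    return max(label_logprobs, key=label_logprobs.get)


def extract_number(short_answer: str) -> Optional[float]:
    """Pull the last number out of a short answer, ignoring thousands separators."""
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", short_answer)
    if not numbers:
        return None
    return float(numbers[-1].replace(",", ""))


# Hypothetical log-probabilities for the token following "Answer:".
example_logprobs = {"A": -3.2, "B": -0.4, "C": -2.1, "D": -4.0}
chosen = pick_choice(example_logprobs)
probabilities = choice_probabilities(example_logprobs)

# Hypothetical regenerated answer to a short-answer math problem.
number = extract_number("$1,250.")
print(chosen, number)
```

Because the multiple-choice decision is an argmax over a fixed label set, the extraction can never return an option text, a unit letter, or nothing at all. For short-answer tasks, a single simple pattern is enough once the output has been reduced to a brief answer.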

Interestingly, AnswerRegeneration also helps in establishing more intuitive model rankings. For example, with rule-based methods, a smaller model might appear to outperform a larger one in the same family. However, with AnswerRegeneration, the rankings often align with the conventional understanding that larger models generally perform better. This suggests that the initial counterintuitive rankings were an artifact of the extraction methods, not a true reflection of the models’ capabilities.

The framework also enhances robustness. It handles answer inconsistencies, internal self-correction within model outputs, and even questions that ask for the “incorrect” choice, which often confuse rule-based methods. Furthermore, it significantly improves performance in cases where the model’s initial reasoning process is incomplete.

The researchers applied AnswerRegeneration to diverse tasks, including complex multiple-choice questions (MMLU-Pro), short-answer math problems (GSM8K), and open-ended question answering (TriviaQA). In all cases, the generation-based method proved to be a plausible and effective approach for fairer and more robust model evaluations. While the method does involve an additional computational cost for the inference step, its simplicity and the clarity of the results it provides are significant contributions.

This research underscores that reliable evaluation of reasoning LLMs requires careful consideration of how answers are extracted. The AnswerRegeneration framework offers a promising solution to mitigate biases and inconsistencies introduced by traditional rule-based extraction methods, paving the way for more accurate and consistent assessments of LLM capabilities. You can read the full paper here: Finding Answers in Thought Matters: Revisiting Evaluation on Large Language Models with Reasoning.

Karthik Mehta
https://blogs.edgentiq.com
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
