Unlocking More Reliable AI Reasoning: A Solvability-Based Approach to Multiple-Choice Questions

TLDR: This research paper introduces ‘solvability’ as a new metric to assess if a large language model (LLM) can genuinely solve a multiple-choice question, rather than just guessing. By integrating this solvability into outcome-supervised reward models (MCQ-ORM) and reinforcement learning (MCQ-DrGRPO), the authors demonstrate significant improvements in the process-correctness of the AI’s reasoning steps and, in RL, also enhance final answer accuracy. This approach helps reduce ‘hallucinations’ and makes AI reasoning more reliable by focusing learning on genuinely solvable problems.

Large language models (LLMs) have shown remarkable abilities in complex reasoning tasks, often by generating a ‘chain of thought’ (CoT) – a series of intermediate steps leading to a final answer. However, the quality of this reasoning isn’t just about getting the right answer; it’s also about whether the steps taken are logically sound and correct. Sometimes, an LLM might arrive at the correct answer through a flawed or ‘spurious’ reasoning process, leading to what are known as false positives. This issue is particularly noticeable in multiple-choice question answering (MCQA), where models can sometimes guess correctly without truly understanding the problem.

A recent research paper, titled ‘BOOSTING PROCESS-CORRECT COT REASONING BY MODELING SOLVABILITY OF MULTIPLE-CHOICE QA,’ by Raphael Schumann and Stefan Riezler, delves into this challenge. The authors propose a novel approach: explicitly modeling the ‘solvability’ of a question for a given LLM. They argue that when a question is effectively unsolvable for a model, it’s more prone to generating these misleading, process-incorrect CoTs. By understanding and quantifying solvability, they aim to make AI reasoning more reliable and reduce ‘hallucinations’ – instances where the model generates plausible but incorrect information.

Understanding Solvability

At its core, solvability refers to the probability that a model’s true performance on a question exceeds random guessing. For multiple-choice questions, the accuracy of random guessing is simply 1 divided by the number of options (25% for a four-option question). The researchers estimate this ‘true performance’ by sampling multiple CoTs for each question and observing how many lead to the correct answer. The more correct answers observed, the higher the estimated solvability. The number of answer choices and the number of CoTs sampled both influence this estimate: more choices lower the guessing baseline, and more samples reduce the uncertainty, so either gives a clearer picture of whether a question is genuinely solvable for the model.
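To make the idea concrete, here is a minimal sketch of one way such an estimate could be computed. It is an illustration under stated assumptions, not necessarily the estimator used in the paper: it assumes a uniform Beta(1, 1) prior over the model’s true accuracy and reports the posterior probability that this accuracy exceeds the guessing rate. The function name `estimate_solvability` and its parameters are hypothetical.

```python
from math import comb


def estimate_solvability(num_correct: int, num_samples: int, num_choices: int) -> float:
    """Posterior probability that the true accuracy p exceeds random guessing.

    ASSUMPTION (not taken from the paper): a uniform Beta(1, 1) prior on p,
    so after observing the samples,
    p ~ Beta(num_correct + 1, num_samples - num_correct + 1).
    """
    if num_choices < 2:
        raise ValueError("a multiple-choice question needs at least two options")
    if not 0 <= num_correct <= num_samples:
        raise ValueError("num_correct must lie between 0 and num_samples")

    chance = 1.0 / num_choices  # accuracy of random guessing
    trials = num_samples + 1

    # For integer Beta parameters, the Beta tail probability P(p > chance)
    # equals the binomial CDF P(X <= num_correct) with X ~ Binomial(trials, chance).
    return sum(
        comb(trials, j) * chance**j * (1.0 - chance) ** (trials - j)
        for j in range(num_correct + 1)
    )
```

With no samples at all, `estimate_solvability(0, 0, 4)` falls back to the prior and returns 0.75. Under this sketch, the same count of correct CoTs yields a higher solvability when the question has more options, because the guessing baseline is lower.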

The paper empirically demonstrates a strong link between a question’s solvability and the model’s ability to generate a ‘process-correct’ CoT – one where the thought process itself is judged to be valid. If a question is deemed unsolvable, the model is highly unlikely to produce a correct thought process, even if it occasionally stumbles upon the right answer.

Improving Reasoning at Test-Time

One way to enhance reasoning is to select the best CoT from multiple generated options. Traditionally, outcome-supervised reward models (ORMs) are trained to predict if a generated answer is correct. Schumann and Riezler introduce a modification called MCQ-ORM, which incorporates the estimated solvability into the ORM’s objective. This means that CoTs generated for questions that are likely unsolvable (and thus prone to false positives) are given less weight. This adjustment helps the model prioritize and select CoTs that are not only answer-correct but also more likely to be process-correct. Experiments on various math reasoning datasets showed that MCQ-ORM consistently outperformed standard ORMs and other baselines in selecting process-correct CoTs.
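The weighting idea can be sketched as a solvability-weighted binary cross-entropy. This is only an illustration: the exact form of the MCQ-ORM objective is not given here, so the per-example multiplicative weight is an assumption, and the function name `solvability_weighted_bce` and its arguments are hypothetical.

```python
import math


def solvability_weighted_bce(scores, answer_correct, solvability):
    """Mean binary cross-entropy with each CoT's term scaled by solvability.

    scores         -- reward-model probabilities that each CoT is correct
    answer_correct -- 1 if the CoT reached the right answer, else 0
    solvability    -- estimated solvability of the question behind each CoT

    ASSUMPTION: solvability enters as a simple per-example weight.
    """
    if not (len(scores) == len(answer_correct) == len(solvability)):
        raise ValueError("all inputs must have the same length")
    if not scores:
        raise ValueError("need at least one example")

    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for p, y, s in zip(scores, answer_correct, solvability):
        p = min(max(p, eps), 1.0 - eps)
        cross_entropy = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total += s * cross_entropy  # low-solvability CoTs contribute less
    return total / len(scores)
```

When every solvability is 1, this reduces to ordinary binary cross-entropy; a CoT from a question with solvability 0 contributes nothing to the loss, so a lucky guess on an unsolvable question no longer pushes the reward model toward rating it highly.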

Reinforcement Learning with Solvability-Adjusted Advantage

The researchers also applied their solvability concept to reinforcement learning (RL), specifically by adjusting the ‘advantage’ calculation in algorithms like DrGRPO. Advantage in RL determines how much a particular action (generating a CoT) is favored. They found that traditional advantage calculations could sometimes over-emphasize correct answers from questions where the model was largely guessing, leading to noisy learning signals. To counter this, they proposed MCQ-DrGRPO, which multiplies the advantage by the question’s solvability. This effectively down-weights CoTs from unsolvable questions, focusing the learning on instances with higher ‘learning potential’ – a balance between novelty (how much the model currently struggles) and solvability (how likely it is to genuinely learn).
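A minimal sketch of the adjustment described above, assuming the DrGRPO advantage is the reward minus the mean reward of the CoTs sampled for the same question, with no standard-deviation normalization. The function name `solvability_adjusted_advantages` is hypothetical, and the solvability value is taken as given.

```python
def solvability_adjusted_advantages(rewards, solvability):
    """Scale group-relative advantages by the question's solvability.

    rewards     -- rewards of the CoTs sampled for ONE question
                   (for example, 1.0 for a correct answer and 0.0 otherwise)
    solvability -- estimated solvability of that question, in [0, 1]

    ASSUMPTION: the base advantage is reward minus the group mean.
    """
    if not rewards:
        raise ValueError("need at least one sampled CoT")
    if not 0.0 <= solvability <= 1.0:
        raise ValueError("solvability must lie in [0, 1]")

    group_mean = sum(rewards) / len(rewards)
    # Multiplying by solvability shrinks the learning signal for questions
    # the model is probably only guessing on.
    return [solvability * (reward - group_mean) for reward in rewards]
```

For a group with rewards `[1.0, 0.0, 0.0, 0.0]` and a solvability of 0.5, this returns `[0.375, -0.125, -0.125, -0.125]`, half of the unadjusted advantages. With a solvability near 0, the single correct CoT, likely a lucky guess, produces almost no update.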

In the RL experiments, the gains went beyond process correctness. MCQ-DrGRPO consistently achieved higher rewards during training and led to significant improvements in both process accuracy and answer accuracy across math and novel multimodal reasoning datasets (like geo-guessing and year-guessing from images). The analysis further revealed that MCQ-DrGRPO leads to a ‘sharper’ output distribution, meaning the model generates correct CoTs more consistently, rather than relying on diverse but potentially noisy outputs. For more technical details, refer to the full research paper.

In conclusion, this research highlights ‘solvability’ as a crucial factor in developing LLM reasoning that is more reliable and less prone to hallucination. Explicitly modeling whether a question is genuinely within a model’s grasp, and integrating that estimate into both reward models and reinforcement learning, significantly boosts the process-correctness of the model’s reasoning and leads to more trustworthy and accurate AI systems.
