Unpacking How Question Types Affect Large Language Model Performance

TLDR: A new study investigated how different question types (short answer, multiple-choice, true/false) impact the performance of five large language models on quantitative and deductive reasoning tasks. Key findings include significant performance differences across question types, a lack of consistent correlation between reasoning accuracy and final answer accuracy, and the influence of factors like the number of options and specific word choices on LLM performance. The research provides insights for improving LLM evaluation and capabilities.

Large Language Models (LLMs) are at the forefront of artificial intelligence, capable of understanding and generating human-like text. To assess their capabilities, especially in complex areas like reasoning, researchers use a variety of question formats. However, a crucial question remains: how do these different question types actually impact an LLM’s performance?

A recent study, titled “Is Large Language Model Performance on Reasoning Tasks Impacted by Different Ways Questions Are Asked?”, examines this largely unexplored question. Conducted by Seok Hwan Song, Mohna Chakraborty, Qi Li, and Wallapak Tavanapong from Iowa State University, the research shows how question design influences LLM accuracy on reasoning tasks.

Investigating LLM Performance Across Question Types

The study evaluated five LLMs: two closed-source models (GPT-4o and GPT-3.5-turbo) and three open-source models (Llama3 8B, Llama3.2 1B, and Gemma 7B). The models were tested on two distinct reasoning tasks: quantitative reasoning, using problems from the GSM8K dataset, which involve arithmetic calculations, and deductive reasoning, using problems from the bAbI dataset, which focus on logical reasoning without math.

To understand the impact of question types, the researchers converted the original problems into three formats:

Short Answer Questions (SAQs): Where LLMs generate a free-form response and explanation.
Multiple-Choice Questions (MCQs): Where LLMs select an option from a given list.
True/False Questions (TFQs): Where LLMs judge a statement as true or false.
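
To make the three formats concrete, here is a minimal Python sketch of how a single problem might be rewritten as an SAQ, an MCQ, and a TFQ. The prompt wording, the Problem class, and the helper names (to_saq, to_mcq, to_tfq) are illustrative assumptions, not the templates used in the study. Only the “Something else” option and the “The answer is X.” framing come from the study as described in this article.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

LABELS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


@dataclass
class Problem:
    question: str
    answer: str  # gold answer, stored as text


def to_saq(p: Problem) -> str:
    """Short answer: the model writes a free-form answer with an explanation."""
    return f"{p.question}\nGive your answer and explain your reasoning."


def to_mcq(p: Problem, distractors: List[str],
           gold_index: Optional[int] = 0,
           catch_all: str = "Something else") -> Tuple[str, str]:
    """Multiple choice: returns (prompt, label of the correct option).

    gold_index=None leaves the gold answer out, so the catch-all option
    becomes the correct choice. Distractors are assumed to differ from
    the gold answer.
    """
    options = list(distractors)
    if gold_index is not None:
        options.insert(gold_index, p.answer)
    options.append(catch_all)
    correct_text = p.answer if gold_index is not None else catch_all
    correct_label = LABELS[options.index(correct_text)]
    lines = [f"{LABELS[i]}. {opt}" for i, opt in enumerate(options)]
    return p.question + "\n" + "\n".join(lines), correct_label


def to_tfq(p: Problem, candidate: str) -> Tuple[str, bool]:
    """True/false: returns (prompt, whether the statement is true)."""
    prompt = f"{p.question}\nThe answer is {candidate}. True or False?"
    return prompt, candidate == p.answer


# A made-up arithmetic problem in the style of GSM8K, for illustration only.
problem = Problem("Tom has 3 apples and buys 4 more. How many apples does he have?", "7")
print(to_saq(problem))
print(to_mcq(problem, ["5", "6", "8"], gold_index=2)[0])
print(to_tfq(problem, "8")[0])
```

Passing gold_index=None to to_mcq produces the case where “Something else” is the right choice, which is one of the conditions the study looks at.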

The performance was measured using two key metrics: Final Selection (FS) accuracy, which assesses only the correctness of the final answer, and Reasoning (R) accuracy, which evaluates the correctness of the steps leading to the answer. The latter required time-consuming manual evaluation.
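
Both metrics are plain proportions over the evaluated responses. The sketch below assumes every response has already been judged twice: once on its final selection, and once on its reasoning by a human reviewer, as in the study. The JudgedResponse record and the function names are hypothetical, and the sample judgments are made up.

```python
from typing import Iterable, NamedTuple, Tuple


class JudgedResponse(NamedTuple):
    fs_correct: bool  # final selection matches the gold answer
    r_correct: bool   # reasoning steps judged correct by a human reviewer


def accuracy(flags: Iterable[bool]) -> float:
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def fs_and_r_accuracy(responses: Iterable[JudgedResponse]) -> Tuple[float, float]:
    """Return (Final Selection accuracy, Reasoning accuracy)."""
    responses = list(responses)
    return (accuracy(r.fs_correct for r in responses),
            accuracy(r.r_correct for r in responses))


# Made-up judgments, for illustration only.
sample = [
    JudgedResponse(fs_correct=True, r_correct=True),
    JudgedResponse(fs_correct=True, r_correct=False),
    JudgedResponse(fs_correct=True, r_correct=True),
    JudgedResponse(fs_correct=False, r_correct=False),
]
fs_acc, r_acc = fs_and_r_accuracy(sample)
print(f"FS accuracy: {fs_acc:.2f}, R accuracy: {r_acc:.2f}")
```

Keeping the two flags separate per response is what makes it possible to see when the two accuracies diverge.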

Key Findings and Insights

The study yielded several significant findings:

Impact of Question Types: LLM performance differs notably across SAQs, MCQs, and TFQs. For instance, on quantitative tasks, SAQ final selection accuracy was often higher than MCQ accuracy, especially when “Something else” was the correct MCQ option. TFQs, on the other hand (particularly with “False” as the correct answer), sometimes outperformed both SAQs and MCQs in final selection accuracy.
Reasoning vs. Final Answer: A crucial discovery was that reasoning accuracy does not always correlate with final selection accuracy. An LLM might get the final answer right through a lucky guess or by being close enough, even with incorrect reasoning steps. Conversely, it might perform correct reasoning but fail to select the right final option.
Factors Influencing MCQs: The number of options significantly impacted performance, especially for deductive reasoning tasks, where 5-option MCQs generally outperformed 11-option MCQs. The position of the correct answer also played a role, with models often performing best when the “Something else” option was not the correct answer. Furthermore, the choice of words, such as using “None of the above” instead of “Something else,” could influence performance, indicating LLMs’ sensitivity to subtle wording variations.
Factors Influencing TFQs: The format of the question (e.g., “Is the answer X?” vs. “The answer is X.”) and whether “True” or “False” was the correct answer influenced performance. LLMs generally performed better when “True” was the correct answer. Additionally, the choice between “True or False” and “Yes or No” as response options also showed a statistically significant impact on some models, with “True or False” generally leading to better results.
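
The TFQ factors above amount to a small grid of prompt variants: two framings crossed with two sets of response labels. A rough sketch of generating that grid is shown here. The two framings are the ones quoted above; the closing instruction line, the dictionary keys, and the tfq_variants helper are assumptions made for illustration, not the study's prompts.

```python
from itertools import product
from typing import Dict, Tuple

# Framings quoted in the article; {x} is the candidate answer.
FRAMINGS = {
    "statement": "The answer is {x}.",
    "question": "Is the answer {x}?",
}

# Response label pairs compared in the study.
RESPONSE_LABELS = {
    "true_false": ("True", "False"),
    "yes_no": ("Yes", "No"),
}


def tfq_variants(question: str, candidate: str) -> Dict[Tuple[str, str], str]:
    """Build one prompt per (framing, response labels) combination."""
    variants = {}
    for (f_name, template), (l_name, (pos, neg)) in product(
            FRAMINGS.items(), RESPONSE_LABELS.items()):
        claim = template.format(x=candidate)
        # The instruction wording below is a hypothetical choice.
        variants[(f_name, l_name)] = f"{question}\n{claim}\nRespond with {pos} or {neg}."
    return variants


variants = tfq_variants(
    "Tom has 3 apples and buys 4 more. How many apples does he have?", "7")
for key, prompt in variants.items():
    print(key, "->", prompt.splitlines()[1:])
```

Running every problem through all four variants, with the rest of the prompt held fixed, is one way to isolate the effect of wording from the effect of the problem itself.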

Patterns of Incorrect Outputs

The researchers also categorized common patterns of errors:

Correct Final Selection, Wrong Reasoning: This includes instances of guessing, being “close enough” to the correct answer, or correctly selecting “Something else” despite flawed reasoning.
Wrong Final Selection, Correct Reasoning: Here, LLMs might perform valid calculations but choose the wrong option due to issues like incorrect unit conversion, or fail to select “Something else” when appropriate.
Wrong Final Selection and Wrong Reasoning: This category covers faulty reasoning, stopping mid-reasoning, or selecting incorrect options based on flawed logic.
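
These categories follow directly from crossing the two judgments made for each response. As a minimal sketch, assuming each response comes with a pair of booleans (final selection correct, reasoning correct), the buckets can be tallied as below. The function names and sample data are hypothetical, and the finer causes listed above (guessing, unit conversion, stopping mid-reasoning) would still need a human to read the output.

```python
from collections import Counter
from typing import Iterable, Tuple


def outcome_pattern(fs_correct: bool, r_correct: bool) -> str:
    """Map a (final selection, reasoning) judgment pair to its bucket."""
    if fs_correct and r_correct:
        return "correct selection, correct reasoning"
    if fs_correct:
        return "correct selection, wrong reasoning"
    if r_correct:
        return "wrong selection, correct reasoning"
    return "wrong selection, wrong reasoning"


def tally_patterns(judgments: Iterable[Tuple[bool, bool]]) -> Counter:
    """Count how many responses fall into each bucket."""
    return Counter(outcome_pattern(fs, r) for fs, r in judgments)


# Made-up judgments, for illustration only.
judgments = [(True, True), (True, False), (False, True), (False, False), (True, False)]
for pattern, count in tally_patterns(judgments).items():
    print(f"{pattern}: {count}")
```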

Implications for AI Development

The findings highlight that simply evaluating LLMs with diverse question types isn’t enough; the specific design of these questions profoundly impacts the results. To truly improve LLM performance on reasoning tasks, it’s essential to focus on enhancing both the accuracy of their reasoning steps and their ability to correctly select final answers. This research provides valuable guidance for developing future benchmarks and refining LLM capabilities.
