TLDR: New research shows that traditional multiple-choice questions fail to accurately evaluate LLMs because models can exploit ‘discriminative shortcuts.’ The paper proposes ‘answer matching,’ in which an LLM generates a free-form response that another LLM checks against a reference answer; this approach proves more accurate, more cost-effective, and better aligned with human judgment when assessing true generative capabilities.
In the rapidly evolving field of Artificial Intelligence, particularly with the advent of large language models (LLMs), accurately evaluating their capabilities is paramount. For a long time, multiple-choice questions (MCQs) have been the go-to method for benchmarking LLMs due to their objective and easily automatable grading process. However, recent research highlights a significant flaw in this approach: LLMs can often answer MCQs correctly without truly understanding the question or demonstrating generative capabilities.
A new research paper titled “Answer Matching Outperforms Multiple Choice for Language Model Evaluation” by Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, and Jonas Geiping sheds light on this critical issue. The authors demonstrate that popular multiple-choice benchmarks can be exploited by models using “discriminative shortcuts.” This means models can pick the correct answer based on statistical patterns within the choices themselves, rather than generating a thoughtful, free-form response. For instance, a model might identify the “odd one out” or exploit patterns in choice length, achieving high accuracy without genuine comprehension.
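To make the idea concrete, here is a minimal sketch of a “choices-only” probe, one way such shortcuts can be exposed: the model is shown the answer options but not the question. The client, model name, and prompt wording below are illustrative assumptions, not the authors’ code.

```python
# Illustrative sketch (not the paper's code): probe for a "choices-only"
# shortcut by hiding the question and showing the model only the options.
# Assumes an OpenAI-compatible client; model name and prompt are hypothetical.
from openai import OpenAI

client = OpenAI()

def choices_only_probe(options: list[str], model: str = "gpt-4o-mini") -> str:
    """Ask the model to pick an answer from the options alone, with no question."""
    lettered = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    prompt = (
        "The question has been hidden. Based only on the answer choices below, "
        "guess which one is correct. Reply with a single letter.\n\n" + lettered
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

# If accuracy on such probes is well above chance, the benchmark is rewarding
# discriminative shortcuts rather than genuine understanding.
```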
The paper proposes and champions an alternative evaluation strategy called “answer matching.” This method involves giving the candidate LLM a question without any options, prompting it to generate a free-form response. Subsequently, a separate, modern language model (referred to as a “matcher”) is used to compare this generated response against a reference answer to determine if they semantically match. This approach directly assesses the generative capabilities of LLMs, which is how users primarily interact with these models in real-world applications.
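A minimal sketch of such an answer-matching pipeline might look like the following. The model names, prompt wording, and Yes/No protocol here are assumptions for illustration; the paper’s exact prompts and matcher models may differ.

```python
# Illustrative answer-matching sketch (not the authors' exact prompts or code):
# a candidate model answers a question free-form, then a separate "matcher"
# model judges whether that answer semantically matches the reference answer.
from openai import OpenAI

client = OpenAI()

def generate_answer(question: str, model: str = "gpt-4o-mini") -> str:
    """Candidate model answers the question with no options shown."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

def answers_match(question: str, reference: str, candidate: str,
                  matcher: str = "gpt-4o-mini") -> bool:
    """Matcher model checks the candidate answer against the reference."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n\n"
        "Does the candidate answer convey the same final answer as the reference? "
        "Reply with exactly 'Yes' or 'No'."
    )
    verdict = client.chat.completions.create(
        model=matcher,
        messages=[{"role": "user", "content": prompt}],
    )
    return verdict.choices[0].message.content.strip().lower().startswith("yes")

# Usage: correct = answers_match(q, reference, generate_answer(q))
```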
To validate their findings, the researchers meticulously annotated human grading data for popular benchmarks like MMLU-Pro and GPQA-Diamond. Their results are striking: answer matching, even when utilizing relatively small and recent language models as matchers, achieves near-perfect agreement with human grading, often comparable to the agreement between two human annotators. In stark contrast, traditional multiple-choice evaluations and even LLM-as-a-judge methods (where an LLM scores a response without a reference answer) align poorly with human judgment.
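For readers wondering how such agreement can be quantified, a chance-corrected statistic like Cohen’s kappa is one common choice. The snippet below is purely illustrative with made-up labels; the paper’s own alignment metric and data may differ.

```python
# Illustrative only: quantify agreement between matcher verdicts and human
# grades with Cohen's kappa, a standard chance-corrected agreement statistic.
# The labels below are made up for demonstration.
from sklearn.metrics import cohen_kappa_score

human_grades   = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # 1 = correct, 0 = incorrect
matcher_grades = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]  # verdicts from the matcher LLM

kappa = cohen_kappa_score(human_grades, matcher_grades)
print(f"Chance-corrected agreement with human grading: {kappa:.2f}")
```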
The implications of this research are profound for the benchmarking ecosystem. The study reveals that model rankings can change significantly when evaluated using answer matching instead of MCQs. Models that appear to be “saturated” or performing exceptionally well on multiple-choice benchmarks often show considerable room for improvement when their true generative abilities are tested. Furthermore, the paper addresses concerns about cost, demonstrating that answer matching can be as, or even more, cost-effective than multiple-choice evaluations, primarily because models tend to generate shorter responses when not presented with choices.
This work suggests a pivotal shift in how we evaluate language models. By moving away from potentially misleading multiple-choice formats towards more robust generative evaluations via answer matching, the AI community can gain a more accurate understanding of LLM capabilities and foster the development of truly intelligent systems. The full research paper can be accessed here.


