The Coreference Conundrum: Why LLMs Struggle to Balance Accuracy and Ambiguity Detection

TLDR: A new research paper, “CORRECT-DETECT,” reveals a fundamental trade-off in Large Language Models (LLMs): they struggle to simultaneously achieve high accuracy in coreference resolution for unambiguous sentences and reliably detect ambiguity in ambiguous ones. While humans naturally adjust their confidence based on context, LLMs show minimal shifts, often prioritizing confident answers over acknowledging uncertainty. The study, using the AmbiCoref dataset and evaluating GPT-4o and Llama 3.1, demonstrates that prompting can boost one capability but at the cost of the other, suggesting current training incentives may be a root cause.

Large Language Models (LLMs) are designed to mimic human language abilities, but a new study reveals a significant challenge: balancing accurate coreference resolution with the ability to detect ambiguity. Humans naturally use broad context to resolve linguistic ambiguities, even in short text snippets. LLMs, however, operate with a “contextual deficit,” lacking the social and physical context that humans rely on for nuanced understanding.

Coreference resolution, a fundamental task in language understanding, involves determining which text spans refer to the same entity. For example, a reader must identify that “she” refers to “Anna” in the sentence “Anna looked out the window. She saw that the rain had stopped.” This capability is crucial for many higher-level tasks like summarization and question answering. However, when a sentence is genuinely ambiguous, as in “Anna told Susan to look out the window. She saw that the rain had stopped,” where “she” could refer to either Anna or Susan, humans can recognize the uncertainty, while LLMs often struggle.

The research paper, titled “CORRECT-DETECT: Balancing Performance and Ambiguity Through the Lens of Coreference Resolution in LLMs,” explores how well LLMs perform in both resolving coreference in clear cases and recognizing when ambiguity truly exists. The authors, Amber Shore, Russell Scheinberg, and Ameeta Agrawal from Portland State University, and So Young Lee from Miami University, investigated whether LLMs can achieve both goals simultaneously, a foundational aspect of human-like language processing.

To conduct their study, the researchers utilized the AmbiCoref dataset, which contains both unambiguous and ambiguous sentences. Each sentence features two people and a pronoun, with an accompanying question to resolve the coreference. The dataset categorizes sentences based on semantic properties related to the verb, such as Experiencer Constraint for Objects (ECO), Experiencer Constraint for Subjects (ECS), Implicit Causality (IC), and Transfer of Possession (TOP). This allowed for a detailed comparison of LLM behavior against human judgments.

The study evaluated two prominent LLMs: GPT-4o (gpt-4o-2024-08-06) and Llama 3.1 70B (Llama3.1-70b-Instruct). Initial experiments, using a “REFLECT” prompt designed to mirror human annotation instructions, revealed interesting divergences. While GPT-4o showed better alignment with human decisions in unambiguous sentences, Llama 3.1 frequently labeled unambiguous sentences as “mostly ambiguous.” Crucially, human annotators demonstrated a significant shift in their answer patterns between unambiguous and ambiguous sentences, indicating sensitivity to the lack of semantic constraints. LLMs, however, showed only a minimal shift, suggesting they are less responsive to genuine ambiguity.

Furthermore, LLMs exhibited higher answer consistency across multiple runs than humans did, even in ambiguous scenarios. This implies that models are more stable in their (sometimes incorrect) judgments, whereas human responses naturally show more variation when faced with true ambiguity. In terms of accuracy on unambiguous sentences, GPT-4o performed better than Llama 3.1, but both fell below human strict accuracy. When “near-correctness” was considered (allowing for some uncertainty in the choice), GPT-4o’s performance improved significantly, even surpassing human near-correctness.
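To make the distinction between strict accuracy, near-correctness, and run-to-run consistency concrete, here is a minimal Python sketch. It is an illustration only, not the paper's evaluation code: the five-way answer labels ("definitely A", "mostly A", "ambiguous", "mostly B", "definitely B"), the function names, and the definition of consistency as the share of runs agreeing with the most common answer are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical answer scale (an assumption, not taken from the paper):
# "definitely A", "mostly A", "ambiguous", "mostly B", "definitely B"

def strict_accuracy(responses, golds):
    """Share of answers that pick the gold referent with full certainty."""
    hits = sum(r == f"definitely {g}" for r, g in zip(responses, golds))
    return hits / len(golds)

def near_accuracy(responses, golds):
    """Share of answers that lean toward the gold referent, certain or not."""
    hits = sum(r in (f"definitely {g}", f"mostly {g}")
               for r, g in zip(responses, golds))
    return hits / len(golds)

def consistency(runs):
    """Share of repeated runs that agree with the most common answer."""
    most_common_count = Counter(runs).most_common(1)[0][1]
    return most_common_count / len(runs)

# Toy data: four unambiguous sentences whose gold referent is person A.
responses = ["definitely A", "mostly A", "ambiguous", "mostly B"]
golds = ["A", "A", "A", "A"]

strict = strict_accuracy(responses, golds)   # 1 of 4 answers
near = near_accuracy(responses, golds)       # 2 of 4 answers
stable = consistency(["mostly A", "mostly A", "ambiguous"])  # 2 of 3 runs
```

Under these assumed definitions, a model that often answers "mostly A" instead of "definitely A" gains a lot when scoring moves from strict to near-correct, which is the kind of jump the article reports for GPT-4o.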

An interesting observation was the models’ tendency to provide unprompted explanations. For GPT-4o, these explanations were inversely correlated with performance, meaning that when the model explained its choice, it was more likely to be incorrect. The study also highlighted persistent gender bias, with both LLMs showing lower accuracy in sentences containing female pronouns and names compared to male ones, a bias not observed in human responses.

The core finding of the paper is the “CORRECT-DETECT” trade-off. The researchers found that LLMs struggle to simultaneously achieve high accuracy in resolving coreference in unambiguous sentences (Correct-Unamb) and reliably detect ambiguity in ambiguous sentences (Detect-Ambig). When models are prompted to prioritize one goal, performance on the other often suffers. For instance, GPT-4o achieved a 99.55% ambiguity detection rate with the “Ambi-Wait” prompt, but its accuracy on unambiguous cases plummeted to 5.23%. Conversely, prompts that simply demanded a coreference resolution led to high accuracy but very low ambiguity detection.
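The two quantities in this trade-off can be sketched as a pair of rates computed over different subsets of the data. The Python below is a hedged illustration, not the authors' implementation: the `Item` record, its field names, the literal answer string "ambiguous", and the toy examples are hypothetical and introduced only for this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Item:
    # All fields are illustrative assumptions, not the dataset's schema.
    is_ambiguous: bool             # gold label for the sentence
    gold_referent: Optional[str]   # correct referent, None if ambiguous
    response: str                  # model answer: a name or "ambiguous"

def correct_unamb(items: List[Item]) -> float:
    """Accuracy on unambiguous sentences only (Correct-Unamb)."""
    unamb = [it for it in items if not it.is_ambiguous]
    if not unamb:
        return 0.0
    return sum(it.response == it.gold_referent for it in unamb) / len(unamb)

def detect_ambig(items: List[Item]) -> float:
    """Share of ambiguous sentences flagged as ambiguous (Detect-Ambig)."""
    amb = [it for it in items if it.is_ambiguous]
    if not amb:
        return 0.0
    return sum(it.response == "ambiguous" for it in amb) / len(amb)

# Toy responses from a hypothetical model under a single prompt.
items = [
    Item(False, "Anna", "Anna"),        # resolved correctly
    Item(False, "Anna", "ambiguous"),   # over-cautious on a clear case
    Item(True, None, "ambiguous"),      # ambiguity detected
    Item(True, None, "Susan"),          # confident answer on an unclear case
]

cu = correct_unamb(items)
da = detect_ambig(items)
```

Because the two rates are computed on disjoint subsets, a prompt that pushes the model to answer "ambiguous" everywhere would drive the second rate toward 100% and the first toward 0%, which mirrors the pattern reported for the “Ambi-Wait” prompt.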

This trade-off suggests that current LLMs are incentivized to provide a confident answer rather than acknowledge uncertainty, a problem that resonates with analyses of factual hallucinations in LLMs. The authors hypothesize that if training incentives could be adjusted to reward appropriate “I don’t know” responses, models might overcome this linguistic ambiguity challenge and achieve more human-like performance. The study also included additional experiments on the AmbiEnt dataset, which targets ambiguity in entailment. There, models generally failed to acknowledge ambiguity at all, which underscores how difficult it is to elicit this trade-off in other settings.

In conclusion, while LLMs can achieve high scores in either coreference resolution or ambiguity detection depending on the prompt, they currently cannot do both well at the same time. This research highlights a fundamental limitation in how LLMs process linguistic ambiguity and points towards future work needed to bridge the gap between AI and human-like language understanding.
