The Coreference Conundrum: Why LLMs Struggle to Balance Accuracy and Ambiguity Detection

TLDR: A new research paper, “CORRECT-DETECT,” reveals a fundamental trade-off in Large Language Models (LLMs): they struggle to simultaneously achieve high accuracy in coreference resolution for unambiguous sentences and reliably detect ambiguity in ambiguous ones. While humans naturally adjust their confidence based on context, LLMs show minimal shifts, often prioritizing confident answers over acknowledging uncertainty. The study, using the AmbiCoref dataset and evaluating GPT-4o and Llama 3.1, demonstrates that prompting can boost one capability but at the cost of the other, suggesting current training incentives may be a root cause.

Large Language Models (LLMs) are designed to mimic human language abilities, but a new study reveals a significant challenge: balancing accurate coreference resolution with the ability to detect ambiguity. Humans naturally use broad context to resolve linguistic ambiguities, even in short text snippets. LLMs, however, operate with a “contextual deficit,” lacking the social and physical context that humans rely on for nuanced understanding.

Coreference resolution, a fundamental task in language understanding, involves determining which text spans refer to the same entity. For example, it means identifying that “she” refers to “Anna” in “Anna looked out the window. She saw that the rain had stopped.” This capability is crucial for many higher-level tasks like summarization and question answering. However, when a sentence is genuinely ambiguous, as in “Anna told Susan to look out the window. She saw that the rain had stopped,” humans can recognize the uncertainty, while LLMs often struggle to do so.

The research paper, titled “CORRECT-DETECT: Balancing Performance and Ambiguity Through the Lens of Coreference Resolution in LLMs,” explores how well LLMs perform in both resolving coreference in clear cases and recognizing when ambiguity truly exists. The authors, Amber Shore, Russell Scheinberg, and Ameeta Agrawal from Portland State University, and So Young Lee from Miami University, investigated whether LLMs can achieve both goals simultaneously, a foundational aspect of human-like language processing.

To conduct their study, the researchers utilized the AmbiCoref dataset, which contains both unambiguous and ambiguous sentences. Each sentence features two people and a pronoun, with an accompanying question to resolve the coreference. The dataset categorizes sentences based on semantic properties related to the verb, such as Experiencer Constraint for Objects (ECO), Experiencer Constraint for Subjects (ECS), Implicit Causality (IC), and Transfer of Possession (TOP). This allowed for a detailed comparison of LLM behavior against human judgments.

The study evaluated two prominent LLMs: GPT-4o (gpt-4o-2024-08-06) and Llama 3.1 70B (Llama3.1-70b-Instruct). Initial experiments, using a “REFLECT” prompt designed to mirror human annotation instructions, revealed interesting divergences. While GPT-4o showed better alignment with human decisions in unambiguous sentences, Llama 3.1 frequently labeled unambiguous sentences as “mostly ambiguous.” Crucially, human annotators demonstrated a significant shift in their answer patterns between unambiguous and ambiguous sentences, indicating sensitivity to the lack of semantic constraints. LLMs, however, showed only a minimal shift, suggesting they are less responsive to genuine ambiguity.

Furthermore, LLMs exhibited higher answer consistency across multiple runs compared to humans, even in ambiguous scenarios. This implies that models are more stable in their (sometimes incorrect) judgments, whereas human responses naturally show more variation when faced with true ambiguity. In terms of accuracy on unambiguous sentences, GPT-4o performed better than Llama 3.1, but both were below human strict accuracy. When considering “near-correctness” (allowing for some uncertainty in the choice), GPT-4o’s performance improved significantly, even surpassing human near-correctness.
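The difference between strict accuracy and “near-correctness” can be illustrated with a small scoring sketch. The answer labels below (definitely/probably for each of the two people, plus an ambiguous option) are hypothetical stand-ins chosen for illustration; the paper's actual response options and scoring rules may differ.

```python
from dataclasses import dataclass

# Hypothetical answer labels for illustration only; the study's real
# response options are not reproduced here.
LABELS = {"definitely_a", "probably_a", "ambiguous", "probably_b", "definitely_b"}


@dataclass
class Response:
    gold: str    # "a" or "b": the intended referent of an unambiguous sentence
    answer: str  # one of LABELS


def strict_accuracy(responses):
    """Count an answer as correct only if it confidently picks the gold referent."""
    if not responses:
        return 0.0
    hits = sum(r.answer == f"definitely_{r.gold}" for r in responses)
    return hits / len(responses)


def near_accuracy(responses):
    """Also accept a hedged ("probably") answer that points to the gold referent."""
    if not responses:
        return 0.0
    accepted = lambda r: {f"definitely_{r.gold}", f"probably_{r.gold}"}
    hits = sum(r.answer in accepted(r) for r in responses)
    return hits / len(responses)


sample = [
    Response(gold="a", answer="definitely_a"),  # strictly correct
    Response(gold="a", answer="probably_a"),    # near-correct only
    Response(gold="b", answer="ambiguous"),     # wrong under both measures
    Response(gold="b", answer="definitely_a"),  # wrong under both measures
]
print(strict_accuracy(sample), near_accuracy(sample))
```

Under this assumed scheme, a model that hedges with “probably” on clear sentences is penalized by the strict measure but not by the near-correct one, which is one way a near-correctness score can rise well above a strict score.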

An interesting observation was the models’ tendency to provide unprompted explanations. For GPT-4o, these explanations were inversely correlated with performance, meaning that when the model explained its choice, it was more likely to be incorrect. The study also highlighted persistent gender bias, with both LLMs showing lower accuracy in sentences containing female pronouns and names compared to male ones, a bias not observed in human responses.

The core finding of the paper is the “CORRECT-DETECT” trade-off. The researchers found that LLMs struggle to simultaneously resolve coreference accurately in unambiguous sentences (Correct-Unamb) and reliably detect ambiguity in ambiguous sentences (Detect-Ambig). When models are prompted to prioritize one goal, performance on the other often suffers. For instance, GPT-4o achieved a 99.55% ambiguity detection rate with the “Ambi-Wait” prompt, but its accuracy on unambiguous cases plummeted to 5.23%. Conversely, prompts that simply demanded a coreference resolution led to high accuracy but very low ambiguity detection.
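To make the two sides of the trade-off concrete, here is a minimal sketch of how the two rates could be computed from a set of model answers. The record layout, the field names, and the three answer values ("a", "b", "ambiguous") are assumptions made for this illustration, not the paper's evaluation code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    is_ambiguous: bool   # whether the sentence is ambiguous by design
    gold: Optional[str]  # "a" or "b" for unambiguous items, None otherwise
    answer: str          # hypothetical model answer: "a", "b", or "ambiguous"


def correct_unamb(items):
    """Share of unambiguous items where the model picked the gold referent."""
    unamb = [it for it in items if not it.is_ambiguous]
    if not unamb:
        return 0.0
    return sum(it.answer == it.gold for it in unamb) / len(unamb)


def detect_ambig(items):
    """Share of ambiguous items where the model flagged the ambiguity."""
    ambig = [it for it in items if it.is_ambiguous]
    if not ambig:
        return 0.0
    return sum(it.answer == "ambiguous" for it in ambig) / len(ambig)


items = [
    Item(is_ambiguous=False, gold="a", answer="a"),          # resolved correctly
    Item(is_ambiguous=False, gold="b", answer="ambiguous"),  # over-cautious
    Item(is_ambiguous=True, gold=None, answer="ambiguous"),  # ambiguity detected
    Item(is_ambiguous=True, gold=None, answer="a"),          # over-confident
]
print(correct_unamb(items), detect_ambig(items))
```

In this toy setup, a model that answers "ambiguous" everywhere would score 1.0 on the second measure and 0.0 on the first, while a model that always commits to a person would do the opposite on the ambiguous half. That is the tension the reported figures reflect.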

This trade-off suggests that current LLMs are incentivized to provide a confident answer rather than acknowledge uncertainty, a problem that echoes analyses of factual hallucinations in LLMs. The authors hypothesize that if training incentives were adjusted to reward appropriate “I don’t know” responses, models might overcome this linguistic ambiguity challenge and achieve more human-like performance. The study also included additional experiments on the AmbiEnt dataset for ambiguity in entailment, where models generally failed to acknowledge ambiguity at all, which underscores how difficult it is to even surface this trade-off in other contexts.

In conclusion, while LLMs can achieve high scores in either coreference resolution or ambiguity detection depending on the prompt, they currently cannot do both well at the same time. This research highlights a fundamental limitation in how LLMs process linguistic ambiguity and points towards future work needed to bridge the gap between AI and human-like language understanding. You can read the full research paper here.

Ananya Rao
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
