TLDR: The paper introduces CorefInst, the first multilingual coreference resolution (CR) method that uses instruction-tuned, decoder-only Large Language Models (LLMs) to handle both explicit and implicit (zero) mentions. It evaluates Llama 3.1, Gemma 2, and Mistral 0.3 with five instruction sets. The best model, a fine-tuned Llama 3.1, significantly outperforms state-of-the-art task-specific architectures like CorPipe24 on the CorefUD v1.2 dataset, especially in resolving zero mentions in pro-drop languages.
Coreference Resolution (CR) is a fundamental task in natural language understanding, where the goal is to identify and group all mentions in a text that refer to the same real-world entity. For example, in the sentence “John went to the store. He bought milk.”, both “John” and “He” refer to the same person. This task is vital for many advanced NLP applications, such as machine translation, text summarization, and question answering, as it helps systems grasp the coherence and context of a document.
Traditionally, CR systems have relied on task-specific architectures and encoder-based language models. While effective, these models often demand extensive training and large amounts of data, and they lack flexibility across new datasets, languages, or domains. They also struggle to take direct advantage of the rapidly evolving decoder-only Large Language Models (LLMs) that have become prominent recently.
Introducing CorefInst: A New Approach to Multilingual Coreference Resolution
A recent research paper, CorefInst: Leveraging LLMs for Multilingual Coreference Resolution, introduces a groundbreaking methodology that addresses these limitations. Authored by Tuğba Pamay Arslan, Emircan Erol, and Gülşen Eryiğit from the İTÜ NLP Research Group, this study presents the first multilingual CR approach that leverages decoder-only LLMs to handle both overt (explicitly stated) and zero (implicitly dropped) mentions. Zero mentions, common in 'pro-drop' languages such as Turkish or Czech, are particularly challenging because they are not explicitly present in the sentence.
How CorefInst Works
The CorefInst methodology models the CR task for LLMs through a novel instruction-tuning paradigm. Instead of building complex, task-specific neural architectures, the researchers designed five different instruction sets to guide LLMs in resolving coreferences. These instructions provide the LLM with necessary information about the task, input/output formats, and specific constraints, effectively teaching the model how to act as a coreference resolver.
The process involves:
- Instruction Engineering: Crafting detailed instructions that define the task, explain input/output formats, and even incorporate insights from observed model errors (e.g., using terms like ‘coherent’ alongside ‘coreferential’ to prevent over-merging clusters).
- Data Processing: Converting raw text into a format suitable for LLMs, where mentions are marked with special tags (e.g., <m>…</m> for overt, </z> for zero mentions) and placeholders (MASK) are used for the LLM to fill in cluster numbers. The text is broken into ‘frames’ to manage long documents, and a post-processing step merges clusters across these frames to maintain document-level coherence.
- Controlled Inference Method: A unique inference strategy where the LLM predicts cluster numbers sequentially for each MASK token, leveraging its previous decisions to inform current ones. This method significantly reduces computational load and ensures consistency in output format.
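The frame formatting and sequential MASK-filling described above can be sketched roughly as follows. The tag names, the `[MASK]` placeholder syntax, and the toy predictor are illustrative assumptions rather than the authors' exact implementation; in the real system, each slot is filled by a constrained LLM decode rather than a fixed stub:

```python
# Illustrative sketch of CorefInst-style input formatting and controlled
# inference. Tag names, the [MASK] token, and predict_cluster() are
# assumptions for illustration, not the paper's exact code.

def mark_mentions(sentence, mentions):
    """Wrap each overt mention in <m>...</m> followed by a MASK slot
    where the model will write a cluster number."""
    out = sentence
    for m in mentions:
        out = out.replace(m, f"<m>{m}</m>[MASK]", 1)
    return out

def controlled_inference(marked_text, predict_cluster):
    """Fill MASK slots one at a time, left to right, so each prediction
    can condition on the cluster numbers already written."""
    while "[MASK]" in marked_text:
        cluster_id = predict_cluster(marked_text)  # one model call per slot
        marked_text = marked_text.replace("[MASK]", f"({cluster_id})", 1)
    return marked_text

def toy_predict_cluster(text):
    # Stand-in for a constrained LLM decode; always answers cluster 1,
    # since both mentions in the demo sentence corefer.
    return 1

text = mark_mentions("John went to the store. He bought milk.", ["John", "He"])
resolved = controlled_inference(text, toy_predict_cluster)
print(resolved)
# → <m>John</m>(1) went to the store. <m>He</m>(1) bought milk.
```

Because earlier cluster decisions are written back into the prompt before the next slot is filled, later predictions stay consistent with them, which is what keeps the output format stable across a frame.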
Key Findings and Performance
The study evaluated CorefInst across three state-of-the-art decoder-only LLMs: Llama 3.1, Gemma 2, and Mistral 0.3, using the multilingual CorefUD v1.2 dataset. Initial experiments on an English subset showed that Llama 3.1, particularly with Instruction Set #5, achieved the highest performance.
When fully fine-tuned, the best model (Llama 3.1 with Instruction Set #5, referred to as CorefInst_full) demonstrated remarkable results:
- It outperformed its few-shot trained counterpart by an average of 12 percentage points across all languages on gold mentions.
- Crucially, CorefInst_full surpassed CorPipe24, the leading state-of-the-art task-specific multilingual CR model, by an average of 1.8 percentage points in end-to-end evaluation on predicted overt and zero mentions. This improvement was statistically significant.
- The model showed exceptional capability in resolving zero mentions in pro-drop languages, improving average performance by 8.4 percentage points compared to CorPipe24. This highlights the LLMs’ deeper semantic understanding of text, which is essential for identifying implicit references.
Impact and Future Directions
The CorefInst framework represents a significant step forward for multilingual coreference resolution. By demonstrating that instruction-tuned LLMs can outperform specialized architectures, this research paves the way for more flexible, scalable, and high-performing CR systems. It suggests that general-purpose LLMs, when properly fine-tuned, can effectively tackle complex linguistic tasks that previously required bespoke models.
The study also acknowledges limitations, such as the reliance on inter-frame relations for document-level coreference chains and the computational cost of fine-tuning large models. Future work will explore larger context windows, architectural comparisons, and integration with newer LLMs to further advance the field.


