TLDR: The paper introduces CorefInst, the first multilingual coreference resolution (CR) method that uses instruction-tuned, decoder-only Large Language Models (LLMs) to handle both explicit and implicit (zero) mentions. It evaluates Llama 3.1, Gemma 2, and Mistral 0.3 with five instruction sets. The best model, a fine-tuned Llama 3.1, significantly outperforms state-of-the-art task-specific architectures like CorPipe24 on the CorefUD v1.2 dataset, especially in resolving zero mentions in pro-drop languages.
Coreference Resolution (CR) is a fundamental task in natural language understanding, where the goal is to identify and group all mentions in a text that refer to the same real-world entity. For example, in the sentence “John went to the store. He bought milk.”, both “John” and “He” refer to the same person. This task is vital for many advanced NLP applications, such as machine translation, text summarization, and question answering, as it helps systems grasp the coherence and context of a document.
Traditionally, CR systems have relied on task-specific architectures and encoder-based language models. While effective, these models often demand extensive training and large amounts of data, and they lack flexibility across new datasets, languages, or domains. They also struggle to take direct advantage of the rapidly evolving decoder-only Large Language Models (LLMs) that have become prominent recently.
Introducing CorefInst: A New Approach to Multilingual Coreference Resolution
A recent research paper, CorefInst: Leveraging LLMs for Multilingual Coreference Resolution, introduces a groundbreaking methodology that addresses these limitations. Authored by Tuğba Pamay Arslan, Emircan Erol, and Gülşen Eryiğit from the İTÜ NLP Research Group, this study presents the first multilingual CR approach that leverages decoder-only LLMs to handle both overt (explicitly stated) and zero (implicitly dropped) mentions. Zero mentions, common in 'pro-drop' languages such as Turkish or Czech, are particularly challenging because they are not explicitly present in the sentence.
How CorefInst Works
The CorefInst methodology models the CR task for LLMs through a novel instruction-tuning paradigm. Instead of building complex, task-specific neural architectures, the researchers designed five different instruction sets to guide LLMs in resolving coreferences. These instructions provide the LLM with necessary information about the task, input/output formats, and specific constraints, effectively teaching the model how to act as a coreference resolver.
The process involves:
- Instruction Engineering: Crafting detailed instructions that define the task, explain input/output formats, and even incorporate insights from observed model errors (e.g., using terms like ‘coherent’ alongside ‘coreferential’ to prevent over-merging clusters).
- Data Processing: Converting raw text into a format suitable for LLMs, where mentions are marked with special tags (e.g., <m>…</m> for overt, </z> for zero mentions) and placeholders (MASK) are used for the LLM to fill in cluster numbers. The text is broken into ‘frames’ to manage long documents, and a post-processing step merges clusters across these frames to maintain document-level coherence.
- Controlled Inference Method: A unique inference strategy where the LLM predicts cluster numbers sequentially for each MASK token, leveraging its previous decisions to inform current ones. This method significantly reduces computational load and ensures consistency in output format.
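The frame formatting and sequential MASK-filling described above can be sketched roughly as follows. The tag names, the `[MASK]` placeholder syntax, and the toy predictor are illustrative assumptions rather than the authors' exact implementation; in the real system, each slot is filled by a constrained LLM decode rather than a fixed stub:

```python
# Illustrative sketch of CorefInst-style input formatting and controlled
# inference. Tag names, the [MASK] token, and predict_cluster() are
# assumptions for illustration, not the paper's exact code.

def mark_mentions(sentence, mentions):
    """Wrap each overt mention in <m>...</m> followed by a MASK slot
    where the model will write a cluster number."""
    out = sentence
    for m in mentions:
        out = out.replace(m, f"<m>{m}</m>[MASK]", 1)
    return out

def controlled_inference(marked_text, predict_cluster):
    """Fill MASK slots one at a time, left to right, so each prediction
    can condition on the cluster numbers already written."""
    while "[MASK]" in marked_text:
        cluster_id = predict_cluster(marked_text)  # one model call per slot
        marked_text = marked_text.replace("[MASK]", f"({cluster_id})", 1)
    return marked_text

def toy_predict_cluster(text):
    # Stand-in for a constrained LLM decode; always answers cluster 1,
    # since both mentions in the demo sentence corefer.
    return 1

text = mark_mentions("John went to the store. He bought milk.", ["John", "He"])
resolved = controlled_inference(text, toy_predict_cluster)
print(resolved)
# → <m>John</m>(1) went to the store. <m>He</m>(1) bought milk.
```

Because earlier cluster decisions are written back into the prompt before the next slot is filled, later predictions stay consistent with them, which is what keeps the output format stable across a frame.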
Key Findings and Performance
The study evaluated CorefInst across three state-of-the-art decoder-only LLMs: Llama 3.1, Gemma 2, and Mistral 0.3, using the multilingual CorefUD v1.2 dataset. Initial experiments on an English subset showed that Llama 3.1, particularly with Instruction Set #5, achieved the highest performance.
When fully fine-tuned, the best model (Llama 3.1 with Instruction Set #5, referred to as CorefInst_full) demonstrated remarkable results:
- It outperformed its few-shot trained counterpart by an average of 12 percentage points across all languages on gold mentions.
- Crucially, CorefInst_full surpassed CorPipe24, the leading state-of-the-art task-specific multilingual CR model, by an average of 1.8 percentage points in end-to-end evaluation on predicted overt and zero mentions. This improvement was statistically significant.
- The model showed exceptional capability in resolving zero mentions in pro-drop languages, improving average performance by 8.4 percentage points compared to CorPipe24. This highlights the LLMs’ deeper semantic understanding of text, which is essential for identifying implicit references.
Impact and Future Directions
The CorefInst framework represents a significant step forward for multilingual coreference resolution. By demonstrating that instruction-tuned LLMs can outperform specialized architectures, this research paves the way for more flexible, scalable, and high-performing CR systems. It suggests that general-purpose LLMs, when properly fine-tuned, can effectively tackle complex linguistic tasks that previously required bespoke models.
The study also acknowledges limitations, such as the reliance on inter-frame relations for document-level coreference chains and the computational cost of fine-tuning large models. Future work will explore larger context windows, architectural comparisons, and integration with newer LLMs to further advance the field.


