TL;DR: This research introduces BOOK COREF, the first book-scale coreference resolution benchmark, addressing a key limitation of existing datasets, which cover only short- and medium-length texts. It details an automatic pipeline that produces high-quality annotations focused on book characters. Experiments show that while current systems improve when trained on BOOK COREF, they still fall well short of their segment-level performance when evaluated on full books, highlighting new research directions for long-document coreference resolution.
Coreference Resolution (CR) is a fundamental task in Natural Language Processing (NLP) that involves identifying and grouping mentions in a text that refer to the same real-world entity. For instance, in a sentence like “John went to the store. He bought milk,” both “John” and “He” refer to the same person. While CR systems have made significant strides, their evaluation has traditionally been limited to small- to medium-sized documents, such as news articles or short stories.
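Concretely, a CR system's output can be represented as clusters of mention spans, with each cluster grouping every span that refers to one entity. A minimal illustration of this data structure (the tokenization and helper name here are just for exposition, not from the paper):

```python
# Mentions are (start, end) token spans, inclusive; each cluster is one entity.
tokens = ["John", "went", "to", "the", "store", ".", "He", "bought", "milk", "."]
clusters = [[(0, 0), (6, 6)]]  # "John" and "He" co-refer

def cluster_texts(tokens, clusters):
    """Render each cluster as the surface text of its mentions."""
    return [[" ".join(tokens[s:e + 1]) for s, e in c] for c in clusters]

print(cluster_texts(tokens, clusters))  # [['John', 'He']]
```

At book scale, the same structure must hold clusters whose mentions are separated by hundreds of thousands of tokens, which is precisely what existing benchmarks fail to test.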
However, when it comes to analyzing much longer texts, like entire books, existing benchmarks fall short. Datasets like LitBank, while valuable, often truncate documents or consist of shorter samples, failing to capture the complex web of co-referring mentions that can span hundreds of thousands of tokens across a full narrative. This limitation means that current CR systems, optimized for shorter inputs, often struggle to maintain performance and consistency when faced with the vastness of a book.
To address this critical gap, researchers from the Sapienza NLP Group at Sapienza University of Rome have introduced a groundbreaking new resource: BOOK COREF. This is the first-ever book-scale coreference benchmark, designed specifically to evaluate and advance CR systems for full narrative texts. With an average document length exceeding 200,000 tokens, BOOK COREF provides an unprecedented challenge for the field.
The creation of BOOK COREF involved a novel automatic pipeline that produces high-quality coreference annotations. This pipeline focuses specifically on characters within a book, recognizing their central role in fictional stories. The process unfolds in several key stages:
The BookCoref Pipeline: Annotating Books Automatically
First, the pipeline performs Character Linking, identifying explicit mentions of characters (like proper nouns) and linking them to a predefined list of characters for each book. This step leverages a fine-tuned Entity Linking system to ensure accuracy.
Next, a Cluster Refinement step uses a Large Language Model (LLM) to verify the precision of these initial links. The LLM is prompted to confirm whether a highlighted mention in context accurately corresponds to a character, filtering out incorrect assignments and ensuring a high-precision starting point.
Finally, Cluster Expansion takes these refined clusters of explicit mentions and expands them to include all other co-referring mentions, such as pronouns and other noun phrases. This is done by applying a state-of-the-art CR model, Maverick, to smaller, consecutive windows of the text. To ensure comprehensive coverage across the entire book, an intermediate grouping step merges these windows and runs a second expansion, effectively capturing long-distance coreference relations that might otherwise be missed.
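The windowed expansion-and-merge idea can be sketched as follows. This is a simplified illustration, not the paper's actual code: `run_cr_model` stands in for Maverick and returns, per window, clusters of window-local token positions, while `anchors` holds the book-level positions of each character's linked explicit mentions from the earlier Character Linking step.

```python
def expand_in_windows(tokens, anchors, window_size, run_cr_model):
    """Expand character clusters over a full book by running a CR model
    on consecutive text windows and merging each predicted cluster into
    the character whose anchor mentions it contains.
    `anchors` maps character name -> set of book-level token positions."""
    book_clusters = {name: set(pos) for name, pos in anchors.items()}
    for start in range(0, len(tokens), window_size):
        window = tokens[start:start + window_size]
        for cluster in run_cr_model(window):
            # shift window-local positions back to book-level offsets
            positions = {i + start for i in cluster}
            for name, anchor_pos in anchors.items():
                if positions & anchor_pos:
                    book_clusters[name] |= positions
    return book_clusters

# Toy stand-in for the CR model: cluster all capitalized tokens in a window.
def toy_model(window):
    cluster = {i for i, t in enumerate(window) if t[:1].isupper()}
    return [cluster] if cluster else []
```

With `tokens = ["John", "went", ".", "He", "slept", ".", "John", "smiled"]` and anchor `{"John": {0, 6}}`, a window size of 6 links the pronoun "He" in the first window and the second occurrence of "John" in the next into one book-level cluster. The paper's actual pipeline adds an intermediate grouping step that re-runs expansion over merged windows to recover long-distance links this single pass would miss.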
The robustness of this automatic procedure was extensively validated, with the silver-annotation pipeline achieving an impressive MUC score of 93.3, comparable to human inter-annotator agreement rates on other established datasets.
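MUC is a link-based metric: recall measures, for each gold cluster, what fraction of its coreference links the system's clusters preserve, and precision is the same computation with the roles swapped. A minimal sketch of the standard computation (clusters as sets of hashable mentions; helper names are illustrative):

```python
def muc(key, response):
    """Link-based MUC score between gold (`key`) and predicted (`response`)
    clusterings, each given as a list of sets of mentions."""
    def links(gold, pred):
        # map every mention in `pred` to the id of its cluster
        mention_to_cluster = {m: i for i, c in enumerate(pred) for m in c}
        num = den = 0
        for c in gold:
            # partition the gold cluster by predicted clusters; mentions
            # the prediction misses count as singleton parts
            parts = {mention_to_cluster.get(m, ("miss", m)) for m in c}
            num += len(c) - len(parts)  # links preserved
            den += len(c) - 1           # links in the gold cluster
        return num, den

    r_num, r_den = links(key, response)
    p_num, p_den = links(response, key)
    recall = r_num / r_den if r_den else 0.0
    precision = p_num / p_den if p_den else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, `muc([{1, 2, 3}, {4, 5}], [{1, 2}, {3, 4, 5}])` yields precision and recall of 2/3 each, since each side preserves two of the other's three links.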
Unprecedented Scale and New Challenges
BOOK COREF comprises two main parts: BOOK COREF silver, a large-scale training corpus of 50 books with over 10 million tagged tokens, and BOOK COREF gold, a manually annotated test set of three full books: George Orwell’s Animal Farm, Hermann Hesse’s Siddhartha, and Jane Austen’s Pride and Prejudice. This gold standard allows for rigorous evaluation of CR models at true book scale.
Experiments conducted on BOOK COREF reveal significant insights. While current long-document CR systems like Longdoc and Dual cache show improved performance when fine-tuned on the BOOK COREF silver data, they still exhibit a notable decrease in performance when evaluated on full books compared to smaller, split segments of the same texts. For instance, the best-performing model, Longdoc, achieved 67.0 CoNLL-F1 points on full books, but scores jumped to 77.1 points when evaluated on medium-sized text windows. This highlights that processing entire books introduces unique challenges that current models are not yet fully equipped to handle.
The research also points out open challenges, such as the need to better understand and improve coreference metrics for full-book evaluation, and the development of new systems that can truly leverage book-scale annotations. Furthermore, it emphasizes the need for efficient solutions to adapt large generative and encoder-only models for book-scale processing without incurring prohibitive computational costs.
By making their data and code publicly available at https://github.com/sapienzanlp/bookcoref, the researchers aim to foster further investigation and development in this crucial area of NLP, paving the way for more accurate and robust coreference resolution in the context of long-form narratives.


