TL;DR: This research introduces BOOK COREF, the first book-scale coreference resolution benchmark, addressing a key limitation of existing datasets, which cover only short- and medium-length texts. It details an automatic pipeline that produces high-quality annotations focused on book characters. Experiments show that while current systems improve when trained on BOOK COREF, they still fall well short of their segment-level performance when evaluated on full books, highlighting new research directions for long-document coreference resolution.
Coreference Resolution (CR) is a fundamental task in Natural Language Processing (NLP) that involves identifying and grouping mentions in a text that refer to the same real-world entity. For instance, in a sentence like “John went to the store. He bought milk,” both “John” and “He” refer to the same person. While CR systems have made significant strides, their evaluation has traditionally been limited to small- to medium-sized documents, such as news articles or short stories.
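Concretely, a CR system's output can be represented as clusters of mention spans, with each cluster grouping every span that refers to one entity. A minimal illustration of this data structure (the tokenization and helper name here are just for exposition, not from the paper):

```python
# Mentions are (start, end) token spans, inclusive; each cluster is one entity.
tokens = ["John", "went", "to", "the", "store", ".", "He", "bought", "milk", "."]
clusters = [[(0, 0), (6, 6)]]  # "John" and "He" co-refer

def cluster_texts(tokens, clusters):
    """Render each cluster as the surface text of its mentions."""
    return [[" ".join(tokens[s:e + 1]) for s, e in c] for c in clusters]

print(cluster_texts(tokens, clusters))  # [['John', 'He']]
```

At book scale, the same structure must hold clusters whose mentions are separated by hundreds of thousands of tokens, which is precisely what existing benchmarks fail to test.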
However, when it comes to analyzing much longer texts, like entire books, existing benchmarks fall short. Datasets like LitBank, while valuable, often truncate documents or consist of shorter samples, failing to capture the complex web of co-referring mentions that can span hundreds of thousands of tokens across a full narrative. This limitation means that current CR systems, optimized for shorter inputs, often struggle to maintain performance and consistency when faced with the vastness of a book.
To address this critical gap, researchers from the Sapienza NLP Group at Sapienza University of Rome have introduced a groundbreaking new resource: BOOK COREF. This is the first-ever book-scale coreference benchmark, designed specifically to evaluate and advance CR systems for full narrative texts. With an average document length exceeding 200,000 tokens, BOOK COREF provides an unprecedented challenge for the field.
The creation of BOOK COREF involved a novel automatic pipeline that produces high-quality coreference annotations. This pipeline focuses specifically on characters within a book, recognizing their central role in fictional stories. The process unfolds in several key stages:
The BookCoref Pipeline: Annotating Books Automatically
First, the pipeline performs Character Linking, identifying explicit mentions of characters (like proper nouns) and linking them to a predefined list of characters for each book. This step leverages a fine-tuned Entity Linking system to ensure accuracy.
Next, a Cluster Refinement step uses a Large Language Model (LLM) to verify the precision of these initial links. The LLM is prompted to confirm whether a highlighted mention in context accurately corresponds to a character, filtering out incorrect assignments and ensuring a high-precision starting point.
Finally, Cluster Expansion takes these refined clusters of explicit mentions and expands them to include all other co-referring mentions, such as pronouns and other noun phrases. This is done by applying a state-of-the-art CR model, Maverick, to smaller, consecutive windows of the text. To ensure comprehensive coverage across the entire book, an intermediate grouping step merges these windows and runs a second expansion, effectively capturing long-distance coreference relations that might otherwise be missed.
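The windowed expansion-and-merge idea can be sketched as follows. This is a simplified illustration, not the paper's actual code: `run_cr_model` stands in for Maverick and returns, per window, clusters of window-local token positions, while `anchors` holds the book-level positions of each character's linked explicit mentions from the earlier Character Linking step.

```python
def expand_in_windows(tokens, anchors, window_size, run_cr_model):
    """Expand character clusters over a full book by running a CR model
    on consecutive text windows and merging each predicted cluster into
    the character whose anchor mentions it contains.
    `anchors` maps character name -> set of book-level token positions."""
    book_clusters = {name: set(pos) for name, pos in anchors.items()}
    for start in range(0, len(tokens), window_size):
        window = tokens[start:start + window_size]
        for cluster in run_cr_model(window):
            # shift window-local positions back to book-level offsets
            positions = {i + start for i in cluster}
            for name, anchor_pos in anchors.items():
                if positions & anchor_pos:
                    book_clusters[name] |= positions
    return book_clusters

# Toy stand-in for the CR model: cluster all capitalized tokens in a window.
def toy_model(window):
    cluster = {i for i, t in enumerate(window) if t[:1].isupper()}
    return [cluster] if cluster else []
```

With `tokens = ["John", "went", ".", "He", "slept", ".", "John", "smiled"]` and anchor `{"John": {0, 6}}`, a window size of 6 links the pronoun "He" in the first window and the second occurrence of "John" in the next into one book-level cluster. The paper's actual pipeline adds an intermediate grouping step that re-runs expansion over merged windows to recover long-distance links this single pass would miss.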
The robustness of this automatic procedure was extensively validated, with the silver-annotation pipeline achieving an impressive MUC score of 93.3, comparable to human inter-annotator agreement rates on other established datasets.
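MUC is a link-based metric: recall measures, for each gold cluster, what fraction of its coreference links the system's clusters preserve, and precision is the same computation with the roles swapped. A minimal sketch of the standard computation (clusters as sets of hashable mentions; helper names are illustrative):

```python
def muc(key, response):
    """Link-based MUC score between gold (`key`) and predicted (`response`)
    clusterings, each given as a list of sets of mentions."""
    def links(gold, pred):
        # map every mention in `pred` to the id of its cluster
        mention_to_cluster = {m: i for i, c in enumerate(pred) for m in c}
        num = den = 0
        for c in gold:
            # partition the gold cluster by predicted clusters; mentions
            # the prediction misses count as singleton parts
            parts = {mention_to_cluster.get(m, ("miss", m)) for m in c}
            num += len(c) - len(parts)  # links preserved
            den += len(c) - 1           # links in the gold cluster
        return num, den

    r_num, r_den = links(key, response)
    p_num, p_den = links(response, key)
    recall = r_num / r_den if r_den else 0.0
    precision = p_num / p_den if p_den else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, `muc([{1, 2, 3}, {4, 5}], [{1, 2}, {3, 4, 5}])` yields precision and recall of 2/3 each, since each side preserves two of the other's three links.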
Unprecedented Scale and New Challenges
BOOK COREF comprises two main parts: BOOK COREF silver, a large-scale training corpus of 50 books with over 10 million tagged tokens, and BOOK COREF gold, a manually annotated test set of three full books: George Orwell’s Animal Farm, Hermann Hesse’s Siddhartha, and Jane Austen’s Pride and Prejudice. This gold standard allows for rigorous evaluation of CR models at true book scale.
Experiments conducted on BOOK COREF reveal significant insights. While current long-document CR systems like Longdoc and Dual cache show improved performance when fine-tuned on the BOOK COREF silver data, they still exhibit a notable decrease in performance when evaluated on full books compared to smaller, split segments of the same texts. For instance, the best-performing model, Longdoc, achieved 67.0 CoNLL-F1 points on full books, but scores jumped to 77.1 points when evaluated on medium-sized text windows. This highlights that processing entire books introduces unique challenges that current models are not yet fully equipped to handle.
The research also points out open challenges, such as the need to better understand and improve coreference metrics for full-book evaluation, and the development of new systems that can truly leverage book-scale annotations. Furthermore, it emphasizes the need for efficient solutions to adapt large generative and encoder-only models for book-scale processing without incurring prohibitive computational costs.
By making their data and code publicly available at https://github.com/sapienzanlp/bookcoref, the researchers aim to foster further investigation and development in this crucial area of NLP, paving the way for more accurate and robust coreference resolution in the context of long-form narratives.


