TLDR: NARRABENCH is a new, theory-informed framework for evaluating how well Large Language Models (LLMs) understand narratives. It introduces a comprehensive taxonomy of 50 narrative tasks across four dimensions (Story, Narration, Discourse, Situatedness) and surveys 78 existing benchmarks. The research reveals that current benchmarks only cover about 27% of narrative understanding tasks, with significant gaps in subjective aspects and areas like events, style, and perspective. NARRABENCH provides a roadmap for future benchmark development, emphasizing the need for more robust and theoretically consistent evaluations to truly measure AI’s narrative comprehension.
In the rapidly evolving world of artificial intelligence, Large Language Models (LLMs) are becoming increasingly adept at generating and understanding human-like text. However, truly assessing their ability to grasp the nuances of narrative – the very fabric of human communication – has remained a significant challenge. A new research paper introduces NARRABENCH, a groundbreaking framework designed to provide a comprehensive and theoretically informed approach to benchmarking LLM narrative understanding.
Authored by Sil Hamilton and Matthew Wilkens from Cornell University, and Andrew Piper from McGill University, the paper titled “NARRABENCH: A Comprehensive Framework for Narrative Benchmarking” addresses a critical gap in current evaluation methods. The researchers highlight that while LLMs are frequently tested on tasks that involve narrative skills, these evaluations often lack theoretical consistency and fail to cover the full spectrum of narrative complexity.
The Need for a New Approach
The NARRABENCH team conducted an extensive survey of 78 existing benchmarks related to narrative understanding. Their findings were striking: they estimate that only about 27% of narrative tasks are adequately captured by current benchmarks. This means a vast majority of narrative understanding aspects are either overlooked or poorly aligned with existing metrics. Areas like narrative events, style, perspective, and revelation are notably absent from many evaluations.
Furthermore, the paper points out a significant limitation in existing benchmarks: their overwhelming focus on single, ‘correct’ answers. Narratives, however, are often subjective and open to interpretation. Current evaluations struggle to assess these ‘perspectival’ aspects, where there isn’t a single right answer but rather a distribution of plausible interpretations.
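To make the 'perspectival' idea concrete, here is a minimal Python sketch of one way a benchmark could award credit against a distribution of human interpretations rather than a single gold label. The function names and the scoring rule (one minus total variation distance) are our own illustration of the general idea, not a method taken from the paper:

```python
from collections import Counter

def label_distribution(labels):
    """Normalize a list of annotator labels into a probability distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def perspectival_score(model_answers, human_labels):
    """Score agreement between the model's and humans' label distributions.

    Returns 1 - total variation distance: 1.0 means the model's answers
    match the human distribution exactly; 0.0 means no shared mass.
    """
    p = label_distribution(model_answers)
    q = label_distribution(human_labels)
    support = set(p) | set(q)
    tvd = 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)
    return 1.0 - tvd

# Example: "Is the narrator reliable?" -- annotators legitimately disagree,
# and that disagreement is exactly what a single-answer benchmark erases.
humans = ["unreliable", "unreliable", "reliable", "ambiguous"]
model = ["unreliable", "ambiguous", "unreliable", "unreliable"]
print(perspectival_score(model, humans))  # -> 0.75, partial credit
```

A deterministic benchmark would mark each answer simply right or wrong; a distribution-based score like this instead rewards a model for reproducing the spread of plausible human readings.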
Introducing the NARRABENCH Taxonomy
At the heart of NARRABENCH is a novel taxonomy of fifty distinct narrative understanding tasks. This taxonomy is built upon well-established theoretical frameworks from narratology, integrating decades of literary theory into a computational context. It organizes narrative understanding into two core levels:
- Narrative Aspects: Fifty specific tasks, each mapped to one of twelve primary narrative features (e.g., Agents, Events, Plot, Perspective, Style, Time, Motivation). These features are in turn grouped under four fundamental narrative dimensions: Story (what happened), Narration (who speaks), Discourse (how it was told), and Situatedness (the social context).
- Evaluation Criteria: This level defines how tasks should be assessed, considering textual scale (local, global, meso), mode (discrete, progressive, holistic judgments), and the expected variance of answers (deterministic, consensus, perspectival).
This systematic integration provides a unified theoretical framework that allows for future expansion and emphasizes the importance of perspectival alignment in benchmark development.
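To see how the two levels fit together, here is a rough Python sketch of the taxonomy as a data structure. The four dimensions, the example feature, and the three evaluation axes come from the paper's description; the class names, the schema itself, and the sample task entry are hypothetical illustrations, not NARRABENCH's actual implementation:

```python
from dataclasses import dataclass
from enum import Enum

class Dimension(Enum):
    STORY = "what happened"
    NARRATION = "who speaks"
    DISCOURSE = "how it was told"
    SITUATEDNESS = "the social context"

class Scale(Enum):          # textual scale of the judgment
    LOCAL = "local"
    MESO = "meso"
    GLOBAL = "global"

class Mode(Enum):           # how the judgment is rendered
    DISCRETE = "discrete"
    PROGRESSIVE = "progressive"
    HOLISTIC = "holistic"

class Variance(Enum):       # expected variance of answers
    DETERMINISTIC = "one right answer"
    CONSENSUS = "broad agreement expected"
    PERSPECTIVAL = "distribution of plausible readings"

@dataclass
class NarrativeTask:
    name: str
    feature: str            # one of the twelve features, e.g. "Perspective"
    dimension: Dimension    # one of the four dimensions
    scale: Scale
    mode: Mode
    variance: Variance

# Hypothetical entry: judging narrator reliability might be framed as a
# narration-level, text-global, holistic judgment on which readers can
# legitimately disagree.
task = NarrativeTask(
    name="narrator reliability",
    feature="Perspective",
    dimension=Dimension.NARRATION,
    scale=Scale.GLOBAL,
    mode=Mode.HOLISTIC,
    variance=Variance.PERSPECTIVAL,
)
```

A schema like this makes the paper's central point legible: every task carries not just a subject (its feature and dimension) but an explicit contract for how it should be evaluated.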
Key Insights from the Benchmark Survey
The survey revealed several important trends and limitations in current benchmarking efforts:
- Increasing Popularity: Half of the surveyed benchmarks were published in 2024 or 2025, indicating fast-growing interest in evaluating narrative understanding in LLMs.
- Beyond Classification: While many benchmarks still rely on classification tasks, there’s a positive shift towards evaluating open-ended text generation, which offers a more direct window into LLM proficiencies.
- Overemphasis on ‘Story’: Current efforts disproportionately focus on story-level content, neglecting more complex dimensions like narration, discourse structure, and social context.
- Missing Areas: There’s a notable lack of benchmarks specifically for event-related tasks, style-specific understanding (like allusion or figurative language), and the subjective nature of narrative responses.
- Language Gap: The vast majority of benchmarks are English-centric, leaving a significant gap in evaluating narrative understanding in non-English and low-resource languages.
- Data Accessibility: Only half of the identified benchmarks made their code and data openly available, hindering reproducibility and collaborative progress.
Charting a Path Forward
NARRABENCH is not just an analysis; it’s a roadmap. The framework highlights where existing benchmarks fit and, more importantly, where new work is desperately needed. It encourages NLP researchers to develop new benchmarks and data to create a robust resource for assessing LLM performance on a central aspect of human communication.
The authors envision NARRABENCH as an expandable framework, inviting community involvement to fill the identified gaps and propose novel features. They plan to maintain a live spreadsheet and produce a reference implementation – a unified testing harness – to guide future efforts and ensure theoretical consistency.
By providing a structured, extensible foundation for assessing narrative understanding, NARRABENCH aims to centralize and guide the growing efforts in narrative benchmarking, ensuring that LLMs are evaluated not only for their formal proficiency but also for their capacity to reflect the diversity, accountability, and responsibility inherent in human storytelling. For more details, you can read the full research paper here.