TLDR: This research introduces a scalable framework using Large Language Models (LLMs) to automatically analyze large oral history archives, specifically focusing on Japanese American Incarceration narratives. By combining expert human annotation with advanced prompt engineering for LLMs like ChatGPT, Llama, and Qwen, the study successfully performs semantic and sentiment classification on over 92,000 sentences, demonstrating that LLMs can effectively extract meaning and emotional tone from historically sensitive, unstructured data. The findings highlight the crucial role of prompt design and the potential of LLMs to enhance the accessibility and interpretation of historical testimonies.
Oral histories are invaluable records of personal experiences, offering unique perspectives often missing from official historical accounts. They are particularly crucial for understanding communities that have faced systemic injustice and historical erasure. However, analyzing vast archives of oral histories has traditionally been a challenging task due to their unstructured nature, the emotional depth they contain, and the high cost of manual annotation.
A recent research paper, titled Large Language Models for Oral History Understanding with Text Classification and Sentiment Analysis, proposes a groundbreaking solution to these challenges. Authored by Komala Subramanyam Cherukuri, Pranav Abishai Moses, and Aisa Sakata from the University of North Texas, along with Jiangping Chen from the University of Illinois Urbana-Champaign, and Haihua Chen from the University of North Texas, this study introduces a scalable framework to automate the semantic and sentiment annotation of oral history archives.
Bridging History and AI
The researchers focused their efforts on the Japanese American Incarceration Oral History (JAIOH) collection, a historically sensitive corpus. Their goal was to leverage Large Language Models (LLMs) to construct a high-quality dataset, systematically evaluate the performance of various LLMs, and explore effective prompt engineering strategies for annotation in such delicate contexts.
The methodology involved a multi-phase approach. Initially, a set of 558 sentences from 15 narrators was meticulously labeled by human experts for both sentiment (positive, neutral, negative) and semantic classification into six categories such as ‘Biographical Information’, ‘Life During Incarceration’, and ‘Military Service’. This expertly annotated data served as a benchmark.
Following this, the team experimented with prominent LLMs, including ChatGPT, Llama, and Qwen. They tested different prompt engineering strategies (how instructions are given to the LLMs), such as zero-shot (no examples in the prompt), few-shot (a few labeled examples included in the prompt), and retrieval-augmented generation (RAG), where the model is given additional context retrieved from external sources. This allowed them to identify the most effective ways to guide the LLMs toward accurate annotation.
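To make the distinction concrete, here is a minimal sketch of how zero-shot and few-shot prompts for sentence-level sentiment labeling might be assembled. The instruction wording and the labeled examples are illustrative assumptions, not the paper's actual prompts.

```python
# Illustrative zero-shot vs. few-shot prompt construction for sentiment
# labeling. The instruction text and example sentences are hypothetical.

ZERO_SHOT = (
    "Classify the sentiment of the following oral history sentence "
    "as positive, neutral, or negative.\n\n"
    "Sentence: {sentence}\nSentiment:"
)

# Few-shot: a handful of labeled demonstrations precede the target sentence.
FEW_SHOT_EXAMPLES = [
    ("We rebuilt our lives after the war.", "positive"),
    ("The barracks were cold and crowded.", "negative"),
    ("My family moved to California in 1935.", "neutral"),
]

def build_few_shot_prompt(sentence: str) -> str:
    """Prepend labeled examples so the model can infer the task format."""
    demos = "\n".join(
        f"Sentence: {s}\nSentiment: {label}" for s, label in FEW_SHOT_EXAMPLES
    )
    return (
        "Classify the sentiment of each oral history sentence as "
        "positive, neutral, or negative.\n\n"
        f"{demos}\nSentence: {sentence}\nSentiment:"
    )

zero = ZERO_SHOT.format(sentence="We left everything behind.")
few = build_few_shot_prompt("We left everything behind.")
```

A RAG variant would extend this by retrieving the most similar already-labeled sentences from an external store and inserting those, rather than a fixed example set, into the prompt.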
Key Findings and Performance
The study yielded significant insights into LLM capabilities for oral history analysis. For semantic classification, ChatGPT demonstrated the highest accuracy, achieving an F1 score of 88.71%; Llama and Qwen also performed strongly, with F1 scores of 84.99% and 83.72% respectively. In sentiment analysis, Llama slightly edged out the others with an F1 score of 82.87%, followed closely by Qwen (82.66%) and ChatGPT (82.29%), making all three models effectively comparable on that task.
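For readers unfamiliar with the metric, an F1 score over multiple classes is typically the macro average of per-class F1 values (the harmonic mean of precision and recall for each class). A self-contained sketch, with no external libraries and made-up labels:

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute per-class F1, then take the unweighted mean."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1_scores.append(
            2 * precision * recall / (precision + recall)
            if precision + recall else 0.0
        )
    return sum(f1_scores) / len(f1_scores)

# Toy example: three sentiment classes, four predictions.
y_true = ["positive", "neutral", "negative", "neutral"]
y_pred = ["positive", "neutral", "neutral", "neutral"]
score = macro_f1(y_true, y_pred)
```

Whether the paper reports macro- or weighted-average F1 is not stated here; this is only the standard construction of the metric.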
A crucial finding was the importance of prompt design. For sentiment analysis, concise prompts (shorter, focused instructions) often outperformed longer, more detailed ones across all models. However, for the more complex semantic classification, refined and detailed prompts proved to be more effective, especially for Llama, which benefited significantly from structured and context-aware instructions.
Based on these evaluations, the best-performing prompt configurations were selected for each task. These optimized settings were then used to automatically annotate a massive corpus of 92,191 sentences from 1,002 interviews within the JAIOH collection. This automated process created a large-scale annotated oral history corpus, providing a valuable resource for future research.
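Schematically, scaling an evaluated prompt configuration to a 92,000-sentence corpus reduces to a batch loop over sentences. In this sketch `classify_sentence` is a hypothetical stand-in for a real LLM API call (its keyword rule exists only so the example runs); the paper's actual pipeline and response parsing are not reproduced here.

```python
# Schematic batch-annotation pass over a corpus. `classify_sentence` is a
# hypothetical stub: a real pipeline would send the selected prompt to an
# LLM (e.g. ChatGPT, Llama, or Qwen) and parse the model's response.

def classify_sentence(sentence: str) -> dict:
    """Stub classifier for illustration only (trivial keyword heuristic)."""
    sentiment = "negative" if "camp" in sentence.lower() else "neutral"
    return {"sentence": sentence, "sentiment": sentiment}

def annotate_corpus(sentences):
    """Apply the classifier to every sentence, collecting structured labels."""
    return [classify_sentence(s) for s in sentences]

corpus = [
    "We were sent to the camp in 1942.",
    "My father worked as a fisherman.",
]
annotations = annotate_corpus(corpus)
```

The payoff of this structure is that the expensive design work (prompt selection, validation against expert labels) happens once on the small benchmark set, after which the same configuration is applied uniformly across the full collection.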
Unlocking Historical Narratives
The automated annotation revealed fascinating patterns within the Japanese American incarceration narratives. For instance, ‘Life During Incarceration’ emerged as the most prevalent semantic category, comprising a significant portion of the sentences. The sentiment analysis showed that personal hardship and injustice were often associated with negative sentiment, while descriptive or aspirational content tended to be more neutral.
Beyond classification, the researchers also performed entity extraction (identifying people, places, organizations) and topic modeling using BERTopic. This allowed for a deeper understanding of how narrators referenced specific elements in their stories and how different themes were framed emotionally. For example, topics like ‘education’ or ‘camp life’ appeared across different sentiments but with distinct emotional tones.
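Observing how a topic's sentences distribute across sentiment classes amounts to a cross-tabulation of the two label sets. A minimal stdlib sketch, with made-up (topic, sentiment) pairs standing in for the outputs of BERTopic and the sentiment classifier:

```python
from collections import Counter

# Hypothetical per-sentence (topic, sentiment) labels, as might come out of
# topic modeling plus sentiment classification; the data here is invented.
labels = [
    ("camp life", "negative"),
    ("camp life", "neutral"),
    ("education", "positive"),
    ("camp life", "negative"),
    ("education", "neutral"),
]

# Count how each topic's sentences distribute over sentiment classes.
by_topic: dict[str, Counter] = {}
for topic, sentiment in labels:
    by_topic.setdefault(topic, Counter())[sentiment] += 1
```

A table like `by_topic` is what lets one say, as the authors do, that the same theme (e.g. ‘camp life’) surfaces across sentiments but with a distinct emotional profile.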
Implications for Digital Humanities and Beyond
This research offers a reusable annotation pipeline and practical guidance for applying LLMs in low-resource, culturally sensitive domains. It demonstrates that LLMs can effectively perform semantic and sentiment annotation across large oral history collections when guided by carefully designed prompts. By bridging archival ethics with scalable Natural Language Processing techniques, this work lays the groundwork for the responsible use of artificial intelligence in digital humanities and the preservation of collective memory.
The creation of this large-scale annotated dataset directly benefits the archiving community and researchers, providing a foundation for developing sophisticated tools for analyzing personal narratives. It also has broader societal implications, enabling the integration of thematically organized narratives into educational materials and informing public policy by surfacing lived experiences.
While the study acknowledges limitations, such as its focus on a single dataset, it opens avenues for future work, including evaluating the framework on diverse oral history corpora and further refining prompt optimization techniques. This thoughtful integration of human expertise and machine learning promises to enhance the accessibility, structure, and interpretability of vast oral history archives for generations to come.