TLDR: This research introduces a scalable framework using Large Language Models (LLMs) to automatically analyze large oral history archives, specifically focusing on Japanese American Incarceration narratives. By combining expert human annotation with advanced prompt engineering for LLMs like ChatGPT, Llama, and Qwen, the study successfully performs semantic and sentiment classification on over 92,000 sentences, demonstrating that LLMs can effectively extract meaning and emotional tone from historically sensitive, unstructured data. The findings highlight the crucial role of prompt design and the potential of LLMs to enhance the accessibility and interpretation of historical testimonies.
Oral histories are invaluable records of personal experiences, offering unique perspectives often missing from official historical accounts. They are particularly crucial for understanding communities that have faced systemic injustice and historical erasure. However, analyzing vast archives of oral histories has traditionally been a challenging task due to their unstructured nature, the emotional depth they contain, and the high cost of manual annotation.
A recent research paper, titled Large Language Models for Oral History Understanding with Text Classification and Sentiment Analysis, proposes a groundbreaking solution to these challenges. Authored by Komala Subramanyam Cherukuri, Pranav Abishai Moses, and Aisa Sakata from the University of North Texas, along with Jiangping Chen from the University of Illinois Urbana-Champaign, and Haihua Chen from the University of North Texas, this study introduces a scalable framework to automate the semantic and sentiment annotation of oral history archives.
Bridging History and AI
The researchers focused their efforts on the Japanese American Incarceration Oral History (JAIOH) collection, a historically sensitive corpus. Their goal was to leverage Large Language Models (LLMs) to construct a high-quality dataset, systematically evaluate the performance of various LLMs, and explore effective prompt engineering strategies for annotation in such delicate contexts.
The methodology involved a multi-phase approach. Initially, a set of 558 sentences from 15 narrators was meticulously labeled by human experts for both sentiment (positive, neutral, negative) and semantic classification into six categories such as ‘Biographical Information’, ‘Life During Incarceration’, and ‘Military Service’. This expertly annotated data served as a benchmark.
Following this, the team experimented with prominent LLMs, including ChatGPT, Llama, and Qwen. They tested different prompt engineering strategies (how instructions are given to the LLMs), such as zero-shot (no examples in the prompt), few-shot (a few labeled examples included in the prompt), and retrieval-augmented generation (RAG), where the model is given additional context retrieved from external sources. This allowed them to identify the most effective ways to guide the LLMs toward accurate annotation.
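To make the distinction concrete, here is a minimal sketch of how zero-shot and few-shot prompts for sentence-level sentiment labeling might be assembled. The instruction wording and the labeled examples are illustrative assumptions, not the paper's actual prompts.

```python
# Illustrative zero-shot vs. few-shot prompt construction for sentiment
# labeling. The instruction text and example sentences are hypothetical.

ZERO_SHOT = (
    "Classify the sentiment of the following oral history sentence "
    "as positive, neutral, or negative.\n\n"
    "Sentence: {sentence}\nSentiment:"
)

# Few-shot: a handful of labeled demonstrations precede the target sentence.
FEW_SHOT_EXAMPLES = [
    ("We rebuilt our lives after the war.", "positive"),
    ("The barracks were cold and crowded.", "negative"),
    ("My family moved to California in 1935.", "neutral"),
]

def build_few_shot_prompt(sentence: str) -> str:
    """Prepend labeled examples so the model can infer the task format."""
    demos = "\n".join(
        f"Sentence: {s}\nSentiment: {label}" for s, label in FEW_SHOT_EXAMPLES
    )
    return (
        "Classify the sentiment of each oral history sentence as "
        "positive, neutral, or negative.\n\n"
        f"{demos}\nSentence: {sentence}\nSentiment:"
    )

zero = ZERO_SHOT.format(sentence="We left everything behind.")
few = build_few_shot_prompt("We left everything behind.")
```

A RAG variant would extend this by retrieving the most similar already-labeled sentences from an external store and inserting those, rather than a fixed example set, into the prompt.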
Key Findings and Performance
The study yielded significant insights into LLM capabilities for oral history analysis. For semantic classification, ChatGPT demonstrated the highest accuracy, achieving an F1 score of 88.71%; Llama and Qwen also performed strongly, with F1 scores of 84.99% and 83.72% respectively. In sentiment analysis, Llama slightly edged out the others with an F1 score of 82.87%, followed closely by Qwen (82.66%) and ChatGPT (82.29%), making all three models effectively comparable on that task.
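For readers unfamiliar with the metric, an F1 score over multiple classes is typically the macro average of per-class F1 values (the harmonic mean of precision and recall for each class). A self-contained sketch, with no external libraries and made-up labels:

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute per-class F1, then take the unweighted mean."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1_scores.append(
            2 * precision * recall / (precision + recall)
            if precision + recall else 0.0
        )
    return sum(f1_scores) / len(f1_scores)

# Toy example: three sentiment classes, four predictions.
y_true = ["positive", "neutral", "negative", "neutral"]
y_pred = ["positive", "neutral", "neutral", "neutral"]
score = macro_f1(y_true, y_pred)
```

Whether the paper reports macro- or weighted-average F1 is not stated here; this is only the standard construction of the metric.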
A crucial finding was the importance of prompt design. For sentiment analysis, concise prompts (shorter, focused instructions) often outperformed longer, more detailed ones across all models. However, for the more complex semantic classification, refined and detailed prompts proved to be more effective, especially for Llama, which benefited significantly from structured and context-aware instructions.
Based on these evaluations, the best-performing prompt configurations were selected for each task. These optimized settings were then used to automatically annotate a massive corpus of 92,191 sentences from 1,002 interviews within the JAIOH collection. This automated process created a large-scale annotated oral history corpus, providing a valuable resource for future research.
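Schematically, scaling an evaluated prompt configuration to a 92,000-sentence corpus reduces to a batch loop over sentences. In this sketch `classify_sentence` is a hypothetical stand-in for a real LLM API call (its keyword rule exists only so the example runs); the paper's actual pipeline and response parsing are not reproduced here.

```python
# Schematic batch-annotation pass over a corpus. `classify_sentence` is a
# hypothetical stub: a real pipeline would send the selected prompt to an
# LLM (e.g. ChatGPT, Llama, or Qwen) and parse the model's response.

def classify_sentence(sentence: str) -> dict:
    """Stub classifier for illustration only (trivial keyword heuristic)."""
    sentiment = "negative" if "camp" in sentence.lower() else "neutral"
    return {"sentence": sentence, "sentiment": sentiment}

def annotate_corpus(sentences):
    """Apply the classifier to every sentence, collecting structured labels."""
    return [classify_sentence(s) for s in sentences]

corpus = [
    "We were sent to the camp in 1942.",
    "My father worked as a fisherman.",
]
annotations = annotate_corpus(corpus)
```

The payoff of this structure is that the expensive design work (prompt selection, validation against expert labels) happens once on the small benchmark set, after which the same configuration is applied uniformly across the full collection.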
Unlocking Historical Narratives
The automated annotation revealed fascinating patterns within the Japanese American incarceration narratives. For instance, ‘Life During Incarceration’ emerged as the most prevalent semantic category, comprising a significant portion of the sentences. The sentiment analysis showed that personal hardship and injustice were often associated with negative sentiment, while descriptive or aspirational content tended to be more neutral.
Beyond classification, the researchers also performed entity extraction (identifying people, places, organizations) and topic modeling using BERTopic. This allowed for a deeper understanding of how narrators referenced specific elements in their stories and how different themes were framed emotionally. For example, topics like ‘education’ or ‘camp life’ appeared across different sentiments but with distinct emotional tones.
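Observing how a topic's sentences distribute across sentiment classes amounts to a cross-tabulation of the two label sets. A minimal stdlib sketch, with made-up (topic, sentiment) pairs standing in for the outputs of BERTopic and the sentiment classifier:

```python
from collections import Counter

# Hypothetical per-sentence (topic, sentiment) labels, as might come out of
# topic modeling plus sentiment classification; the data here is invented.
labels = [
    ("camp life", "negative"),
    ("camp life", "neutral"),
    ("education", "positive"),
    ("camp life", "negative"),
    ("education", "neutral"),
]

# Count how each topic's sentences distribute over sentiment classes.
by_topic: dict[str, Counter] = {}
for topic, sentiment in labels:
    by_topic.setdefault(topic, Counter())[sentiment] += 1
```

A table like `by_topic` is what lets one say, as the authors do, that the same theme (e.g. ‘camp life’) surfaces across sentiments but with a distinct emotional profile.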
Implications for Digital Humanities and Beyond
This research offers a reusable annotation pipeline and practical guidance for applying LLMs in low-resource, culturally sensitive domains. It demonstrates that LLMs can effectively perform semantic and sentiment annotation across large oral history collections when guided by carefully designed prompts. By bridging archival ethics with scalable Natural Language Processing techniques, this work lays the groundwork for the responsible use of artificial intelligence in digital humanities and the preservation of collective memory.
The creation of this large-scale annotated dataset directly benefits the archiving community and researchers, providing a foundation for developing sophisticated tools for analyzing personal narratives. It also has broader societal implications, enabling the integration of thematically organized narratives into educational materials and informing public policy by surfacing lived experiences.
While the study acknowledges limitations, such as its focus on a single dataset, it opens avenues for future work, including evaluating the framework on diverse oral history corpora and further refining prompt optimization techniques. This thoughtful integration of human expertise and machine learning promises to enhance the accessibility, structure, and interpretability of vast oral history archives for generations to come.