TLDR: This research introduces a new evaluation framework for Large Language Models (LLMs) used in Optical Character Recognition (OCR) of historical documents, specifically 18th-century Russian texts. It addresses the limitations of traditional metrics by proposing novel ones, such as the Historical Character Preservation Rate (HCPR) and the Archaic Insertion Rate (AIR), along with protocols for contamination control. The study found that Gemini and Qwen models outperform traditional OCR but exhibit “over-historicization,” incorrectly inserting archaic characters. Surprisingly, post-OCR correction often worsens results. The framework gives digital humanities scholars crucial guidelines for selecting and assessing LLMs for historical corpus creation.
The field of digital humanities is increasingly turning to Large Language Models (LLMs) for the challenging task of digitizing historical documents. However, a significant gap persists in how these powerful AI tools are evaluated for Optical Character Recognition (OCR) in historical contexts. Traditional metrics often fall short, failing to capture errors unique to LLMs, such as temporal biases and period-specific inaccuracies, whose detection is crucial for creating reliable historical corpora.
A new research paper, titled “Evaluating LLMs for Historical Document OCR: A Methodological Framework for Digital Humanities,” by Maria Levchenko, addresses this critical need. The study introduces a comprehensive evaluation methodology specifically designed for LLM-based historical OCR, focusing on issues like contamination risks and systematic biases in diplomatic transcription.
The Challenge with Historical Texts
Traditional OCR systems struggle immensely with historical documents due to non-standard typography, evolving orthographic conventions, and often degraded physical condition. Unlike conventional OCR models, whose training data and parameters researchers can control and fine-tune, LLMs are black boxes: their training data cannot be inspected and their internal workings cannot be modified. This necessitates new evaluation approaches that focus on external factors such as prompt engineering and processing modes.
Standard metrics like Character Error Rate (CER) and Word Error Rate (WER) are insufficient because they don’t account for LLM-specific behaviors such as “temporal conflation” – where models incorrectly apply orthographic features from different historical periods – or the insertion of anachronistic elements. Furthermore, the risk of training data contamination, where evaluation texts might have been included in an LLM’s pretraining, undermines traditional benchmarking.
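For reference, CER is conventionally computed as the character-level edit distance between a model’s output and the ground truth, normalized by the length of the ground truth; WER is the same calculation over word tokens. A minimal sketch in Python (not the paper’s code):

```python
from typing import Sequence

def levenshtein(ref: Sequence, hyp: Sequence) -> int:
    """Edit distance between two sequences via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance over reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: the same edit distance over word tokens."""
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / max(len(ref_words), 1)
```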
A Novel Evaluation Framework
The researchers tackled these challenges using 18th-century Russian texts printed in Civil font, a particularly difficult domain due to distinctive orthographic elements (like ‘i’, ‘ѣ’, ‘ъ’ at word endings), archaic grammatical forms, and their underrepresentation in digital corpora. The framework introduces several key innovations:
- Contamination-aware dataset creation protocols to ensure evaluation integrity.
- Novel metrics, **Historical Character Preservation Rate (HCPR)** and **Archaic Insertion Rate (AIR)**, designed to capture LLM-specific behaviors in historical contexts (a count-based sketch follows this list).
- Systematic analysis of different processing modes and prompt engineering strategies.
- Comprehensive stability testing to account for LLM output variability.
- Feature sensitivity analysis to identify document characteristics that impact performance.
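The article does not reproduce the paper’s formal definitions of HCPR and AIR, so the following is a count-based approximation under two assumptions: that HCPR measures the share of period-specific character occurrences in the ground truth that survive in the model output, and that AIR measures anachronistic characters inserted per output character. The character inventories here are illustrative, not the paper’s:

```python
from collections import Counter

# Illustrative inventories; the paper's exact character sets may differ.
HISTORICAL_CHARS = set("ѣъі")       # period-specific for 18th-c. Civil font
ANACHRONISTIC_CHARS = set("ѧѫѡѯѱ")  # Slavonic letters already eliminated

def hcpr(ground_truth: str, prediction: str) -> float:
    """Assumed reading of HCPR: share of historical-character occurrences
    in the ground truth that are preserved in the model output."""
    gt = Counter(c for c in ground_truth if c in HISTORICAL_CHARS)
    pred = Counter(c for c in prediction if c in HISTORICAL_CHARS)
    total = sum(gt.values())
    preserved = sum(min(n, pred[c]) for c, n in gt.items())
    return preserved / total if total else 1.0

def air(prediction: str) -> float:
    """Assumed reading of AIR: anachronistic characters inserted by the
    model, normalized by output length (period ground truth should
    contain none of them)."""
    inserted = sum(1 for c in prediction if c in ANACHRONISTIC_CHARS)
    return inserted / max(len(prediction), 1)

# Example: a model that drops the yat and over-historicizes with 'ѧ'
print(hcpr("человѣкъ", "человекѧ"))  # 0.0 - both ѣ and ъ lost
print(air("человекѧ"))               # 0.125 - one insertion in 8 chars
```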
The study evaluated 12 leading commercial and open-source multimodal LLMs on a new dataset of 1,029 pages from 428 unique 18th-century Russian books. The ground truth for this corpus was meticulously prepared through a multi-stage process involving layout analysis, initial OCR, and 100% manual correction by an expert annotator, adhering to diplomatic transcription principles.
Key Findings and Surprising Behaviors
The evaluation revealed systematic patterns in LLM behavior previously undocumented. While Gemini and Qwen models generally outperformed traditional OCR systems, a striking phenomenon termed “over-historicization” was observed. LLMs systematically inserted archaic Slavonic characters that had already been eliminated from the target historical period. This suggests that LLMs, lacking explicit period awareness, might generalize from a noisy mix of training data, interpreting rare or visually distinctive archaic forms as generic signals for “historical text” regardless of actual period accuracy.
Another counterintuitive finding was that post-OCR correction, where both the image and the OCR text were provided to higher-performing models, often degraded rather than improved performance. Models tended to re-perform OCR from the image rather than applying constrained edits to the provided text. Text-only correction consistently worsened results, with models introducing new errors.
Regarding processing modes, “Full Page Processing” generally yielded the best accuracy for most models, maximizing contextual information. However, for models highly sensitive to document length, a “Line-by-Line” mode could be preferable. Prompt engineering also played a role; context-enhanced Russian prompts led to statistically significant error reductions for several models, though the best models were less dependent on prompt tweaks.
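To make the two modes concrete, here is a hedged sketch of how a benchmarking harness might dispatch between them; `transcribe` and `segment_lines` are hypothetical stand-ins for the multimodal model call and the layout-analysis step, and the Russian prompt is illustrative rather than the study’s exact wording:

```python
from typing import Callable, List

def run_ocr(page_image: bytes,
            transcribe: Callable[[bytes, str], str],
            segment_lines: Callable[[bytes], List[bytes]],
            mode: str = "full_page") -> str:
    # A context-enhanced Russian prompt of the kind the study found
    # helpful for several models (wording is illustrative).
    prompt = ("Транскрибируйте текст со страницы русской книги XVIII века, "
              "сохраняя оригинальную орфографию (ѣ, і, ъ).")
    if mode == "full_page":
        # Best accuracy for most models: one call with maximal context.
        return transcribe(page_image, prompt)
    if mode == "line_by_line":
        # Preferable for models sensitive to document length.
        lines = segment_lines(page_image)
        return "\n".join(transcribe(line, prompt) for line in lines)
    raise ValueError(f"unknown mode: {mode}")
```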
The research also highlighted specific error patterns, such as frequent confusion between visually similar character pairs (e.g., ‘т’ and ‘ш’) and difficulties in preserving period-specific characters like ‘ї’ and ‘ѣ’, or correctly handling the hard sign ‘ъ’.
Implications for Digital Humanities
This methodology provides digital humanities practitioners with crucial guidelines for model selection and quality assessment in historical corpus digitization. The findings suggest that focusing on selecting optimal models for direct OCR is more effective than relying on post-correction pipelines, which often provide no benefit or actively harm accuracy.
The paper acknowledges limitations, including the dataset’s specificity to 18th-century Russian Civil font and the inherent non-determinism of LLM outputs. However, it sets a precedent for ongoing, transparent tracking of model progress in this vital area. For more details, you can read the full research paper here.


