TLDR: This research introduces a new evaluation framework for Large Language Models (LLMs) used in Optical Character Recognition (OCR) of historical documents, specifically 18th-century Russian texts. It addresses the limitations of traditional metrics by proposing novel ones, such as the Historical Character Preservation Rate (HCPR) and the Archaic Insertion Rate (AIR), along with protocols for contamination control. The study found that Gemini and Qwen models outperform traditional OCR but exhibit “over-historicization,” incorrectly inserting archaic characters. Surprisingly, post-OCR correction often worsens results. The framework gives digital humanities scholars crucial guidelines for selecting and assessing LLMs for historical corpus creation.
The field of digital humanities is increasingly turning to Large Language Models (LLMs) for the challenging task of digitizing historical documents. However, a significant gap persists in how these powerful AI tools are evaluated for Optical Character Recognition (OCR) in historical contexts. Traditional metrics often fall short, failing to capture errors unique to LLMs, such as temporal biases and period-specific inaccuracies, whose detection is crucial for creating reliable historical corpora.
A new research paper, titled “Evaluating LLMs for Historical Document OCR: A Methodological Framework for Digital Humanities,” by Maria Levchenko, addresses this critical need. The study introduces a comprehensive evaluation methodology specifically designed for LLM-based historical OCR, focusing on issues like contamination risks and systematic biases in diplomatic transcription.
The Challenge with Historical Texts
Traditional OCR systems struggle immensely with historical documents due to non-standard typography, evolving orthographic conventions, and often degraded physical condition. Unlike conventional OCR models, whose training data and parameters researchers can control and fine-tune, LLMs are black boxes: their training data cannot be inspected and their internal workings cannot be modified. This necessitates new evaluation approaches that focus on external factors such as prompt engineering and processing modes.
Standard metrics like Character Error Rate (CER) and Word Error Rate (WER) are insufficient because they don’t account for LLM-specific behaviors such as “temporal conflation” – where models incorrectly apply orthographic features from different historical periods – or the insertion of anachronistic elements. Furthermore, the risk of training data contamination, where evaluation texts might have been included in an LLM’s pretraining, undermines traditional benchmarking.
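For reference, CER is conventionally computed as the character-level edit distance between a model’s output and the ground truth, normalized by the length of the ground truth; WER is the same calculation over word tokens. A minimal sketch in Python (not the paper’s code):

```python
from typing import Sequence

def levenshtein(ref: Sequence, hyp: Sequence) -> int:
    """Edit distance between two sequences via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance over reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: the same edit distance over word tokens."""
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / max(len(ref_words), 1)
```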
A Novel Evaluation Framework
The researchers tackled these challenges using 18th-century Russian texts printed in Civil font, a particularly difficult domain due to distinctive orthographic elements (like ‘i’, ‘ѣ’, ‘ъ’ at word endings), archaic grammatical forms, and their underrepresentation in digital corpora. The framework introduces several key innovations:
- Contamination-aware dataset creation protocols to ensure evaluation integrity.
- Novel metrics, **Historical Character Preservation Rate (HCPR)** and **Archaic Insertion Rate (AIR)**, designed to capture LLM-specific behaviors in historical contexts (a count-based sketch follows this list).
- Systematic analysis of different processing modes and prompt engineering strategies.
- Comprehensive stability testing to account for LLM output variability.
- Feature sensitivity analysis to identify document characteristics that impact performance.
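The article does not reproduce the paper’s formal definitions of HCPR and AIR, so the following is a count-based approximation under two assumptions: that HCPR measures the share of period-specific character occurrences in the ground truth that survive in the model output, and that AIR measures anachronistic characters inserted per output character. The character inventories here are illustrative, not the paper’s:

```python
from collections import Counter

# Illustrative inventories; the paper's exact character sets may differ.
HISTORICAL_CHARS = set("ѣъі")       # period-specific for 18th-c. Civil font
ANACHRONISTIC_CHARS = set("ѧѫѡѯѱ")  # Slavonic letters already eliminated

def hcpr(ground_truth: str, prediction: str) -> float:
    """Assumed reading of HCPR: share of historical-character occurrences
    in the ground truth that are preserved in the model output."""
    gt = Counter(c for c in ground_truth if c in HISTORICAL_CHARS)
    pred = Counter(c for c in prediction if c in HISTORICAL_CHARS)
    total = sum(gt.values())
    preserved = sum(min(n, pred[c]) for c, n in gt.items())
    return preserved / total if total else 1.0

def air(prediction: str) -> float:
    """Assumed reading of AIR: anachronistic characters inserted by the
    model, normalized by output length (period ground truth should
    contain none of them)."""
    inserted = sum(1 for c in prediction if c in ANACHRONISTIC_CHARS)
    return inserted / max(len(prediction), 1)

# Example: a model that drops the yat and over-historicizes with 'ѧ'
print(hcpr("человѣкъ", "человекѧ"))  # 0.0 - both ѣ and ъ lost
print(air("человекѧ"))               # 0.125 - one insertion in 8 chars
```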
The study evaluated 12 leading commercial and open-source multimodal LLMs on a new dataset of 1,029 pages from 428 unique 18th-century Russian books. The ground truth for this corpus was meticulously prepared through a multi-stage process involving layout analysis, initial OCR, and 100% manual correction by an expert annotator, adhering to diplomatic transcription principles.
Key Findings and Surprising Behaviors
The evaluation revealed systematic patterns in LLM behavior previously undocumented. While Gemini and Qwen models generally outperformed traditional OCR systems, a striking phenomenon termed “over-historicization” was observed. LLMs systematically inserted archaic Slavonic characters that had already been eliminated from the target historical period. This suggests that LLMs, lacking explicit period awareness, might generalize from a noisy mix of training data, interpreting rare or visually distinctive archaic forms as generic signals for “historical text” regardless of actual period accuracy.
Another counterintuitive finding was that post-OCR correction, where both the image and the OCR text were provided to higher-performing models, often degraded rather than improved performance. Models tended to re-perform OCR from the image rather than applying constrained edits to the provided text. Text-only correction consistently worsened results, with models introducing new errors.
Regarding processing modes, “Full Page Processing” generally yielded the best accuracy for most models, maximizing contextual information. However, for models highly sensitive to document length, a “Line-by-Line” mode could be preferable. Prompt engineering also played a role; context-enhanced Russian prompts led to statistically significant error reductions for several models, though the best models were less dependent on prompt tweaks.
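To make the two modes concrete, here is a hedged sketch of how a benchmarking harness might dispatch between them; `transcribe` and `segment_lines` are hypothetical stand-ins for the multimodal model call and the layout-analysis step, and the Russian prompt is illustrative rather than the study’s exact wording:

```python
from typing import Callable, List

def run_ocr(page_image: bytes,
            transcribe: Callable[[bytes, str], str],
            segment_lines: Callable[[bytes], List[bytes]],
            mode: str = "full_page") -> str:
    # A context-enhanced Russian prompt of the kind the study found
    # helpful for several models (wording is illustrative).
    prompt = ("Транскрибируйте текст со страницы русской книги XVIII века, "
              "сохраняя оригинальную орфографию (ѣ, і, ъ).")
    if mode == "full_page":
        # Best accuracy for most models: one call with maximal context.
        return transcribe(page_image, prompt)
    if mode == "line_by_line":
        # Preferable for models sensitive to document length.
        lines = segment_lines(page_image)
        return "\n".join(transcribe(line, prompt) for line in lines)
    raise ValueError(f"unknown mode: {mode}")
```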
The research also highlighted specific error patterns, such as frequent confusion between visually similar character pairs (e.g., ‘т’ and ‘ш’) and difficulties in preserving period-specific characters like ‘ї’ and ‘ѣ’, or correctly handling the hard sign ‘ъ’.
Implications for Digital Humanities
This methodology provides digital humanities practitioners with crucial guidelines for model selection and quality assessment in historical corpus digitization. The findings suggest that focusing on selecting optimal models for direct OCR is more effective than relying on post-correction pipelines, which often provide no benefit or actively harm accuracy.
The paper acknowledges limitations, including the dataset’s specificity to 18th-century Russian Civil font and the inherent non-determinism of LLM outputs. However, it sets a precedent for ongoing, transparent tracking of model progress in this vital area. For more details, you can read the full research paper here.


