
Unlocking History: How AI Extracts Information from Handwritten Birth Certificates

TL;DR: This study evaluates an AI model (Document Attention Network – DAN) for extracting information from handwritten Uruguayan birth certificates. It compares two data annotation methods: normalized (standardized) and diplomatic (verbatim). Findings show normalized annotation is better for standardizable fields (e.g., dates), while diplomatic annotation excels for non-standardizable fields (e.g., names), suggesting that the optimal annotation strategy depends on the specific data type.

This research explores how artificial intelligence can help extract information from old, handwritten documents, specifically Uruguayan birth certificates. Many historical records around the world are still on paper, and digitizing them is a massive undertaking. This study focuses on making these scanned documents searchable and usable by automatically extracting key details.

The researchers evaluated a system called the Document Attention Network (DAN) for this task. DAN is designed to extract key-value information from documents without needing to manually mark out specific text areas. The interesting part of this study is its focus on two different ways of preparing the data for the AI model: “normalized annotation” and “diplomatic annotation.”
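Models like DAN are trained to emit all of a document's key-value pairs as one output sequence rather than locating individual text regions first. As an illustration only (the tag names below are hypothetical, not the paper's actual token vocabulary), a record's target sequence could be serialized like this:

```python
# Illustrative serialization of a record into a single target sequence,
# as sequence-to-sequence document models like DAN typically expect.
# The XML-like tag names here are hypothetical, not the paper's tokens.

def serialize(record: dict) -> str:
    """Wrap each field value in opening/closing field tags."""
    return "".join(f"<{key}>{value}</{key}>" for key, value in record.items())

record = {"name": "Maria Perez", "birth_date": "31/05/2014"}
print(serialize(record))
# <name>Maria Perez</name><birth_date>31/05/2014</birth_date>
```

The model then learns to produce such a sequence directly from the page image, which is why no bounding-box annotation is needed.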

Normalized annotation involves standardizing the information. For example, a date like “May 31, 2014” might be stored as “31 May 2014” regardless of how it was written. Similarly, names might be abbreviated if that’s how they are typically stored in a database. This approach is convenient because it uses data already available in computer systems, reducing the need for manual transcription. It also aims for an output format that’s immediately useful for databases.
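A minimal sketch of what date normalization might look like, assuming verbatim Spanish forms such as "31 de Mayo de 2014" (the helper function, the month table, and the DD/MM/YYYY target format are my own illustration, not taken from the paper):

```python
import re

# Spanish month names mapped to zero-padded numbers ("setiembre" is a
# common Uruguayan spelling of September).
MONTHS = {
    "enero": "01", "febrero": "02", "marzo": "03", "abril": "04",
    "mayo": "05", "junio": "06", "julio": "07", "agosto": "08",
    "setiembre": "09", "septiembre": "09", "octubre": "10",
    "noviembre": "11", "diciembre": "12",
}

def normalize_date(verbatim: str) -> str:
    """Map a verbatim date such as '31 de Mayo de 2014' to 'DD/MM/YYYY'."""
    match = re.match(r"(\d{1,2}) de (\w+) de (\d{4})", verbatim.lower())
    if not match:
        raise ValueError(f"unrecognized date: {verbatim!r}")
    day, month, year = match.groups()
    return f"{int(day):02d}/{MONTHS[month]}/{year}"

print(normalize_date("31 de Mayo de 2014"))  # 31/05/2014
```

Training directly on targets like the right-hand side means the model's raw output is already database-ready.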

Diplomatic annotation, on the other hand, involves transcribing every character exactly as it appears in the handwritten document, including accents, capitalization, and full names, even if they are abbreviated in a normalized database. This method requires more manual effort to create the training data, as someone has to carefully transcribe each document verbatim.
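The contrast between the two labelings for a name field can be sketched as follows; the specific normalization rules shown (accent stripping and uppercasing) are assumptions for illustration, not the paper's exact conventions:

```python
import unicodedata

def to_normalized(diplomatic: str) -> str:
    """Collapse a verbatim (diplomatic) transcription into a
    database-style normalized form: accents stripped, uppercased."""
    decomposed = unicodedata.normalize("NFD", diplomatic)
    no_accents = "".join(ch for ch in decomposed
                         if not unicodedata.combining(ch))
    return no_accents.upper()

diplomatic = "María José Pérez"   # exactly as written on the page
print(to_normalized(diplomatic))  # MARIA JOSE PEREZ
```

The diplomatic label preserves every accent and capital as written, while the normalized label discards exactly the character-level detail that, per the study, helps the model with names.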

The study used 201 scanned Uruguayan birth certificates, handwritten by over 15 different people. These documents presented challenges like varying margins and different templates over time. The DAN model, originally trained on French handwritten letters, was fine-tuned for this Spanish-language task with a relatively small amount of data.

The findings showed that both annotation strategies yielded good results, comparable to the original DAN paper, even with the language and context change. However, there were clear differences in performance depending on the type of information being extracted.

For fields that can be standardized, such as dates, years of enrollment, jurisdiction, and department, the normalized annotation approach worked better. This suggests that when the desired output is a standardized format, training the model with already normalized data is more efficient and leads to fewer errors. It also saves the effort of post-processing the extracted information.

Conversely, for fields containing names and surnames (such as the enrollee's full name and the parents' names), diplomatic annotation performed significantly better. Names vary widely in handwriting, and the normalized data sometimes contained abbreviations or inconsistencies that the model struggled to learn; in some cases it even "invented" abbreviations where none existed on the page. By training on exact, character-by-character transcriptions, the model became much more accurate at extracting these non-standardizable fields.

The research highlights a crucial insight: the choice of annotation strategy should depend on the nature of the data field. For structured, standardizable information, normalized data is efficient. For variable, non-standardizable information like names, a diplomatic, verbatim transcription is superior.
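In practice, that insight amounts to routing each field to its own annotation strategy. A minimal sketch (the field names and set membership are assumed for illustration, not the paper's schema):

```python
# Hypothetical per-field routing following the study's finding:
# normalized labels for standardizable fields, diplomatic for the rest.
STANDARDIZABLE = {"date", "enrollment_year", "jurisdiction", "department"}

def annotation_strategy(field: str) -> str:
    """Pick the annotation style for a field by whether it standardizes."""
    return "normalized" if field in STANDARDIZABLE else "diplomatic"

print(annotation_strategy("date"))         # normalized
print(annotation_strategy("father_name"))  # diplomatic
```

This is essentially the "hybrid" approach the authors propose as future work.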


Future work suggested by the authors includes exploring a “hybrid” annotation approach, combining normalized data for standardizable fields and diplomatic data for names. They also plan to investigate how much the training data can be reduced without losing accuracy, and how the model generalizes to different document layouts or even specific handwriting styles. This research provides valuable insights for anyone working on digitizing historical handwritten records, emphasizing the importance of tailored annotation strategies for optimal information extraction. You can find more details about this study in the full research paper available at this link.

Karthik Mehta