
Unlocking History: How AI Extracts Information from Handwritten Birth Certificates

TL;DR: This study evaluates an AI model (Document Attention Network – DAN) for extracting information from handwritten Uruguayan birth certificates. It compares two data annotation methods: normalized (standardized) and diplomatic (verbatim). Findings show normalized annotation is better for standardizable fields (e.g., dates), while diplomatic annotation excels for non-standardizable fields (e.g., names), suggesting that the optimal annotation strategy depends on the specific data type.

This research explores how artificial intelligence can help extract information from old, handwritten documents, specifically Uruguayan birth certificates. Many historical records around the world are still on paper, and digitizing them is a massive undertaking. This study focuses on making these scanned documents searchable and usable by automatically extracting key details.

The researchers evaluated a system called the Document Attention Network (DAN) for this task. DAN is designed to extract key-value information from documents without needing to manually mark out specific text areas. The interesting part of this study is its focus on two different ways of preparing the data for the AI model: “normalized annotation” and “diplomatic annotation.”
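Models like DAN are trained to emit all of a document's key-value pairs as one output sequence rather than locating individual text regions first. As an illustration only (the tag names below are hypothetical, not the paper's actual token vocabulary), a record's target sequence could be serialized like this:

```python
# Illustrative serialization of a record into a single target sequence,
# as sequence-to-sequence document models like DAN typically expect.
# The XML-like tag names here are hypothetical, not the paper's tokens.

def serialize(record: dict) -> str:
    """Wrap each field value in opening/closing field tags."""
    return "".join(f"<{key}>{value}</{key}>" for key, value in record.items())

record = {"name": "Maria Perez", "birth_date": "31/05/2014"}
print(serialize(record))
# <name>Maria Perez</name><birth_date>31/05/2014</birth_date>
```

The model then learns to produce such a sequence directly from the page image, which is why no bounding-box annotation is needed.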

Normalized annotation involves standardizing the information. For example, a date like “May 31, 2014” might be stored as “31 May 2014” regardless of how it was written. Similarly, names might be abbreviated if that’s how they are typically stored in a database. This approach is convenient because it uses data already available in computer systems, reducing the need for manual transcription. It also aims for an output format that’s immediately useful for databases.
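A minimal sketch of what date normalization might look like, assuming verbatim Spanish forms such as "31 de Mayo de 2014" (the helper function, the month table, and the DD/MM/YYYY target format are my own illustration, not taken from the paper):

```python
import re

# Spanish month names mapped to zero-padded numbers ("setiembre" is a
# common Uruguayan spelling of September).
MONTHS = {
    "enero": "01", "febrero": "02", "marzo": "03", "abril": "04",
    "mayo": "05", "junio": "06", "julio": "07", "agosto": "08",
    "setiembre": "09", "septiembre": "09", "octubre": "10",
    "noviembre": "11", "diciembre": "12",
}

def normalize_date(verbatim: str) -> str:
    """Map a verbatim date such as '31 de Mayo de 2014' to 'DD/MM/YYYY'."""
    match = re.match(r"(\d{1,2}) de (\w+) de (\d{4})", verbatim.lower())
    if not match:
        raise ValueError(f"unrecognized date: {verbatim!r}")
    day, month, year = match.groups()
    return f"{int(day):02d}/{MONTHS[month]}/{year}"

print(normalize_date("31 de Mayo de 2014"))  # 31/05/2014
```

Training directly on targets like the right-hand side means the model's raw output is already database-ready.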

Diplomatic annotation, on the other hand, involves transcribing every character exactly as it appears in the handwritten document, including accents, capitalization, and full names, even if they are abbreviated in a normalized database. This method requires more manual effort to create the training data, as someone has to carefully transcribe each document verbatim.
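The contrast between the two labelings for a name field can be sketched as follows; the specific normalization rules shown (accent stripping and uppercasing) are assumptions for illustration, not the paper's exact conventions:

```python
import unicodedata

def to_normalized(diplomatic: str) -> str:
    """Collapse a verbatim (diplomatic) transcription into a
    database-style normalized form: accents stripped, uppercased."""
    decomposed = unicodedata.normalize("NFD", diplomatic)
    no_accents = "".join(ch for ch in decomposed
                         if not unicodedata.combining(ch))
    return no_accents.upper()

diplomatic = "María José Pérez"   # exactly as written on the page
print(to_normalized(diplomatic))  # MARIA JOSE PEREZ
```

The diplomatic label preserves every accent and capital as written, while the normalized label discards exactly the character-level detail that, per the study, helps the model with names.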

The study used 201 scanned Uruguayan birth certificates, handwritten by over 15 different people. These documents presented challenges like varying margins and different templates over time. The DAN model, originally trained on French handwritten letters, was fine-tuned for this Spanish-language task with a relatively small amount of data.

The findings showed that both annotation strategies yielded good results, comparable to the original DAN paper, even with the language and context change. However, there were clear differences in performance depending on the type of information being extracted.

For fields that can be standardized, such as dates, years of enrollment, jurisdiction, and department, the normalized annotation approach worked better. This suggests that when the desired output is a standardized format, training the model with already normalized data is more efficient and leads to fewer errors. It also saves the effort of post-processing the extracted information.

Conversely, for fields containing names and surnames (such as the enrollee's full name and the parents' names), diplomatic annotation performed significantly better. Names vary widely in handwriting, and the normalized data sometimes contained abbreviations or inconsistencies that the model struggled to learn; in some cases it even "invented" abbreviations where none existed on the page. By training on exact, character-by-character transcriptions, the model became much more accurate at extracting these non-standardizable fields.

The research highlights a crucial insight: the choice of annotation strategy should depend on the nature of the data field. For structured, standardizable information, normalized data is efficient. For variable, non-standardizable information like names, a diplomatic, verbatim transcription is superior.
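In practice, that insight amounts to routing each field to its own annotation strategy. A minimal sketch (the field names and set membership are assumed for illustration, not the paper's schema):

```python
# Hypothetical per-field routing following the study's finding:
# normalized labels for standardizable fields, diplomatic for the rest.
STANDARDIZABLE = {"date", "enrollment_year", "jurisdiction", "department"}

def annotation_strategy(field: str) -> str:
    """Pick the annotation style for a field by whether it standardizes."""
    return "normalized" if field in STANDARDIZABLE else "diplomatic"

print(annotation_strategy("date"))         # normalized
print(annotation_strategy("father_name"))  # diplomatic
```

This is essentially the "hybrid" approach the authors propose as future work.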


Future work suggested by the authors includes exploring a “hybrid” annotation approach, combining normalized data for standardizable fields and diplomatic data for names. They also plan to investigate how much the training data can be reduced without losing accuracy, and how the model generalizes to different document layouts or even specific handwriting styles. This research provides valuable insights for anyone working on digitizing historical handwritten records, emphasizing the importance of tailored annotation strategies for optimal information extraction. You can find more details about this study in the full research paper available at this link.

Karthik Mehta