
AI Breakthrough: Automating the Extraction of Criminal Facts from Court Opinions

TL;DR: A new research paper details how advanced regular expressions and Large Language Models (LLMs), particularly Gemini Flash 2.0, can effectively extract detailed descriptions of criminal behavior from Slovak court verdicts. This innovative approach significantly outperforms traditional methods, achieving up to 99.5% accuracy in identifying factual statements and closely matching human annotations, offering a scalable solution for legal data analysis.

In the realm of criminal justice, detailed information about offenses is often scarce in administrative datasets. These records typically only note the relevant section of the penal code, providing a general definition but little insight into the specific behaviors involved. However, a rich, untapped source of information exists within the textual descriptions of criminal behaviors found in court verdicts, particularly in continental European countries.

A recent research paper, titled “What Are the Facts? Automated Extraction of Court-Established Facts from Criminal-Court Opinions,” explores the feasibility of automatically extracting these crucial descriptions from publicly available court decisions. Authored by Klára Bendová, Tomáš Knap, Jan Černý, Vojtěch Pour, Jaromir Savelka, Ivana Kvapilíková, and Jakub Drápal, this study highlights a significant step forward in leveraging technology to enhance empirical legal research.

The challenge lies in the nature of these legal documents. They are often loosely structured and can contain noisy data, making traditional data extraction difficult. The researchers focused on Slovak court verdicts, which are typical of the continental Germanic legal culture. These verdicts are consistently divided into a dispositive part (stating decisions) and reasoning. Crucially, they also employ a unique typographic convention called “sparing,” where letters of a word are spaced out (e.g., L I K E T H I S) to introduce new sections, a feature that proved robust during format conversions.

The study explored three main approaches for extracting factual statements:

Baseline Regular Expressions

Initially, a simple method using regular expressions was employed to identify typical starting and ending phrases of factual statements. This baseline, however, proved insufficient, successfully identifying descriptions in only 40.5% of the verdicts in the test set.

Advanced Regular Expressions

Building on the baseline, more flexible patterns were developed to account for variations like extra spaces and line breaks. This advanced approach specifically leveraged the “sparing” convention, automatically extracting and grouping these expressions to identify openers and closers of factual sentences. This significantly improved performance, achieving a 97% success rate in identifying descriptions.

Large Language Models (LLMs)

Recognizing the limitations of rule-based approaches, especially with the varied nature of language, the researchers turned to Large Language Models. They utilized the Gemini Flash 2.0 model, carefully crafting a prompt that included concrete examples, a clear definition of the expected output, and explicit indicators of where factual statements typically begin and end. A critical breakthrough was incorporating specific textual markers (like those used by the regular expression methods) into the prompt, guiding the model towards focused pattern recognition. To combat occasional text hallucination, a post-processing step was implemented to align the model’s output with the original text, ensuring fidelity. The LLM approach achieved an impressive 98.75% success rate.


Combined Approach

The most effective strategy integrated both advanced regular expressions and the LLM. In this hybrid methodology, advanced regular expressions first attempt to extract factual statements. If this fails, the LLM-based extraction is then applied. This combination maximized efficiency and accuracy, reaching an overall extraction rate of 99.5%.
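The fallback logic of the hybrid pipeline is simple to sketch. The two extractor functions here are placeholders for the advanced-regex and LLM stages described above; each is assumed to return the factual statement or `None` on failure:

```python
from typing import Callable, Optional

Extractor = Callable[[str], Optional[str]]

def extract_facts(verdict: str, regex_extract: Extractor, llm_extract: Extractor) -> Optional[str]:
    """Hybrid pipeline: try the cheap regex extractor first,
    fall back to the LLM-based extractor only when it fails."""
    facts = regex_extract(verdict)
    if facts is not None:
        return facts
    return llm_extract(verdict)
```

Running the regexes first keeps LLM calls (and their cost) limited to the minority of verdicts the rules cannot handle.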

When evaluating the quality of the extracted sentences, law students found that both advanced methods closely matched human annotations. The LLM fully matched human-labeled descriptions in 91.75% of instances, while advanced regular expressions achieved 89.5% (allowing for minor character-level variations). The combined approach achieved 92% accuracy.

The research also delved into challenging cases, where the LLM still performed high-quality extraction in 84% of instances, even when factual statements used less common grammatical structures or formats. This highlights the need for further prompt refinement to generalize the description of a factual sentence.

This work demonstrates a scalable solution for extracting critical information from legal documents without the need for extensive, costly manual annotations. By combining rule-based heuristics with few-shot prompting of an LLM, the system is adaptable for low-resource legal settings. Future work aims to expand the dataset to include court decisions from additional countries, providing valuable empirical data for comparative criminal law research. You can read the full paper here.

Ananya Rao
