RECAP: A New Hybrid System for Accurate PII Detection in Many Languages

TLDR: RECAP is a novel hybrid framework that combines deterministic regular expressions with context-aware large language models (LLMs) for scalable Personally Identifiable Information (PII) detection. It supports over 300 entity types across 13 low-resource locales without retraining, utilizing a three-phase refinement pipeline for disambiguation and filtering. Benchmarked against fine-tuned NER models and zero-shot LLMs, RECAP significantly outperforms them, achieving an 82% higher weighted F1-score than NER models and 17% higher than zero-shot LLMs, offering a robust solution for privacy compliance in diverse linguistic environments.

In today’s digital age, protecting Personally Identifiable Information (PII) is more crucial than ever. With the explosion of user-generated content, PII often finds its way into vast data repositories, creating significant privacy risks and compliance challenges. Regulations like GDPR, HIPAA, and CCPA underscore the importance of robust PII detection systems. However, existing methods often struggle with the complexities of low-resource languages, where linguistic diversity and limited annotated data pose substantial hurdles.

Introducing RECAP: A Hybrid Framework for Multilingual PII Detection

A new research paper introduces RECAP (REgex and Context-Aware Prompting), a novel hybrid framework designed to overcome these challenges. RECAP combines the precision of deterministic regular expressions with the semantic understanding of context-aware large language models (LLMs) to achieve scalable PII detection across 13 diverse low-resource locales. This innovative approach supports over 300 entity types without requiring any model retraining, making it highly adaptable and efficient.

Addressing Core Challenges in PII Detection

The RECAP framework directly tackles several key issues that plague traditional PII detection systems:

Low-Resource Performance Gap: Many existing systems perform poorly in languages with limited training data and linguistic resources. RECAP’s design mitigates this by not requiring extensive annotated data for each new language.
Scalability Bottleneck: Pure regex methods lack semantic understanding, leading to high false positives. Transformer-based Named Entity Recognition (NER) models have limited PII type coverage, and standalone LLMs can be inconsistent and prone to hallucination. RECAP integrates the strengths of both to provide comprehensive coverage.
Ambiguity and Variation: PII entities can vary greatly in structure and meaning across different locales. RECAP’s multi-phase refinement pipeline is specifically designed to reduce ambiguity and false positives, ensuring higher reliability.

How RECAP Works: A Three-Phase Refinement Pipeline

RECAP’s architecture is modular and locale-aware, meaning each of the 13 supported locales has its own dedicated detector with specific regex patterns and optimized LLM prompts. The core of its effectiveness lies in a three-phase refinement pipeline:
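To make the "one detector per locale" idea concrete, here is a minimal sketch of how such a registry could be laid out. It is not the authors' implementation: the class name LocaleDetector, the registry DETECTORS, the label names, the prompt text, and both regex patterns are assumptions made for illustration. The national ID pattern in particular is a deliberately simplified stand-in that only matches eleven consecutive digits.

```python
import re
from dataclasses import dataclass, field


@dataclass
class LocaleDetector:
    """Hypothetical per-locale bundle: regex patterns plus an LLM prompt."""
    locale: str
    patterns: dict = field(default_factory=dict)  # label -> compiled regex
    prompt: str = ""                              # locale-specific LLM prompt

    def find_structured(self, text):
        """Return (start, end, label) for every regex hit, ordered by position."""
        hits = []
        for label, pattern in self.patterns.items():
            for match in pattern.finditer(text):
                hits.append((match.start(), match.end(), label))
        return sorted(hits)


# Illustrative registry with one entry; a real system would hold one per locale.
DETECTORS = {
    "PL_PL": LocaleDetector(
        locale="PL_PL",
        patterns={
            # Simplified assumption: eleven digits, no checksum validation.
            "NATIONAL_ID": re.compile(r"\b\d{11}\b"),
            # Simplified assumption: does not check that octets are <= 255.
            "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
        },
        prompt="Find names, addresses, usernames and passwords in this Polish text.",
    ),
}


def detector_for(locale):
    """Look up the detector for a locale code; raises KeyError if unsupported."""
    return DETECTORS[locale]
```

Under this layout, adding a locale means adding a registry entry with new patterns and a new prompt, which is consistent with the article's point that no model retraining is needed.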

Baseline Hybrid Detection: This initial phase uses regular expressions to identify structured PII (such as national IDs or IP addresses) and a zero-shot LLM (GPT-4o) with a carefully engineered prompt to detect unstructured PII (such as names, addresses, usernames, and passwords). This provides broad initial coverage but can produce multiple labels for a single entity, overlapping spans, and contextual false positives.
Context-based Multi-label Resolution: To address ambiguity where a single entity might receive multiple labels, this phase leverages the LLM’s semantic understanding. It analyzes the surrounding context of multi-labeled entities to select the single most appropriate label, significantly boosting precision and recall.
Ambiguity Resolution and Entity Consolidation: The final phase applies two targeted filters. First, it resolves entity span overlaps by prioritizing longer, more specific spans. Second, it employs contextual false positive filtering for short numeric entities (like age or CVV) by using the LLM to verify their semantic plausibility within the local text window. This drastically reduces incorrect detections and improves overall precision.
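The second and third phases can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's code: the Span type, the function names, and the label names are hypothetical, and keyword_chooser is a trivial keyword heuristic standing in for the LLM call that RECAP actually uses to pick a label from context. The LLM-based plausibility check for short numeric entities is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    label: str

    @property
    def length(self):
        return self.end - self.start


def resolve_multi_label(spans, text, choose_label):
    """Phase 2 sketch: keep exactly one label per identical (start, end) range."""
    labels_by_range = {}
    for span in spans:
        labels_by_range.setdefault((span.start, span.end), []).append(span.label)
    resolved = []
    for (start, end), labels in labels_by_range.items():
        if len(labels) == 1:
            label = labels[0]
        else:
            # In RECAP this decision is made by the LLM using surrounding context.
            label = choose_label(text, start, end, labels)
        resolved.append(Span(start, end, label))
    return resolved


def resolve_overlaps(spans):
    """Phase 3 sketch: visit longer spans first, drop any span that overlaps a kept one."""
    kept = []
    for span in sorted(spans, key=lambda s: (-s.length, s.start)):
        if all(span.end <= k.start or span.start >= k.end for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s.start)


def keyword_chooser(text, start, end, labels):
    """Stand-in for the LLM (assumption): pick PHONE if 'call' appears nearby."""
    window = text[max(0, start - 20):end + 20].lower()
    if "call" in window and "PHONE" in labels:
        return "PHONE"
    return labels[0]
```

For the text "Call 5551234567 from 10.0.0.1", a ten-digit span labeled both PHONE and BANK_ACCOUNT would be reduced to PHONE by the stand-in chooser, and a two-character AGE span covering "10" would be dropped in favor of the longer IP_ADDRESS span that contains it.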

Impressive Benchmark Results

The researchers benchmarked RECAP against two strong baselines: transformer-based NER models and zero-shot LLMs. The evaluation dataset consists of synthetic PII across six domains and varied text lengths, and span-level accuracy was scored with the nervaluate library. RECAP outperformed both baselines in most locales. For instance, in Polish (PL_PL), RECAP achieved a 130.77% relative improvement over the NER baseline and a 22.45% improvement over the zero-shot LLM in weighted F1-score. Notably, RECAP achieved a weighted recall of 0.605, significantly higher than 0.362 for NER and 0.437 for zero-shot LLMs. This highlights its strength in minimizing missed detections, a critical aspect for privacy compliance.
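The percentage gains quoted above are relative improvements, that is, the difference between two scores divided by the baseline score. As a quick worked example, the snippet below applies that formula to the three weighted recall values reported in the article. The function name relative_improvement is an illustrative choice, and the resulting percentages are derived here from the reported recalls rather than quoted from the paper.

```python
def relative_improvement(new_score, baseline_score):
    """Relative gain of new_score over baseline_score, as a percentage."""
    if baseline_score == 0:
        raise ValueError("baseline score must be non-zero")
    return (new_score - baseline_score) / baseline_score * 100.0


# Weighted recall values as reported in the article.
recap_recall = 0.605
ner_recall = 0.362
zero_shot_recall = 0.437

print(f"vs NER:       {relative_improvement(recap_recall, ner_recall):.2f}%")        # about 67.13%
print(f"vs zero-shot: {relative_improvement(recap_recall, zero_shot_recall):.2f}%")  # about 38.44%
```

The same formula explains how a relative improvement can exceed 100%: a score more than double its baseline yields a gain above 100%, as with the 130.77% figure reported for Polish.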

This work offers a scalable and adaptable solution for efficient PII detection in compliance-focused applications, particularly in challenging low-resource linguistic environments. For more detail, see the full research paper.

Limitations and Future Directions

While RECAP shows significant promise, the authors acknowledge certain limitations, such as the reliance on a single LLM (GPT-4o) and the use of synthetic benchmark data. Future work aims to explore automatic prompt optimization, perturbation-based evaluation for robustness, and the application of knowledge distillation for on-device inference.
