Semantic Analysis of Instant Messages: Aiding Criminal Investigations with Knowledge Graphs and NLP

TLDR: A new system integrates knowledge graphs and NLP models to streamline the analysis of instant messaging data in criminal investigations. It semantically enriches data by modeling it in a knowledge graph, transcribing voice messages, and extracting entities. This allows investigators to perform advanced searches and visualize relationships, significantly reducing manual effort and providing crucial insights while maintaining data traceability and allowing human verification of evidence.

Criminal investigations often face a significant challenge: analyzing the vast amounts of data exchanged through instant messaging applications like WhatsApp. The task is time-consuming and resource-intensive, especially when voice messages are involved. To address this, a new approach integrates knowledge graphs and Natural Language Processing (NLP) models to semantically enrich this digital evidence, making it easier for prosecutors and investigators to search it, visualize it, and draw insights from it.

The Core Approach: Knowledge Graphs and NLP

The proposed solution involves several key steps to transform raw instant messaging data into an organized and searchable format. First, message data and its associated metadata (like participants, timestamps, and message types) are extracted and modeled using a knowledge graph (KG). This graph, stored in a system like Neo4j, represents the relational data and content of the messages, providing a structured overview of communications.
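As a rough illustration of this modeling step, the sketch below turns message records into the nodes and edges of a small in-memory property graph. The node labels (`Person`, `Chat`, `Message`), the relationship names (`SENT`, `IN_CHAT`, `PARTICIPATES_IN`), and the `Message` fields are assumptions made for this example, not the schema used by the system. In a graph database such as Neo4j, the same structure would be stored as labeled nodes and typed relationships.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    """One chat message plus its metadata (hypothetical fields)."""
    msg_id: str
    chat_id: str
    sender: str
    timestamp: str  # ISO 8601 string
    msg_type: str   # e.g. "text" or "audio"
    text: str = ""


@dataclass
class Graph:
    """Tiny in-memory property graph used only for illustration."""
    nodes: dict = field(default_factory=dict)  # node id -> (label, properties)
    edges: set = field(default_factory=set)    # (source id, relation, target id)

    def add_node(self, node_id, label, **props):
        # Merge semantics: re-adding a node updates its properties
        # instead of creating a duplicate.
        _, existing = self.nodes.get(node_id, (label, {}))
        self.nodes[node_id] = (label, {**existing, **props})

    def add_edge(self, src, rel, dst):
        self.edges.add((src, rel, dst))


def add_message(graph, msg):
    """Add a message, its sender, and its chat to the graph."""
    person = f"person:{msg.sender}"
    chat = f"chat:{msg.chat_id}"
    node = f"msg:{msg.msg_id}"
    graph.add_node(person, "Person", name=msg.sender)
    graph.add_node(chat, "Chat")
    graph.add_node(node, "Message", timestamp=msg.timestamp,
                   type=msg.msg_type, text=msg.text)
    graph.add_edge(person, "SENT", node)
    graph.add_edge(node, "IN_CHAT", chat)
    graph.add_edge(person, "PARTICIPATES_IN", chat)


graph = Graph()
add_message(graph, Message("m1", "c1", "alice", "2024-01-05T10:00:00", "text", "See you at 5"))
add_message(graph, Message("m2", "c1", "bob", "2024-01-05T10:01:00", "audio"))
```

Because nodes are keyed by identifier, the shared chat `c1` is stored once and both messages point to it, which is what makes relational questions such as "who wrote in this chat" cheap to answer.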

A crucial component is multimedia enrichment, particularly for voice messages. Using advanced speech-to-text technology, such as the Whisper APIs, audio messages are transcribed into text. This makes their content accessible for search functionalities and further NLP processing. While currently focused on audio, the system is designed for future extensions to include image-to-text and video-to-text enrichment.
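A minimal sketch of how transcripts might be attached to audio messages while keeping a pointer to the original file is shown below. The `AudioMessage` fields, the `enrich_with_transcripts` helper, and the `fake_transcribe` stand-in are hypothetical names introduced for illustration; the speech-to-text call is passed in as a function so the example runs without any model installed.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AudioMessage:
    msg_id: str
    audio_path: str                    # original evidence file, never modified
    transcript: Optional[str] = None   # machine-generated text
    verified: bool = False             # set to True after a human check


def enrich_with_transcripts(messages, transcribe: Callable[[str], str]):
    """Attach a transcript to every audio message that lacks one."""
    for msg in messages:
        if msg.transcript is None:
            msg.transcript = transcribe(msg.audio_path).strip()
    return messages


def fake_transcribe(path: str) -> str:
    """Stand-in for a real speech-to-text call, so the sketch runs offline."""
    return f" transcript of {path} "


voice_notes = [AudioMessage("m2", "chat1/voice_0001.opus")]
enrich_with_transcripts(voice_notes, fake_transcribe)
print(voice_notes[0].transcript)
```

With the open-source `openai-whisper` package installed, a real transcriber could be passed instead, for example `lambda path: model.transcribe(path)["text"]` with `model = whisper.load_model("base")`. Whether the system uses that package or a hosted service is not stated here. Keeping `audio_path` next to the transcript is what lets a reviewer go back to the original recording.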

Following transcription, an end-to-end entity extraction pipeline is applied. This NLP process identifies and annotates entities (such as people, organizations, and locations) within the message content and transcripts. It combines three steps: Named Entity Recognition (NER) to find mentions, Named Entity Linking (NEL) to connect them to existing knowledge (like chat participants or Wikipedia entities), and entity clustering to group mentions referring to the same real-world entity. This enriched information is then used to update the knowledge graph.
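The three steps can be sketched with a deliberately simple toy. Here a small gazetteer stands in for a trained NER model, exact name matching stands in for entity linking, and grouping by normalized surface form stands in for clustering. None of this reflects the models used in the actual pipeline, and the gazetteer entries, the participant list, and the function names are invented for the example.

```python
import re
from collections import defaultdict

# Toy gazetteer standing in for a trained NER model (hypothetical data).
GAZETTEER = {"mario rossi": "PER", "mario": "PER", "acme srl": "ORG", "milan": "LOC"}


def find_mentions(text):
    """NER step: return (surface, label, start) tuples, preferring longer matches."""
    mentions, taken = [], set()
    for name in sorted(GAZETTEER, key=len, reverse=True):
        pattern = r"\b" + re.escape(name) + r"\b"
        for m in re.finditer(pattern, text, re.IGNORECASE):
            span = set(range(m.start(), m.end()))
            if not span & taken:  # skip matches inside an already accepted mention
                taken |= span
                mentions.append((m.group(0), GAZETTEER[name], m.start()))
    return sorted(mentions, key=lambda mention: mention[2])


def link_mention(surface, participants):
    """NEL step: link a mention to a chat participant by full or first name."""
    key = surface.lower()
    for pid, full_name in participants.items():
        name = full_name.lower()
        if key == name or key == name.split()[0]:
            return pid
    return None  # no known entity: left for clustering


def cluster_mentions(mentions, participants):
    """Clustering step: group mentions by linked id, else by normalized surface form."""
    clusters = defaultdict(list)
    for surface, label, _ in mentions:
        linked = link_mention(surface, participants) if label == "PER" else None
        clusters[linked or f"nil:{label}:{surface.lower()}"].append(surface)
    return dict(clusters)


participants = {"p1": "Mario Rossi"}
text = "Mario Rossi met Acme Srl in Milan. Mario paid."
clusters = cluster_mentions(find_mentions(text), participants)
print(clusters)
```

In this toy run, "Mario Rossi" and the later "Mario" end up in the same cluster because both link to participant `p1`, while the organization and the location form their own unlinked clusters. Linking on first names alone would be ambiguous in real chats, which is one reason human verification of annotations matters.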

Accessing and Exploring the Data

To help users interact with this semantically enriched data, two main interfaces have been developed. One is the native Neo4j user interface, which allows investigators to query and visually explore the knowledge graph. This provides a comprehensive view of relationships and connections within the investigative data.

The second interface is DAVE (Document Annotation Validation and Exploration), a web application designed for faceted search. Users can perform traditional keyword searches and then filter results based on specific metadata (e.g., message sender) or the semantic annotations (e.g., people, organizations, or places mentioned). DAVE also supports a “human-in-the-loop” approach, allowing users to edit and correct algorithm-predicted annotations, ensuring the quality and reliability of the extracted knowledge. This is particularly important given the sensitive nature of legal investigations.
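The idea of faceted search can be illustrated with a few lines of code: a keyword search narrows the message set, facets then filter it further, and per-facet counts show which filters are available. This is a generic sketch, not DAVE's implementation; the message fields, sample data, and function names are assumptions made for the example.

```python
from collections import Counter

# Hypothetical annotated messages: metadata plus extracted entities.
MESSAGES = [
    {"id": "m1", "sender": "alice", "text": "Payment goes to Acme tomorrow",
     "organizations": ["Acme"], "places": []},
    {"id": "m2", "sender": "bob", "text": "Bring the payment to Milan",
     "organizations": [], "places": ["Milan"]},
    {"id": "m3", "sender": "alice", "text": "Meet me in Milan",
     "organizations": [], "places": ["Milan"]},
]


def matches(msg, facet, value):
    """A facet matches if the field equals the value or, for lists, contains it."""
    field = msg.get(facet)
    return value in field if isinstance(field, list) else field == value


def search(messages, keyword, **facets):
    """Case-insensitive keyword search followed by facet filtering."""
    hits = [m for m in messages if keyword.lower() in m["text"].lower()]
    return [m for m in hits if all(matches(m, f, v) for f, v in facets.items())]


def facet_counts(hits, facet):
    """Count facet values across hits, as a faceted UI shows next to each filter."""
    counts = Counter()
    for m in hits:
        field = m.get(facet)
        counts.update(field if isinstance(field, list) else [field])
    return counts


hits = search(MESSAGES, "payment")
print([m["id"] for m in hits])
print(facet_counts(hits, "sender"))
print([m["id"] for m in search(MESSAGES, "payment", sender="bob", places="Milan")])
```

Correcting an annotation in a human-in-the-loop setting would amount to editing a message's entity lists, after which the same search and counts reflect the corrected data.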

Real-World Application and Feedback

The approach was developed within a larger Italian national project and has been applied to real investigation data from two cases, one involving fraud and the other corruption. The project has processed a significant volume of data, including thousands of chats and hundreds of thousands of messages and attachments.

Investigators provided positive feedback, finding the graph-based visualizations helpful and easy to understand. They particularly appreciated the combination of text search with relational queries. The speech-to-text technology was seen as potentially transformative, drastically reducing the time and effort required to process audio evidence. While acknowledging occasional transcription mistakes, investigators were not concerned, as the system is designed to allow verification against original audio files. The NLP capabilities, enabling quick browsing and entity-based filtering, were also highly valued compared to previous tools that only supported syntactic pattern matching.

The research paper, available at arXiv:2509.26487, highlights that while the solution shows great promise, further work is needed to refine the NLP pipeline through in-domain fine-tuning and to develop more intuitive conversational search interfaces. The ultimate goal is not to replace human judgment but to give human decision-makers better tools to explore evidence, build stronger arguments, and improve the efficiency of criminal investigations.
