
Semantic Analysis of Instant Messages: Aiding Criminal Investigations with Knowledge Graphs and NLP

TLDR: A new system integrates knowledge graphs and NLP models to streamline the analysis of instant messaging data in criminal investigations. It semantically enriches data by modeling it in a knowledge graph, transcribing voice messages, and extracting entities. This allows investigators to perform advanced searches and visualize relationships, significantly reducing manual effort and providing crucial insights while maintaining data traceability and allowing human verification of evidence.

Criminal investigations often face a significant challenge: analyzing the vast amounts of data exchanged through instant messaging applications like WhatsApp. This task is incredibly time-consuming and resource-intensive, especially when dealing with voice messages. To address this, a new approach integrates knowledge graphs and Natural Language Processing (NLP) models to semantically enrich this digital evidence, making it easier for prosecutors and investigators to search, visualize, and gain valuable insights.

The Core Approach: Knowledge Graphs and NLP

The proposed solution involves several key steps to transform raw instant messaging data into an organized and searchable format. First, message data and its associated metadata (like participants, timestamps, and message types) are extracted and modeled using a knowledge graph (KG). This graph, stored in a system like Neo4j, represents the relational data and content of the messages, providing a structured overview of communications.
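
As a rough illustration of the modeling step, the sketch below turns message records into nodes and typed edges. It uses a small in-memory graph rather than Neo4j, and the node labels, relationship names (SENT, IN_CHAT), and record fields are assumptions made for this example, not the schema used by the system.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Tiny in-memory property graph: nodes keyed by id, edges as triples."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node_id, label, **props):
        # Merge semantics: re-adding an existing node only updates its properties.
        node = self.nodes.setdefault(node_id, {"label": label})
        node.update(props)

    def add_edge(self, source, rel, target):
        self.edges.append((source, rel, target))


def build_graph(messages):
    """Model each message, its sender, and its chat as nodes linked by typed edges."""
    graph = Graph()
    for msg in messages:
        # Assumes participant, chat, and message ids never collide.
        graph.add_node(msg["sender"], "Participant")
        graph.add_node(msg["chat"], "Chat")
        graph.add_node(msg["id"], "Message", type=msg["type"],
                       timestamp=msg["timestamp"], text=msg.get("text"))
        graph.add_edge(msg["sender"], "SENT", msg["id"])
        graph.add_edge(msg["id"], "IN_CHAT", msg["chat"])
    return graph


# Hypothetical records standing in for extracted chat data.
messages = [
    {"id": "m1", "chat": "c1", "sender": "alice", "type": "text",
     "timestamp": "2024-01-05T10:00:00", "text": "Meet at the office"},
    {"id": "m2", "chat": "c1", "sender": "bob", "type": "audio",
     "timestamp": "2024-01-05T10:02:00"},
]
graph = build_graph(messages)
print(len(graph.nodes), "nodes,", len(graph.edges), "edges")
```

In a graph database the same idea would typically be expressed as merge-style writes, so that a participant who appears in many chats is stored once and connected to all of their messages.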

A crucial component is multimedia enrichment, particularly for voice messages. Using advanced speech-to-text technology, such as the Whisper APIs, audio messages are transcribed into text. This makes their content accessible for search functionalities and further NLP processing. While currently focused on audio, the system is designed for future extensions to include image-to-text and video-to-text enrichment.
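
The enrichment step can be sketched as follows. The transcriber is passed in as a plain function, so the example runs without any speech-to-text service; fake_transcribe, the record fields, and the file path are all hypothetical. In practice the callable would presumably wrap a Whisper call.

```python
def enrich_voice_messages(messages, transcribe):
    """Attach a transcript to every audio message and keep a pointer to the audio."""
    enriched = []
    for msg in messages:
        msg = dict(msg)  # copy so the extracted records stay untouched
        if msg.get("type") == "audio" and msg.get("audio_path"):
            msg["transcript"] = transcribe(msg["audio_path"])
            # Traceability: record which file the transcript was derived from.
            msg["transcript_source"] = msg["audio_path"]
        enriched.append(msg)
    return enriched


def fake_transcribe(audio_path):
    # Stand-in for a real speech-to-text call; returns canned text per file.
    canned = {"audio/m2.opus": "I will bring the documents tomorrow"}
    return canned.get(audio_path, "")


# Hypothetical records: one text message and one voice message.
chat = [
    {"id": "m1", "type": "text", "text": "Meet at the office"},
    {"id": "m2", "type": "audio", "audio_path": "audio/m2.opus"},
]
enriched = enrich_voice_messages(chat, fake_transcribe)
print(enriched[1]["transcript"])
```

Keeping the source path next to the transcript is one simple way to let a reviewer check any transcribed passage against the original recording.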

Following transcription, an end-to-end entity extraction pipeline is applied. This NLP process identifies and annotates entities—such as people, organizations, and locations—within the message content and transcripts. It combines Named Entity Recognition (NER) to find mentions, Named Entity Linking (NEL) to connect them to existing knowledge (like chat participants or Wikipedia entities), and entity clustering to group mentions referring to the same real-world entity. This enriched information is then used to update the knowledge graph.
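
The three stages can be illustrated with a deliberately simple toy. Here a hand-written gazetteer stands in for a trained NER model, linking only considers chat participants (not Wikipedia), and clustering groups mentions by their linked entity or, failing that, by surface form. All names, identifiers, and the matching rules are assumptions for illustration and are far cruder than a real pipeline.

```python
import re
from collections import defaultdict

# Hypothetical gazetteer standing in for a trained NER model.
GAZETTEER = {
    "mario rossi": "PERSON",
    "rossi": "PERSON",
    "acme": "ORGANIZATION",
    "milan": "LOCATION",
}


def recognize(text):
    """NER step: find gazetteer matches, preferring longer names over shorter ones."""
    taken, mentions = [], []
    for name in sorted(GAZETTEER, key=len, reverse=True):
        pattern = r"\b" + re.escape(name) + r"\b"
        for m in re.finditer(pattern, text, re.IGNORECASE):
            # Skip matches that overlap a span already claimed by a longer name.
            if not any(m.start() < end and start < m.end() for start, end in taken):
                taken.append((m.start(), m.end()))
                mentions.append((m.start(), m.group(0), GAZETTEER[name]))
    return [(surface, etype) for _, surface, etype in sorted(mentions)]


def link(surface, participants):
    """NEL step: match a mention to a known participant by full name or surname."""
    key = surface.lower()
    for name, participant_id in participants.items():
        if key == name.lower() or key == name.lower().split()[-1]:
            return participant_id
    return None  # no known entity found


def cluster(mentions, participants):
    """Clustering step: group mentions that resolve to the same entity."""
    clusters = defaultdict(list)
    for surface, etype in mentions:
        entity = link(surface, participants) or f"nil:{etype}:{surface.lower()}"
        clusters[entity].append(surface)
    return dict(clusters)


participants = {"Mario Rossi": "participant:p1"}  # hypothetical chat participant
text = "Mario Rossi met Acme in Milan. Rossi called later."
mentions = recognize(text)
clusters = cluster(mentions, participants)
print(clusters)
```

The point of the final grouping is that "Mario Rossi" and "Rossi" end up under one entity, which is what lets the knowledge graph attach both mentions to the same participant node.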

Accessing and Exploring the Data

To help users interact with this semantically enriched data, two main interfaces have been developed. One is the native Neo4j user interface, which allows investigators to query and visually explore the knowledge graph. This provides a comprehensive view of relationships and connections within the investigative data.

The second interface is DAVE (Document Annotation Validation and Exploration), a web application designed for faceted search. Users can perform traditional keyword searches and then filter results based on specific metadata (e.g., message sender) or the semantic annotations (e.g., people, organizations, or places mentioned). DAVE also supports a “human-in-the-loop” approach, allowing users to edit and correct algorithm-predicted annotations, ensuring the quality and reliability of the extracted knowledge. This is particularly important given the sensitive nature of legal investigations.
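
The search flow described above can be sketched in a few lines: a keyword match, followed by facet filters over metadata and annotations, plus a correction function for the human-in-the-loop step. This is not DAVE's implementation; the document fields, facet names, and the "reviewed" flag are assumptions for this example.

```python
from collections import Counter


def as_list(value):
    """Normalize a facet value so scalar and list facets are handled alike."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def search(docs, keyword, **facets):
    """Keyword match on text, then keep only documents matching every facet filter."""
    hits = [d for d in docs if keyword.lower() in d["text"].lower()]
    for facet, wanted in facets.items():
        hits = [d for d in hits if wanted in as_list(d.get(facet))]
    return hits


def facet_counts(hits, facet):
    """Count facet values across the current results, as shown in a filter sidebar."""
    counts = Counter()
    for d in hits:
        counts.update(as_list(d.get(facet)))
    return counts


def correct_annotation(doc, facet, old, new):
    """Human-in-the-loop edit: replace a predicted annotation and flag it as reviewed."""
    doc[facet] = [new if v == old else v for v in as_list(doc.get(facet))]
    doc.setdefault("reviewed", set()).add(facet)


# Hypothetical annotated messages.
docs = [
    {"id": "m1", "sender": "alice", "text": "Payment sent to the office in Milan",
     "people": [], "places": ["Milan"]},
    {"id": "m2", "sender": "bob", "text": "Payment received from Rossi",
     "people": ["M. Rossi"], "places": []},
    {"id": "m3", "sender": "alice", "text": "See you in Milan",
     "people": [], "places": ["Milan"]},
]

# An investigator corrects a predicted annotation before searching on it.
correct_annotation(docs[1], "people", "M. Rossi", "Mario Rossi")

by_keyword = search(docs, "payment")
by_sender = search(docs, "payment", sender="alice")
by_person = search(docs, "payment", people="Mario Rossi")
print([d["id"] for d in by_keyword], [d["id"] for d in by_sender], [d["id"] for d in by_person])
```

Because corrections are written back to the same annotations that drive the filters, a fix made by a reviewer immediately changes what later searches return.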


Real-World Application and Feedback

The approach was developed within a larger national Italian project and has been applied to real investigation data from two cases, one involving fraud and the other corruption. The project has processed a significant volume of data, including thousands of chats and hundreds of thousands of messages and attachments.

Investigators provided positive feedback, finding the graph-based visualizations helpful and easy to understand. They particularly appreciated the combination of text search with relational queries. The speech-to-text technology was seen as potentially transformative, drastically reducing the time and effort required to process audio evidence. While acknowledging occasional transcription mistakes, investigators were not concerned, as the system is designed to allow verification against original audio files. The NLP capabilities, enabling quick browsing and entity-based filtering, were also highly valued compared to previous tools that only supported syntactic pattern matching.

The research paper, available at arXiv:2509.26487, notes that although the solution shows great promise, further work is needed to refine the NLP pipeline through in-domain fine-tuning and to develop more intuitive conversational search interfaces. The ultimate goal is not to replace human judgment but to give human decision-makers better tools to explore evidence, build stronger arguments, and improve the efficiency of criminal investigations.

Dev Sundaram
https://blogs.edgentiq.com
Dev Sundaram is an investigative tech journalist with a nose for exclusives and leaks. With stints in cybersecurity and enterprise AI reporting, Dev thrives on breaking big stories (product launches, funding rounds, regulatory shifts) and giving them context. He believes journalism should push the AI industry toward transparency and accountability, especially as Generative AI becomes mainstream. You can reach him at: [email protected]
