Enhancing Process Data Quality with Graph Neural Networks for Event Log Repair

TLDR: A new method called SANAGRAPH uses Heterogeneous Graph Neural Networks (HGNNs) to repair missing information in event logs, which are crucial for process analysis. Unlike previous methods that often focus on specific attributes, SANAGRAPH can reconstruct all types of missing event attributes, including entire events, by representing event logs as graphs and leveraging the relationships within the data. It shows strong performance in restoring missing activities and timestamps across various real-world datasets.

In the world of Process Mining, where organizations analyze and improve their operations using data from event logs, the quality and completeness of these logs are paramount. However, real-world event logs often suffer from missing or incomplete information, making accurate analysis a significant challenge. Traditional methods for repairing these logs either rely on a pre-existing process model to infer missing values or use Machine Learning (ML) and Deep Learning (DL) models to learn from complete cases. While these approaches have shown promise, they often struggle to capture the full complexity of how different pieces of information within an event log relate to each other, frequently focusing on only a subset of event attributes.

Introducing SANAGRAPH: A New Approach to Event Log Repair

A new research paper, “Graph-based Event Log Repair”, introduces SANAGRAPH, a novel method that leverages Heterogeneous Graph Neural Networks (HGNNs) to address the problem of missing information in event logs. This approach is particularly innovative because it can reconstruct the full set of attributes missing from events, even when entire events are absent from a trace. The authors of this work are Sebastiano Dissegna and Chiara Di Francescomarino from the University of Trento, and Massimiliano Ronzani from Fondazione Bruno Kessler.

Graph Neural Networks (GNNs) are a type of Deep Learning model specifically designed to handle data structured as graphs, which naturally represent relationships and dependencies. In the context of Process Mining, GNNs offer a powerful framework for modeling execution traces, where events and their attributes can be represented as nodes and edges in a graph. SANAGRAPH takes this a step further by using Heterogeneous GNNs, which allow for different types of nodes for different event attributes, creating a richer and more semantically meaningful representation of the data.

How SANAGRAPH Works

SANAGRAPH transforms event log traces into heterogeneous graphs. Each attribute of an event (like activity, timestamp, or resource) becomes a node in the graph. Missing values are marked, and the problem of repairing the log becomes a ‘node classification’ task within the graph. The HGNN model allows information to flow across the graph, enriching the ‘empty’ nodes with context from their neighbors. After several layers of processing, linear layers map these empty nodes to their predicted values, whether they are categorical (like an activity name) or numerical (like a timestamp).
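To make the graph encoding more concrete, here is a minimal sketch in plain Python of how a single trace might be turned into a heterogeneous graph with one node type per attribute. The article does not spell out the paper's exact encoding, so the edge types (`same_event`, `next`), the use of `None` for missing values, and the function name `trace_to_hetero_graph` are all illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

MISSING = None  # assumption: a missing attribute value is stored as None


def trace_to_hetero_graph(trace, attributes):
    """Encode one trace as a heterogeneous graph held in plain dictionaries.

    nodes[attr]  -> list of values, one node per event, for node type `attr`
    masked[attr] -> list of booleans, True where the value is missing
    edges[(src_type, relation, dst_type)] -> list of (src_idx, dst_idx) pairs
    """
    nodes = {a: [event.get(a, MISSING) for event in trace] for a in attributes}
    masked = {a: [v is MISSING for v in values] for a, values in nodes.items()}
    edges = defaultdict(list)
    for i in range(len(trace)):
        # Hypothetical relation: link attributes that belong to the same event.
        for a in attributes:
            for b in attributes:
                if a != b:
                    edges[(a, "same_event", b)].append((i, i))
        # Hypothetical relation: link each attribute to its successor event.
        if i + 1 < len(trace):
            for a in attributes:
                edges[(a, "next", a)].append((i, i + 1))
    return nodes, masked, dict(edges)


# Toy trace with one missing activity and one missing timestamp.
trace = [
    {"activity": "Submit", "timestamp": 0.0, "resource": "alice"},
    {"activity": None, "timestamp": 3.5, "resource": "bob"},
    {"activity": "Approve", "timestamp": None, "resource": "bob"},
]
nodes, masked, edges = trace_to_hetero_graph(
    trace, ["activity", "timestamp", "resource"]
)
print(masked["activity"])   # which activity nodes are 'empty'
print(masked["timestamp"])  # which timestamp nodes are 'empty'
```

In a sketch like this, the nodes flagged in `masked` are the ones a trained model would be asked to fill in, while the edges define which neighbors can pass them context.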

A key advantage of SANAGRAPH is its ability to reconstruct *all* different event attributes simultaneously. Many existing model-free approaches primarily focus on repairing only a subset of attributes, such as activity labels. SANAGRAPH, by contrast, demonstrates strong performance in reconstructing a complete set of attributes, significantly enhancing data quality for Process Mining applications.

Evaluation and Performance

The researchers evaluated SANAGRAPH against a state-of-the-art autoencoder-based approach on two synthetic and four real-world event logs, including well-known datasets like BPI-2012 and BPI-2013. They tested the model under various scenarios of missing data, created by different ‘masking’ strategies (e.g., removing odd-indexed events, even-indexed events, or random events).
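The masking strategies can be sketched roughly as follows. This is an illustration only: the article does not say whether event positions are counted from zero or one, nor what share of events the random strategy removes, so the zero-based indexing, the `fraction` parameter, and the function name `mask_events` are assumptions.

```python
import random


def mask_events(trace, strategy, fraction=0.3, seed=0):
    """Return (masked_trace, removed_indices) for one trace.

    Removed events are replaced by None so that positions stay aligned
    with the original trace. Indices are zero-based (an assumption).
    """
    n = len(trace)
    if strategy == "odd":
        removed = {i for i in range(n) if i % 2 == 1}
    elif strategy == "even":
        removed = {i for i in range(n) if i % 2 == 0}
    elif strategy == "random":
        rng = random.Random(seed)  # seeded so the mask is reproducible
        k = max(1, round(fraction * n)) if n else 0
        removed = set(rng.sample(range(n), k))
    else:
        raise ValueError(f"unknown masking strategy: {strategy!r}")
    masked = [None if i in removed else event for i, event in enumerate(trace)]
    return masked, sorted(removed)


events = ["A", "B", "C", "D", "E"]
print(mask_events(events, "odd"))
print(mask_events(events, "even"))
print(mask_events(events, "random"))
```

Keeping the removed positions alongside the masked trace makes it possible to score a repair method only on the events it actually had to reconstruct.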

In terms of reconstructing missing activity labels, SANAGRAPH generally outperformed the autoencoder approach across most masking strategies, often by a significant margin. For timestamp reconstruction, the performance of both methods was closer, with each showing strengths on different datasets. Importantly, the study found that using all available event attributes for reconstruction (the ‘FULL’ configuration of SANAGRAPH) did not significantly degrade its performance on activity and timestamp reconstruction compared to using only activity and timestamp information. In fact, it often led to improvements, demonstrating the benefit of leveraging richer contextual information.

The model also showed good performance in reconstructing other categorical event attributes, with the majority achieving an accuracy higher than 80% across the real-world logs.
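For readers who want to see what an accuracy figure of this kind measures, the sketch below scores a reconstruction only on the positions that were masked. The helper `masked_accuracy` and the toy values are hypothetical; the article does not describe the paper's exact evaluation code.

```python
def masked_accuracy(true_values, predicted_values, masked):
    """Share of masked positions where the prediction equals the true value.

    Returns None when no position was masked, since accuracy is undefined.
    """
    pairs = [
        (t, p)
        for t, p, m in zip(true_values, predicted_values, masked)
        if m
    ]
    if not pairs:
        return None
    return sum(t == p for t, p in pairs) / len(pairs)


# Toy data, invented for illustration only.
truth = ["Submit", "Review", "Approve", "Notify", "Close"]
predicted = ["Submit", "Review", "Reject", "Notify", "Close"]
was_masked = [False, True, True, True, False]
acc = masked_accuracy(truth, predicted, was_masked)
print(f"accuracy on masked positions: {acc:.2f}")
```

Restricting the score to masked positions avoids inflating the result with values that were never missing in the first place.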

Future Directions

While SANAGRAPH represents a significant step forward in event log repair, the authors acknowledge certain limitations, such as the impact of the number of convolutional layers on performance, especially when dealing with long sequences of missing data. Future work will explore richer graph encodings and the possibility of providing explanations alongside the reconstructed traces, further enhancing the utility and transparency of the repair process.
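One way to build intuition for the layer-count limitation is to count how many message-passing rounds are needed before a masked event can receive information from an observed one. The sketch below assumes a simplified chain graph in which each event is linked only to its predecessor and successor and each layer propagates information by one hop; the paper's actual graph structure may differ, and `layers_needed` is a hypothetical helper.

```python
def layers_needed(masked):
    """For each position in a chain of events, return the minimum number of
    one-hop message-passing rounds before information from an observed event
    can reach it (0 for observed positions, None if nothing is observed).
    """
    observed = [i for i, is_masked in enumerate(masked) if not is_masked]
    if not observed:
        return [None] * len(masked)
    return [min(abs(i - j) for j in observed) for i in range(len(masked))]


# A run of four consecutive missing events between two observed ones.
print(layers_needed([False, True, True, True, True, False]))
# → [0, 1, 2, 2, 1, 0]
```

Under these assumptions, a model with a single layer would leave the two middle events without any observed context, which is consistent with long gaps being harder to repair.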
