Enhancing Process Data Quality with Graph Neural Networks for Event Log Repair

TLDR: A new method called SANAGRAPH uses Heterogeneous Graph Neural Networks (HGNNs) to repair missing information in event logs, which are crucial for process analysis. Unlike previous methods that often focus on specific attributes, SANAGRAPH can reconstruct all types of missing event attributes, including entire events, by representing event logs as graphs and leveraging the relationships within the data. It shows strong performance in restoring missing activities and timestamps across various real-world datasets.

In the world of Process Mining, where organizations analyze and improve their operations using data from event logs, the quality and completeness of these logs are paramount. However, real-world event logs often suffer from missing or incomplete information, making accurate analysis a significant challenge. Traditional methods for repairing these logs either rely on a pre-existing process model to infer missing values or use Machine Learning (ML) and Deep Learning (DL) models to learn from complete cases. While these approaches have shown promise, they often struggle to capture the full complexity of how different pieces of information within an event log relate to each other, frequently focusing on only a subset of event attributes.

Introducing SANAGRAPH: A New Approach to Event Log Repair

A new research paper, “Graph-based Event Log Repair”, introduces SANAGRAPH, a novel method that leverages Heterogeneous Graph Neural Networks (HGNNs) to address the problem of missing information in event logs. This approach is particularly innovative because it can reconstruct the full set of attributes missing from events, even when entire events are absent from a trace. The authors of this work are Sebastiano Dissegna and Chiara Di Francescomarino from the University of Trento, and Massimiliano Ronzani from Fondazione Bruno Kessler.

Graph Neural Networks (GNNs) are a type of Deep Learning model specifically designed to handle data structured as graphs, which naturally represent relationships and dependencies. In the context of Process Mining, GNNs offer a powerful framework for modeling execution traces, where events and their attributes can be represented as nodes and edges in a graph. SANAGRAPH takes this a step further by using Heterogeneous GNNs, which allow for different types of nodes for different event attributes, creating a richer and more semantically meaningful representation of the data.
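To make the idea of typed relationships concrete, here is a minimal sketch of one round of message passing in which each relation type has its own weight. It is an illustration only, not the architecture from the paper: the function name `hetero_message_pass`, the scalar features, and the relation names `next` and `same_event` are all assumptions made for this example. Real HGNNs use vector features, learned weight matrices, and nonlinearities.

```python
def hetero_message_pass(features, edges, weights):
    """One simplified round of message passing over typed edges.

    features: {node_id: float}             scalar feature per node (assumption)
    edges:    {relation: [(src, dst), ...]} directed edges grouped by relation
    weights:  {relation: float}             one scalar weight per relation (assumption)
    """
    incoming = {node: 0.0 for node in features}
    for relation, pairs in edges.items():
        sums, counts = {}, {}
        for src, dst in pairs:
            sums[dst] = sums.get(dst, 0.0) + features[src]
            counts[dst] = counts.get(dst, 0) + 1
        # Mean of the neighbors reached through this relation, scaled by its weight.
        for dst in sums:
            incoming[dst] += weights[relation] * sums[dst] / counts[dst]
    # Each node keeps its own feature and adds the aggregated messages.
    return {node: features[node] + incoming[node] for node in features}


# Hypothetical toy graph: an empty activity node "act1" receives context
# from the previous activity and from the timestamp of its own event.
features = {"act0": 1.0, "act1": 0.0, "ts1": 4.0}
edges = {"next": [("act0", "act1")], "same_event": [("ts1", "act1")]}
weights = {"next": 0.5, "same_event": 0.25}

updated = hetero_message_pass(features, edges, weights)
print(updated["act1"])  # 0.0 + 0.5 * 1.0 + 0.25 * 4.0 = 1.5
```

Because the weight depends on the relation, a message from a neighboring event and a message from a sibling attribute are treated differently, which is the basic reason for using a heterogeneous model.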

How SANAGRAPH Works

SANAGRAPH transforms event log traces into heterogeneous graphs. Each attribute of an event (like activity, timestamp, or resource) becomes a node in the graph. Missing values are marked, and the problem of repairing the log becomes a ‘node classification’ task within the graph. The HGNN model allows information to flow across the graph, enriching the ‘empty’ nodes with context from their neighbors. After several layers of processing, linear layers then predict the value of each empty node, whether it is categorical (like an activity name) or numerical (like a timestamp).
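The encoding step described above can be sketched roughly as follows. This is a hedged illustration, not the paper's actual encoding: the function `trace_to_hetero_graph`, the use of `None` as the missing-value marker, and the two edge types (`same_event` and `next`) are assumptions chosen to show the general shape of the idea.

```python
from collections import defaultdict

MISSING = None  # assumed marker for a missing attribute value


def trace_to_hetero_graph(trace, attributes):
    """Encode one trace as a heterogeneous graph.

    Nodes are keyed by (attribute, event_index); the attribute is the node type.
    Edge types (hypothetical, for illustration):
      ("same_event", a, b): links attributes a and b of the same event
      ("next", a, a):       links the same attribute of consecutive events
    Returns the nodes, the typed edges, and the nodes that need repair.
    """
    nodes = {}
    edges = defaultdict(list)
    for i, event in enumerate(trace):
        for attr in attributes:
            nodes[(attr, i)] = event.get(attr, MISSING)
        for a in attributes:
            for b in attributes:
                if a != b:
                    edges[("same_event", a, b)].append((i, i))
        if i + 1 < len(trace):
            for a in attributes:
                edges[("next", a, a)].append((i, i + 1))
    # Nodes whose value is missing are the targets of the repair task.
    targets = [key for key, value in nodes.items() if value is MISSING]
    return nodes, dict(edges), targets


ATTRS = ["activity", "timestamp", "resource"]
trace = [
    {"activity": "Submit", "timestamp": 0.0, "resource": "r1"},
    {"activity": None, "timestamp": 3.5, "resource": "r2"},
    {"activity": "Approve", "timestamp": None, "resource": "r2"},
]

nodes, edges, targets = trace_to_hetero_graph(trace, ATTRS)
print(targets)  # [('activity', 1), ('timestamp', 2)]
```

In this sketch the `targets` list plays the role of the ‘empty’ nodes: a model would predict a label for `('activity', 1)` and a number for `('timestamp', 2)`.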

A key advantage of SANAGRAPH is its ability to reconstruct *all* different event attributes simultaneously. Many existing model-free approaches primarily focus on repairing only a subset of attributes, such as activity labels. SANAGRAPH, by contrast, demonstrates strong performance in reconstructing a complete set of attributes, significantly enhancing data quality for Process Mining applications.

Evaluation and Performance

The researchers evaluated SANAGRAPH against a state-of-the-art autoencoder-based approach on two synthetic and four real-world event logs, including well-known datasets like BPI-2012 and BPI-2013. They tested the model under various scenarios of missing data, created by different ‘masking’ strategies (e.g., removing odd-indexed events, even-indexed events, or random events).
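The masking strategies mentioned above could look something like the sketch below. Several details are assumptions rather than facts from the paper: positions are counted from zero, a masked event keeps its slot in the trace with every attribute blanked to `None`, and the `fraction` and `seed` parameters of the random strategy are hypothetical.

```python
import random


def mask_events(trace, strategy, fraction=0.3, seed=0):
    """Return (masked_trace, masked_indices) for one trace.

    strategy: "odd", "even", or "random" (0-based positions, an assumption).
    A masked event keeps its position but has every attribute set to None.
    """
    n = len(trace)
    if strategy == "odd":
        idx = [i for i in range(n) if i % 2 == 1]
    elif strategy == "even":
        idx = [i for i in range(n) if i % 2 == 0]
    elif strategy == "random":
        rng = random.Random(seed)  # seeded so the mask is reproducible
        k = min(n, max(1, round(fraction * n))) if n else 0
        idx = sorted(rng.sample(range(n), k))
    else:
        raise ValueError(f"unknown masking strategy: {strategy!r}")
    hidden = set(idx)
    masked = [
        {key: None for key in event} if i in hidden else dict(event)
        for i, event in enumerate(trace)
    ]
    return masked, idx


# Hypothetical five-event trace.
trace = [{"activity": f"a{i}", "timestamp": float(i)} for i in range(5)]

masked_odd, idx_odd = mask_events(trace, "odd")
masked_even, idx_even = mask_events(trace, "even")
masked_rand, idx_rand = mask_events(trace, "random", fraction=0.4)
print(idx_odd, idx_even)  # [1, 3] [0, 2, 4]
```

The original values at the masked positions serve as ground truth, so a repair method can be scored by comparing its predictions against them.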

In terms of reconstructing missing activity labels, SANAGRAPH generally outperformed the autoencoder approach across most masking strategies, often by a significant margin. For timestamp reconstruction, the performance of both methods was closer, with each showing strengths on different datasets. Importantly, the study found that using all available event attributes for reconstruction (the ‘FULL’ configuration of SANAGRAPH) did not significantly degrade its performance on activity and timestamp reconstruction compared to using only activity and timestamp information. In fact, it often led to improvements, demonstrating the benefit of leveraging richer contextual information.

The model also showed good performance in reconstructing other categorical event attributes, with the majority achieving an accuracy higher than 80% across the real-world logs.

Future Directions

While SANAGRAPH represents a significant step forward in event log repair, the authors acknowledge certain limitations, such as the impact of the number of convolutional layers on performance, especially when dealing with long sequences of missing data. Future work will explore richer graph encodings and the possibility of providing explanations alongside the reconstructed traces, further enhancing the utility and transparency of the repair process.
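One way to build intuition for why layer count matters with long gaps is a general property of message passing: each layer moves information by one hop. The sketch below is an illustration of that property, not an analysis from the paper. It assumes information travels only along edges between consecutive events, and the function names are hypothetical. It computes how many layers are needed before every missing event can be reached by at least one observed event, which is a necessary condition for repair, not a guarantee of accuracy.

```python
def hops_to_nearest_observed(observed):
    """For each position, hops along the event chain to the nearest observed event.

    observed: list of bools, True where the event is present.
    Returns 0 for observed positions and None if no event is observed at all.
    """
    n = len(observed)
    dist = [None] * n
    last = None
    for i in range(n):  # nearest observed event to the left
        if observed[i]:
            last = i
        if last is not None:
            dist[i] = i - last
    last = None
    for i in range(n - 1, -1, -1):  # nearest observed event to the right
        if observed[i]:
            last = i
        if last is not None and (dist[i] is None or last - i < dist[i]):
            dist[i] = last - i
    return dist


def min_layers_needed(observed):
    """Smallest number of one-hop layers that reaches every missing event."""
    dist = hops_to_nearest_observed(observed)
    missing = [d for d, seen in zip(dist, observed) if not seen]
    if any(d is None for d in missing):
        return None  # nothing observed, so no context can arrive
    return max(missing, default=0)


alternating = [True, False, True, False, True]
long_gap = [True, False, False, False, False, True]
print(min_layers_needed(alternating))  # 1
print(min_layers_needed(long_gap))     # 2
```

Under these assumptions, a run of consecutive missing events pushes the required depth up, while alternating gaps can be covered by a single layer.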

Nikhil Patel
