TLDR: A new research paper introduces a pipeline that improves privacy-aware process discovery by combining event data partitioning with anonymization. By segmenting event logs into sub-logs and anonymizing them separately, the approach significantly reduces the noise needed for privacy protection, leading to better quality process models, especially for directly-follows-based anonymization techniques. The study shows that partitioning *before* anonymization is generally more effective, though it introduces trade-offs like increased complexity and potential minor information loss.
Information systems are the backbone of modern business operations, recording every step of a process in what are known as event logs. These logs are invaluable for “process mining,” a field that extracts insights to improve efficiency and understanding of how work flows. However, these same event logs often contain highly sensitive information about customers, patients, and employees, posing significant privacy challenges.
Traditional methods of protecting this sensitive data involve “anonymization,” which obscures individual details while still permitting useful analysis. The challenge lies in striking a delicate balance: the more complex an event log, the harder it is to anonymize without destroying its analytical value, a property known as “utility.”
A new research paper, titled “The Impact of Event Data Partitioning on Privacy-aware Process Discovery,” proposes an innovative solution to this dilemma. The authors, Jungeun Lim, Stephan A. Fahrenkrog-Petersen, Xixi Lu, Jan Mendling, and Minseok Song, introduce a pipeline that combines anonymization with “event data partitioning.” This approach leverages “event abstraction” to segment large, complex event logs into smaller, more manageable sub-logs. Each of these sub-logs can then be anonymized separately, a method designed to preserve privacy while significantly reducing the loss of utility.
The Core Idea: Reducing Noise Through Segmentation
The fundamental hypothesis behind this work is that by breaking down an event log into its constituent sub-processes, the amount of “noise” required for anonymization can be drastically reduced. Noise is often inserted into event logs to provide “differential privacy,” a strong privacy guarantee. When noise is applied to an entire, unstructured log, it can introduce unrealistic behaviors, severely impacting the quality of the discovered process models. However, if the process is first understood in terms of its sub-processes (e.g., Sub-process A, Sub-process B), noise can be applied more precisely to each sub-process independently. This targeted approach minimizes the introduction of irrelevant or misleading data, thereby enhancing the utility of the anonymized log for process discovery.
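To make the noise-injection idea concrete, here is a minimal sketch of the kind of mechanism involved: Laplace noise added to directly-follows counts to satisfy differential privacy. The function names, the epsilon value, and the toy counts are illustrative assumptions, not the paper's implementation.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample from a Laplace(0, scale) distribution via the inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def anonymize_df_counts(df_counts: dict, epsilon: float) -> dict:
    """Add Laplace noise (scale 1/epsilon) to each directly-follows count,
    rounding and clamping negatives to zero so counts stay valid."""
    return {
        pair: max(0, round(count + laplace_noise(1.0 / epsilon)))
        for pair, count in df_counts.items()
    }

# Toy directly-follows counts for a whole log. When noise is applied to
# every activity pair of an unpartitioned log, rare pairs (counts 4-5
# below) are easily drowned out or spurious pairs appear; applying the
# same mechanism per sub-log touches far fewer pairs at a time.
full_log = {("a", "b"): 40, ("b", "c"): 38, ("c", "d"): 5, ("d", "e"): 4}
noisy = anonymize_df_counts(full_log, epsilon=1.0)
```

The clamping step is one reason noise hurts model quality: it biases small counts, so the fewer pairs each mechanism sees, the less distortion accumulates.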
How the Pipeline Works
The proposed pipeline involves two main steps: first, event data partitioning, and then anonymization. Event abstraction is key to partitioning, where low-level activities are mapped to higher-level activities, effectively identifying sub-processes. The original event log is decomposed into multiple sub-logs, one for the higher-level activities and others for the activities within each sub-process. These sub-logs are then independently anonymized using differential privacy mechanisms. A crucial aspect of differential privacy, “parallel composition,” allows these independent anonymizations to maintain the overall privacy guarantee.
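The partitioning step can be sketched as follows: a mapping from low-level activities to higher-level sub-processes splits each trace into sub-traces, yielding one sub-log per sub-process. The activity names and the mapping below are hypothetical examples, not data from the paper.

```python
from collections import defaultdict

# Hypothetical event-abstraction mapping: low-level activity -> sub-process.
ABSTRACTION = {
    "check_id": "verification", "check_address": "verification",
    "approve": "decision", "reject": "decision",
}

def partition_log(log, abstraction):
    """Decompose an event log into sub-logs, one per sub-process,
    by grouping each trace's activities under the abstraction mapping."""
    sub_logs = defaultdict(list)
    for trace in log:
        sub_traces = defaultdict(list)
        for activity in trace:
            sub_traces[abstraction[activity]].append(activity)
        for name, sub_trace in sub_traces.items():
            sub_logs[name].append(sub_trace)
    return dict(sub_logs)

log = [["check_id", "check_address", "approve"],
       ["check_id", "reject"]]
sub_logs = partition_log(log, ABSTRACTION)
# Each sub-log can now be anonymized independently. Because every event
# falls into exactly one partition, parallel composition of differential
# privacy keeps the overall guarantee at the per-sub-log epsilon.
```

This disjointness is exactly what parallel composition requires: the privacy budget is not consumed additively across sub-logs, since no individual event appears in more than one of them.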
Evaluating the Impact: Key Findings
The researchers conducted extensive evaluations using three real-world event logs (BPIC2012, BPIC2015, and BPIC2017) and two common process discovery techniques (Inductive Miner and Heuristic Miner). They investigated two primary questions:
1. How does partitioning before anonymization affect utility?
When using “directly-follows-based” anonymization techniques, such as DF-Laplace, the impact of partitioning was significant. While fitness (how well the discovered model can replay the behavior in the log) remained similar, precision (how well the model avoids allowing behaviors not in the log) saw substantial improvements. This boost in precision led to higher F1-scores (the harmonic mean of fitness and precision), indicating that partitioning before anonymization can indeed enhance the quality of discovered process models.
However, for “trace-variant-based” anonymization techniques like SaCoFa, the benefits of partitioning were less pronounced. SaCoFa already maintains relatively stable precision levels, so partitioning did not offer the same dramatic improvements observed with DF-Laplace. In some cases, partitioning helped stabilize the quality of anonymization, reducing variation in precision.
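The F1-score mentioned above can be computed directly; the numbers below are illustrative, not results from the paper, but they show why a precision gain alone lifts F1 even when fitness is unchanged.

```python
def f1_score(fitness: float, precision: float) -> float:
    """Harmonic mean of fitness and precision, the combined quality
    measure commonly used to evaluate discovered process models."""
    if fitness + precision == 0:
        return 0.0
    return 2 * fitness * precision / (fitness + precision)

# Illustrative: fitness flat at 0.90, precision improved by partitioning.
before = f1_score(0.90, 0.40)  # without partitioning
after = f1_score(0.90, 0.70)   # with partitioning before anonymization
```

Because the harmonic mean is dominated by the weaker of the two values, recovering precision is the most direct route to a better combined score.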
2. How does the order of partitioning and anonymization impact utility?
For DF-Laplace, performing partitioning before anonymization consistently yielded better performance across all evaluation metrics, especially precision. This suggests that pre-processing the data by partitioning helps preserve more valuable information during the anonymization process. For SaCoFa, the order had minimal impact, reinforcing the observation that SaCoFa’s inherent design already handles precision well.
Considering the Trade-offs
While promising, the proposed pipeline also introduces certain trade-offs. Event log partitioning itself can cause some information loss, which must be weighed against the utility gains from reduced noise. Additionally, the pipeline multiplies the configuration choices: one must now select both an anonymization technique and an abstraction technique for partitioning, which adds complexity and requires more domain expertise to find the optimal combination. Finally, while differential privacy guarantees individual privacy, anonymizing sub-processes independently might allow an adversary to infer more about a specific sub-process than if the entire log were anonymized as a whole, though the overall privacy guarantee remains.
Looking Ahead
This research marks a significant step in privacy-aware process mining, demonstrating that strategic pre-processing like event data partitioning can unlock higher utility in anonymized event logs, particularly for directly-follows-based anonymization techniques. Future work aims to explore other pre-processing techniques and develop a framework to guide users in selecting the most effective combination of pre-processing and anonymization based on the specific characteristics of their event logs.


