TLDR: A new research paper introduces a pipeline that improves privacy-aware process discovery by combining event data partitioning with anonymization. By segmenting event logs into sub-logs and anonymizing them separately, the approach significantly reduces the noise needed for privacy protection, leading to better quality process models, especially for directly-follows-based anonymization techniques. The study shows that partitioning *before* anonymization is generally more effective, though it introduces trade-offs like increased complexity and potential minor information loss.
Information systems are the backbone of modern business operations, recording every step of a process in what are known as event logs. These logs are invaluable for “process mining,” a field that extracts insights to improve efficiency and understanding of how work flows. However, these same event logs often contain highly sensitive information about customers, patients, and employees, posing significant privacy challenges.
Traditional methods of protecting this sensitive data involve “anonymization,” which obscures individual details while still permitting useful analysis. The challenge lies in striking a delicate balance: the more complex an event log, the harder it is to anonymize without destroying its analytical value, a property known as “utility.”
A new research paper, titled “The Impact of Event Data Partitioning on Privacy-aware Process Discovery,” proposes an innovative solution to this dilemma. The authors, Jungeun Lim, Stephan A. Fahrenkrog-Petersen, Xixi Lu, Jan Mendling, and Minseok Song, introduce a pipeline that combines anonymization with “event data partitioning.” This approach leverages “event abstraction” to segment large, complex event logs into smaller, more manageable sub-logs. Each of these sub-logs can then be anonymized separately, a method designed to preserve privacy while significantly reducing the loss of utility.
The Core Idea: Reducing Noise Through Segmentation
The fundamental hypothesis behind this work is that by breaking down an event log into its constituent sub-processes, the amount of “noise” required for anonymization can be drastically reduced. Noise is often inserted into event logs to provide “differential privacy,” a strong privacy guarantee. When noise is applied to an entire, unstructured log, it can introduce unrealistic behaviors, severely impacting the quality of the discovered process models. However, if the process is first understood in terms of its sub-processes (e.g., Sub-process A, Sub-process B), noise can be applied more precisely to each sub-process independently. This targeted approach minimizes the introduction of irrelevant or misleading data, thereby enhancing the utility of the anonymized log for process discovery.
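To make the noise-injection idea concrete, here is a minimal sketch of the kind of mechanism involved: Laplace noise added to directly-follows counts to satisfy differential privacy. The function names, the epsilon value, and the toy counts are illustrative assumptions, not the paper's implementation.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample from a Laplace(0, scale) distribution via the inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def anonymize_df_counts(df_counts: dict, epsilon: float) -> dict:
    """Add Laplace noise (scale 1/epsilon) to each directly-follows count,
    rounding and clamping negatives to zero so counts stay valid."""
    return {
        pair: max(0, round(count + laplace_noise(1.0 / epsilon)))
        for pair, count in df_counts.items()
    }

# Toy directly-follows counts for a whole log. When noise is applied to
# every activity pair of an unpartitioned log, rare pairs (counts 4-5
# below) are easily drowned out or spurious pairs appear; applying the
# same mechanism per sub-log touches far fewer pairs at a time.
full_log = {("a", "b"): 40, ("b", "c"): 38, ("c", "d"): 5, ("d", "e"): 4}
noisy = anonymize_df_counts(full_log, epsilon=1.0)
```

The clamping step is one reason noise hurts model quality: it biases small counts, so the fewer pairs each mechanism sees, the less distortion accumulates.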
How the Pipeline Works
The proposed pipeline involves two main steps: first, event data partitioning, and then anonymization. Event abstraction is key to partitioning, where low-level activities are mapped to higher-level activities, effectively identifying sub-processes. The original event log is decomposed into multiple sub-logs, one for the higher-level activities and others for the activities within each sub-process. These sub-logs are then independently anonymized using differential privacy mechanisms. A crucial aspect of differential privacy, “parallel composition,” allows these independent anonymizations to maintain the overall privacy guarantee.
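The partitioning step can be sketched as follows: a mapping from low-level activities to higher-level sub-processes splits each trace into sub-traces, yielding one sub-log per sub-process. The activity names and the mapping below are hypothetical examples, not data from the paper.

```python
from collections import defaultdict

# Hypothetical event-abstraction mapping: low-level activity -> sub-process.
ABSTRACTION = {
    "check_id": "verification", "check_address": "verification",
    "approve": "decision", "reject": "decision",
}

def partition_log(log, abstraction):
    """Decompose an event log into sub-logs, one per sub-process,
    by grouping each trace's activities under the abstraction mapping."""
    sub_logs = defaultdict(list)
    for trace in log:
        sub_traces = defaultdict(list)
        for activity in trace:
            sub_traces[abstraction[activity]].append(activity)
        for name, sub_trace in sub_traces.items():
            sub_logs[name].append(sub_trace)
    return dict(sub_logs)

log = [["check_id", "check_address", "approve"],
       ["check_id", "reject"]]
sub_logs = partition_log(log, ABSTRACTION)
# Each sub-log can now be anonymized independently. Because every event
# falls into exactly one partition, parallel composition of differential
# privacy keeps the overall guarantee at the per-sub-log epsilon.
```

This disjointness is exactly what parallel composition requires: the privacy budget is not consumed additively across sub-logs, since no individual event appears in more than one of them.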
Evaluating the Impact: Key Findings
The researchers conducted extensive evaluations using three real-world event logs (BPIC2012, BPIC2015, and BPIC2017) and two common process discovery techniques (Inductive Miner and Heuristic Miner). They investigated two primary questions:
1. How does partitioning before anonymization affect utility?
When using “directly-follows-based” anonymization techniques, such as DF-Laplace, the impact of partitioning was significant. While fitness (how well the discovered model can replay the behavior in the log) remained similar, precision (how well the model avoids allowing behaviors not in the log) saw substantial improvements. This boost in precision led to higher F1-scores (the harmonic mean of fitness and precision), indicating that partitioning before anonymization can indeed enhance the quality of discovered process models.
However, for “trace-variant-based” anonymization techniques like SaCoFa, the benefits of partitioning were less pronounced. SaCoFa already maintains relatively stable precision levels, so partitioning did not offer the same dramatic improvements observed with DF-Laplace. In some cases, partitioning helped stabilize the quality of anonymization, reducing variation in precision.
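The F1-score mentioned above can be computed directly; the numbers below are illustrative, not results from the paper, but they show why a precision gain alone lifts F1 even when fitness is unchanged.

```python
def f1_score(fitness: float, precision: float) -> float:
    """Harmonic mean of fitness and precision, the combined quality
    measure commonly used to evaluate discovered process models."""
    if fitness + precision == 0:
        return 0.0
    return 2 * fitness * precision / (fitness + precision)

# Illustrative: fitness flat at 0.90, precision improved by partitioning.
before = f1_score(0.90, 0.40)  # without partitioning
after = f1_score(0.90, 0.70)   # with partitioning before anonymization
```

Because the harmonic mean is dominated by the weaker of the two values, recovering precision is the most direct route to a better combined score.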
2. How does the order of partitioning and anonymization impact utility?
For DF-Laplace, performing partitioning before anonymization consistently yielded better performance across all evaluation metrics, especially precision. This suggests that pre-processing the data by partitioning helps preserve more valuable information during the anonymization process. For SaCoFa, the order had minimal impact, reinforcing the observation that SaCoFa’s inherent design already handles precision well.
Considering the Trade-offs
While promising, the proposed pipeline also introduces certain trade-offs. Event log partitioning itself can cause some information loss, which must be weighed against the utility gains from reduced noise. Additionally, the pipeline multiplies the configuration choices: one must now select both an anonymization technique and an abstraction technique for partitioning, which adds complexity and requires more domain expertise to find the optimal combination. Finally, while differential privacy guarantees individual privacy, anonymizing sub-processes independently might allow an adversary to infer more about a specific sub-process than if the entire log were anonymized as a whole, though the overall privacy guarantee remains.
Looking Ahead
This research marks a significant step in privacy-aware process mining, demonstrating that strategic pre-processing like event data partitioning can unlock higher utility in anonymized event logs, particularly for directly-follows-based anonymization techniques. Future work aims to explore other pre-processing techniques and develop a framework to guide users in selecting the most effective combination of pre-processing and anonymization based on the specific characteristics of their event logs.


