Predicting Warehouse Completion Times: A Comparative Study of Machine Learning Approaches

TLDR: This research paper evaluates four different remaining time prediction approaches (LSTM, SuTraN, PGTNet, and XGBoost) in a real-life outbound warehouse process of an aviation logistics company. Using a novel public event log with 169,523 traces, the study found that deep learning models like SuTraN achieved the highest accuracy, but shallow methods such as XGBoost offered competitive accuracy with significantly fewer computational resources, making them highly efficient for quick retraining and real-time predictions. PGTNet struggled with overfitting in this specific case. The findings emphasize the trade-off between accuracy and computational cost, guiding companies in selecting appropriate predictive models for their specific process characteristics.

Predicting how long a task will take until completion is a crucial aspect of managing business operations efficiently. This field, known as Predictive Process Monitoring (PPM), focuses on forecasting the future progression of ongoing processes. One common goal is to estimate the ‘remaining time’ – the duration until a process execution is finished. Accurate remaining time predictions can significantly help businesses avoid delays, improve operational efficiency, and provide better estimates to customers, especially in time-sensitive industries.
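
To make the prediction target concrete, here is a minimal sketch of how the remaining time could be derived for every event of a completed trace. The function name, activity names, and timestamps are hypothetical and are not taken from the study's event log.

```python
from datetime import datetime


def remaining_times(trace):
    """Return the remaining time in minutes after each event of a completed trace.

    `trace` is a list of (activity, timestamp) tuples sorted by timestamp.
    The remaining time of an event is the time from that event to the last one.
    """
    end = trace[-1][1]
    return [(end - ts).total_seconds() / 60 for _, ts in trace]


# Hypothetical order; the activities and times are made up for illustration.
trace = [
    ("Order created", datetime(2024, 6, 3, 8, 0)),
    ("Picking", datetime(2024, 6, 3, 9, 30)),
    ("Packing", datetime(2024, 6, 3, 11, 0)),
    ("Shipped", datetime(2024, 6, 3, 14, 0)),
]

targets = remaining_times(trace)
print(targets)  # [360.0, 270.0, 180.0, 0.0]
```

A model is then trained to predict each of these values from the events observed up to that point.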

A recent study examined this challenge in a real-life outbound warehouse process of a logistics company specializing in aviation. The research, titled “Remaining Time Prediction in Outbound Warehouse Processes: A Case Study”, compared four remaining time prediction approaches on a publicly available event log containing 169,523 process traces.

The Warehouse Process Under Scrutiny

The case study focused on a logistics company providing services for the aviation industry, specifically handling smaller items like spare parts. The outbound warehouse process, while having a relatively straightforward and linear flow, experiences varying cycle times due to factors such as item type or weight. These variations make accurate delivery forecasts difficult, yet such forecasts are vital for customers to plan aircraft maintenance and repairs. The company provided an anonymized event log with 169,523 traces, each representing an order for a single item. This log included 24 attributes (20 categorical and 4 numerical) that could be used for prediction.

Methodology: Preparing Data and Selecting Models

Before applying predictive models, the researchers pre-processed the event log. They removed outliers, such as traces with impossible durations or durations exceeding half a year, and kept only traces from the current process version, which followed a concept drift in May 2024. This reduced the dataset to 41,927 traces. Feature selection was also critical: uninformative features were removed, and a subset of eleven features with high predictive power was chosen based on Mutual Information scores. Additional features that process managers knew to be predictive (e.g., time since the trace started, day of the week, number of concurrent traces) were also engineered.
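
The filtering and feature engineering steps described above might look roughly like the following sketch. It is not the authors' code: the 183-day limit, the exact cutoff date for the concept drift, the function names, and the sample traces are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

MAX_DURATION = timedelta(days=183)   # assumed reading of "half a year"
DRIFT_CUTOFF = datetime(2024, 6, 1)  # assumed cutoff; the article only says "May 2024"


def keep_trace(start, end):
    """Keep traces with a plausible duration that started after the drift."""
    duration = end - start
    return timedelta(0) <= duration <= MAX_DURATION and start >= DRIFT_CUTOFF


def engineer_features(traces, case_id, now):
    """Build the three engineered features for one running case at time `now`.

    `traces` maps a case id to its (start, end) timestamps.
    """
    start, _ = traces[case_id]
    concurrent = sum(
        1
        for other, (s, e) in traces.items()
        if other != case_id and s <= now < e
    )
    return {
        "minutes_since_start": (now - start).total_seconds() / 60,
        "day_of_week": now.weekday(),  # Monday is 0
        "concurrent_traces": concurrent,
    }


# Hypothetical traces: C predates the drift, D ends before it starts.
traces = {
    "A": (datetime(2024, 6, 3, 8, 0), datetime(2024, 6, 4, 8, 0)),
    "B": (datetime(2024, 6, 3, 9, 0), datetime(2024, 6, 3, 12, 0)),
    "C": (datetime(2024, 4, 1, 8, 0), datetime(2024, 4, 2, 8, 0)),
    "D": (datetime(2024, 6, 5, 8, 0), datetime(2024, 6, 4, 8, 0)),
}

kept = {cid: span for cid, span in traces.items() if keep_trace(*span)}
features = engineer_features(kept, "A", datetime(2024, 6, 3, 10, 0))
print(sorted(kept), features)
```

Selecting features by Mutual Information would be a separate step on top of this and is not shown here.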

The study then compared four distinct remaining time prediction approaches:

LSTM (Long Short-Term Memory): A data-aware deep learning approach known for handling sequential data.
SuTraN (Suffix Transformer Network): A novel transformer-based model utilizing encoder-decoder architecture.
PGTNet (Process Graph Transformer Network): A graph transformer-based approach that represents event logs as graphs.
XGBoost: A conventional boosting technique, serving as a less sophisticated baseline machine learning method.

These models were trained and evaluated using a 70-30 split for training and test sets, with a validation set for hyperparameter optimization. The Mean Absolute Error (MAE) was used as the primary evaluation metric.
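
As a rough illustration of the evaluation setup, the sketch below performs a 70-30 split and computes the MAE in plain Python. Whether the study split the traces chronologically or at random is not stated here, so the order-preserving split is an assumption, as are the function names and sample values.

```python
def split_ordered(cases, train_percent=70):
    """Split a list of cases into train and test parts, keeping their order.

    Assumes `cases` is already sorted, for example by start time.
    """
    cut = len(cases) * train_percent // 100
    return cases[:cut], cases[cut:]


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical case ids and remaining times in minutes.
cases = [f"case_{i}" for i in range(10)]
train, test = split_ordered(cases)
print(len(train), len(test))  # 7 3

mae = mean_absolute_error([100, 200, 300], [110, 180, 330])
print(mae)  # 20.0
```

Because the MAE is expressed in the same unit as the target, an MAE of 20 here means the predictions are off by 20 minutes on average.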

Key Findings: Accuracy vs. Efficiency

The results revealed a trade-off between predictive accuracy and computational resources:

SuTraN achieved the lowest MAE of 554 minutes, making it the most accurate predictor. However, it also required the longest training time (4.65 hours) and a comparatively high inference time (3.17 ms).
LSTM followed closely with an MAE of 568 minutes, requiring 1.26 hours for training and 0.63 ms for inference.
XGBoost, despite being a simpler model, showed competitive accuracy with an MAE of 613 minutes. Crucially, it was by far the most efficient, training in just 2 minutes and having the fastest inference time (0.10 ms).
PGTNet performed significantly worse in this specific case, with an MAE of 1390 minutes, likely due to overfitting on the given event log, suggesting its architecture might be too complex for this particular process.
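
The trade-off can be quantified directly from the figures listed above. The following sketch compares each model against SuTraN; PGTNet is left out because its training and inference times are not given in this article, and the helper name `relative_to` is hypothetical.

```python
# Figures as reported above: MAE in minutes, training time in minutes,
# inference time in milliseconds.
results = {
    "SuTraN": {"mae_min": 554, "train_min": 4.65 * 60, "infer_ms": 3.17},
    "LSTM": {"mae_min": 568, "train_min": 1.26 * 60, "infer_ms": 0.63},
    "XGBoost": {"mae_min": 613, "train_min": 2, "infer_ms": 0.10},
}


def relative_to(results, baseline):
    """Express each model's error and speed relative to the baseline model."""
    base = results[baseline]
    return {
        name: {
            "extra_mae_pct": 100 * (r["mae_min"] - base["mae_min"]) / base["mae_min"],
            "training_speedup": base["train_min"] / r["train_min"],
            "inference_speedup": base["infer_ms"] / r["infer_ms"],
        }
        for name, r in results.items()
    }


comparison = relative_to(results, "SuTraN")
for name, c in comparison.items():
    print(
        f"{name}: +{c['extra_mae_pct']:.1f}% MAE, "
        f"{c['training_speedup']:.1f}x faster training, "
        f"{c['inference_speedup']:.1f}x faster inference"
    )
```

On these figures, XGBoost's MAE is about 10.6% higher than SuTraN's, while it trains roughly 140 times faster and predicts roughly 32 times faster.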

Implications for Practice and Research

The study highlights that while deep learning models like SuTraN offer superior accuracy, simpler methods like XGBoost provide a viable alternative, especially when computational resources are limited or frequent retraining is necessary. XGBoost’s efficiency makes it highly suitable for real-time predictions and scenarios where quick model updates are essential. The researchers noted that LSTMs tended to underestimate remaining time, which could be more problematic than overestimation in customer-facing scenarios.

For practitioners, the research underscores the importance of selecting a predictive approach based on the specific characteristics of the process and the available resources. For researchers, the study indicates that there is still significant room for improvement: even the most accurate model did not bring the MAE below nine hours (554 minutes is about 9.2 hours), a substantial error compared to an average trace duration of 27 hours. The authors also made their anonymized event log publicly available, encouraging further research and experimentation within the process mining community.
