
Optimizing AIOps Capacity Forecasting: When to Retrain Models for Evolving Data

TLDR: This research investigates strategies for adapting AIOps capacity forecasting models to continuously changing operational data. It compares traditional periodic retraining with a novel drift-based retraining approach using the FEDD detector. The study, conducted with ING Bank’s CPU and memory utilization data, found that drift-based retraining significantly reduces the frequency of model updates (up to 67% savings) while maintaining comparable forecasting accuracy in most scenarios. However, for data exhibiting rapid, short-duration changes, periodic retraining remains superior. The findings offer practical insights for balancing model accuracy and computational efficiency in real-world AIOps systems.

In the fast-paced world of IT, ensuring that systems have enough resources to meet demand is crucial. This process, known as capacity management, used to be a manual, time-consuming task. However, with the rise of Artificial Intelligence for IT Operations (AIOps), machine learning models are now being used to predict future resource needs, such as CPU and memory utilization, automating a significant part of this critical function.

A major challenge for these AIOps forecasting models is that real-world operational data is constantly changing. Factors like shifts in customer behavior or infrastructure updates (e.g., software or hardware upgrades) can cause fundamental changes in the data, a phenomenon known as ‘concept drift’. These changes can degrade the performance and reliability of forecasting models over time, making it essential to update them regularly.

A common solution is ‘periodic retraining,’ where models are updated at fixed intervals, for example every month. While this keeps models reasonably current, it can be computationally expensive and difficult to scale, especially when managing hundreds or thousands of different data streams (time series).

This research explores a more efficient alternative: ‘drift-based retraining.’ Instead of retraining models on a fixed schedule, this approach only updates them when a significant change in the underlying data is detected. The study investigates whether this method can achieve comparable forecasting accuracy to periodic retraining while significantly reducing computational overhead.
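To make the contrast concrete, here is a minimal Python sketch of the two retraining policies. The helper names (should_retrain_periodic, DriftDetector) are illustrative stand-ins, not code from the paper or from any particular library:

```python
from datetime import datetime, timedelta

RETRAIN_INTERVAL = timedelta(days=30)  # monthly, matching the study's baseline

def should_retrain_periodic(now: datetime, last_trained: datetime) -> bool:
    """Periodic policy: retrain once a fixed interval has elapsed,
    whether or not the data has actually changed."""
    return now - last_trained >= RETRAIN_INTERVAL

class DriftDetector:
    """Minimal stand-in for a time-series drift detector such as FEDD."""

    def update(self, window) -> None:
        """Feed the newest observations; a real detector would compare
        features of `window` against a reference here."""

    def drift_detected(self) -> bool:
        return False  # placeholder; a real detector returns its test outcome

def should_retrain_on_drift(detector: DriftDetector, window) -> bool:
    """Drift-based policy: retrain only when a change is signalled."""
    detector.update(window)
    return detector.drift_detected()
```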

The researchers conducted an empirical study using a capacity forecasting model developed by ING Bank, their industry partner. This model predicts CPU and memory utilization for thousands of machines based on historical time series data. For their experiments, they focused on 16 representative time series collected over nine months.

To detect data changes, the study employed a technique called Feature Extraction Drift Detection (FEDD). FEDD is designed for time series data and identifies drift by analyzing features extracted from the data itself, rather than continuously monitoring the model’s prediction errors. It was chosen for its efficiency and scalability, as it doesn’t require storing the entire historical data, which is a key consideration for large-scale applications.
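The full FEDD algorithm is beyond the scope of a summary, but its core idea, comparing a feature vector of recent data against a reference, can be sketched roughly. The hand-picked features, cosine distance, and fixed threshold below are simplifying assumptions; the published method uses a richer feature set and monitors a correlation-based distance statistically. Note that only the reference feature vector needs to be kept, which is why the approach avoids storing the full history:

```python
import numpy as np

def extract_features(window: np.ndarray) -> np.ndarray:
    """A tiny illustrative feature vector; FEDD itself uses a richer set
    (autocorrelations, skewness, kurtosis, turning points, and more)."""
    diffs = np.diff(window)
    return np.array([
        window.mean(),
        window.std(),
        np.corrcoef(window[:-1], window[1:])[0, 1],         # lag-1 autocorrelation
        np.mean(np.sign(diffs[:-1]) != np.sign(diffs[1:])), # turning-point rate
    ])

def feature_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance between feature vectors (a stand-in for the
    correlation-based distance used by FEDD)."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def fedd_like_drift(reference: np.ndarray, current: np.ndarray,
                    threshold: float = 0.1) -> bool:
    """Flag drift when the current window's features move too far from
    the reference window's features."""
    return feature_distance(extract_features(reference),
                            extract_features(current)) > threshold
```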

The core of the investigation compared the performance of models retrained monthly against those retrained only when FEDD signaled a data drift. Model accuracy was measured using the Mean Absolute Scaled Error (MASE), where lower values indicate better performance. The study also quantified ‘retraining savings’ by comparing how many times models needed to be updated under each approach.
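Since MASE anchors every accuracy comparison in the study, it is worth pinning down: it scales the model's mean absolute error by the in-sample error of a naive forecast, so values below 1 mean the model beats simply repeating a past value. A minimal sketch, with the seasonal period m left as a parameter (the paper's exact choice is not restated here):

```python
import numpy as np

def mase(y_true, y_pred, y_train, m: int = 1) -> float:
    """Mean Absolute Scaled Error: forecast MAE divided by the in-sample
    MAE of a naive forecast that repeats the value from m steps earlier.
    Lower is better; below 1.0 beats the naive baseline."""
    y_true, y_pred, y_train = map(np.asarray, (y_true, y_pred, y_train))
    mae_forecast = np.mean(np.abs(y_true - y_pred))
    mae_naive = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return mae_forecast / mae_naive
```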

The findings revealed that drift-based retraining using FEDD led to substantial savings in retraining frequency, reducing the number of required updates by 50% to 67% across various time series. In most cases, this significant reduction in retraining overhead did not come at the cost of forecasting accuracy; the models achieved comparable, and sometimes even improved, performance compared to monthly retraining.
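The ‘retraining savings’ figure itself is simple arithmetic: one minus the ratio of drift-triggered retrains to periodic retrains. The counts in this example are illustrative, not the paper's per-series numbers:

```python
def retraining_savings(periodic_retrains: int, drift_retrains: int) -> float:
    """Fraction of retraining runs avoided by the drift-based policy."""
    return 1.0 - drift_retrains / periodic_retrains

# e.g., 9 monthly retrains vs. 3 drift-triggered retrains over nine months
print(f"{retraining_savings(9, 3):.0%}")  # -> 67%
```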

However, the study also identified a crucial exception. For time series exhibiting very short and sudden changes in data patterns, such as those observed for ‘Machine 3’ in the study, periodic retraining still proved to be more effective. This is because FEDD, after detecting a drift, enters a “cool-down” period to avoid false alarms, during which it might miss subsequent rapid changes. In these specific scenarios, retraining more frequently was essential to maintain maximum forecasting accuracy.
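A toy simulation makes this failure mode visible; the signal pattern and five-step cool-down below are invented purely for illustration:

```python
def drifts_after_cooldown(drift_signals, cooldown: int):
    """Suppress detections for `cooldown` steps after each accepted drift,
    mimicking the false-alarm guard described above."""
    accepted, silent_until = [], -1
    for t, is_drift in enumerate(drift_signals):
        if is_drift and t > silent_until:
            accepted.append(t)
            silent_until = t + cooldown
    return accepted

# Two rapid changes at t=5 and t=8: with a 5-step cool-down, the second
# drift is swallowed, which is why such series favored periodic retraining.
signals = [False] * 5 + [True] + [False] * 2 + [True] + [False] * 3
print(drifts_after_cooldown(signals, cooldown=5))  # -> [5]
```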

The research highlights practical implications for designing AIOps forecasting systems. While FEDD offers a scalable and efficient way to manage model updates, practitioners need to consider how to handle missing data, as FEDD is not inherently designed for it. The study suggests that for time series with frequent, sudden changes, a hybrid approach—combining periodic retraining for volatile data with drift-based retraining for more stable data—could offer the best balance of efficiency and accuracy.
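One plausible way to wire up such a hybrid is a per-series routing rule; the volatility measure and threshold here are assumptions for illustration, not prescriptions from the study:

```python
def choose_policy(changes_per_month: float, threshold: float = 1.0) -> str:
    """Route volatile series to periodic retraining and stabler series
    to drift-based retraining (threshold is an assumed tuning knob)."""
    return "periodic" if changes_per_month > threshold else "drift-based"

# e.g., a series like 'Machine 3' with frequent sudden shifts stays periodic
print(choose_policy(2.5))  # -> periodic
print(choose_policy(0.3))  # -> drift-based
```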


This work provides valuable insights for software teams aiming to enhance their forecasting systems, demonstrating how intelligent retraining strategies can reduce operational costs while maintaining robust performance in dynamic IT environments. For more detailed information, you can refer to the full research paper here.

