
Optimizing AIOps Capacity Forecasting: When to Retrain Models for Evolving Data

TLDR: This research investigates strategies for adapting AIOps capacity forecasting models to continuously changing operational data. It compares traditional periodic retraining with a novel drift-based retraining approach using the FEDD detector. The study, conducted with ING Bank’s CPU and memory utilization data, found that drift-based retraining significantly reduces the frequency of model updates (up to 67% savings) while maintaining comparable forecasting accuracy in most scenarios. However, for data exhibiting rapid, short-duration changes, periodic retraining remains superior. The findings offer practical insights for balancing model accuracy and computational efficiency in real-world AIOps systems.

In the fast-paced world of IT, ensuring that systems have enough resources to meet demand is crucial. This process, known as capacity management, used to be a manual, time-consuming task. However, with the rise of Artificial Intelligence for IT Operations (AIOps), machine learning models are now being used to predict future resource needs, such as CPU and memory utilization, automating a significant part of this critical function.

A major challenge for these AIOps forecasting models is that real-world operational data is constantly changing. Factors like shifts in customer behavior or infrastructure updates (e.g., software or hardware upgrades) can cause fundamental changes in the data, a phenomenon known as ‘concept drift’. These changes can degrade the performance and reliability of forecasting models over time, making it essential to update them regularly.

A common solution is ‘periodic retraining,’ where models are updated at fixed intervals, for example every month. While this keeps models reasonably current, it can be computationally expensive and difficult to scale, especially when managing hundreds or thousands of different data streams (time series).

This research explores a more efficient alternative: ‘drift-based retraining.’ Instead of retraining models on a fixed schedule, this approach only updates them when a significant change in the underlying data is detected. The study investigates whether this method can achieve comparable forecasting accuracy to periodic retraining while significantly reducing computational overhead.
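To make the contrast concrete, here is a minimal Python sketch of the two retraining policies. The helper names (should_retrain_periodic, DriftDetector) are illustrative stand-ins, not code from the paper or from any particular library:

```python
from datetime import datetime, timedelta

RETRAIN_INTERVAL = timedelta(days=30)  # monthly, matching the study's baseline

def should_retrain_periodic(now: datetime, last_trained: datetime) -> bool:
    """Periodic policy: retrain once a fixed interval has elapsed,
    whether or not the data has actually changed."""
    return now - last_trained >= RETRAIN_INTERVAL

class DriftDetector:
    """Minimal stand-in for a time-series drift detector such as FEDD."""

    def update(self, window) -> None:
        """Feed the newest observations; a real detector would compare
        features of `window` against a reference here."""

    def drift_detected(self) -> bool:
        return False  # placeholder; a real detector returns its test outcome

def should_retrain_on_drift(detector: DriftDetector, window) -> bool:
    """Drift-based policy: retrain only when a change is signalled."""
    detector.update(window)
    return detector.drift_detected()
```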

The researchers conducted an empirical study using a capacity forecasting model developed by ING Bank, their industry partner. This model predicts CPU and memory utilization for thousands of machines based on historical time series data. For their experiments, they focused on 16 representative time series collected over nine months.

To detect data changes, the study employed a technique called Feature Extraction Drift Detection (FEDD). FEDD is designed for time series data and identifies drift by analyzing features extracted from the data itself, rather than continuously monitoring the model’s prediction errors. It was chosen for its efficiency and scalability, as it doesn’t require storing the entire historical data, which is a key consideration for large-scale applications.
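The full FEDD algorithm is beyond the scope of a summary, but its core idea, comparing a feature vector of recent data against a reference, can be sketched roughly. The hand-picked features, cosine distance, and fixed threshold below are simplifying assumptions; the published method uses a richer feature set and monitors a correlation-based distance statistically. Note that only the reference feature vector needs to be kept, which is why the approach avoids storing the full history:

```python
import numpy as np

def extract_features(window: np.ndarray) -> np.ndarray:
    """A tiny illustrative feature vector; FEDD itself uses a richer set
    (autocorrelations, skewness, kurtosis, turning points, and more)."""
    diffs = np.diff(window)
    return np.array([
        window.mean(),
        window.std(),
        np.corrcoef(window[:-1], window[1:])[0, 1],         # lag-1 autocorrelation
        np.mean(np.sign(diffs[:-1]) != np.sign(diffs[1:])), # turning-point rate
    ])

def feature_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance between feature vectors (a stand-in for the
    correlation-based distance used by FEDD)."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def fedd_like_drift(reference: np.ndarray, current: np.ndarray,
                    threshold: float = 0.1) -> bool:
    """Flag drift when the current window's features move too far from
    the reference window's features."""
    return feature_distance(extract_features(reference),
                            extract_features(current)) > threshold
```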

The core of the investigation compared the performance of models retrained monthly against those retrained only when FEDD signaled a data drift. Model accuracy was measured using the Mean Absolute Scaled Error (MASE), where lower values indicate better performance. The study also quantified ‘retraining savings’ by comparing how many times models needed to be updated under each approach.
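Since MASE anchors every accuracy comparison in the study, it is worth pinning down: it scales the model's mean absolute error by the in-sample error of a naive forecast, so values below 1 mean the model beats simply repeating a past value. A minimal sketch, with the seasonal period m left as a parameter (the paper's exact choice is not restated here):

```python
import numpy as np

def mase(y_true, y_pred, y_train, m: int = 1) -> float:
    """Mean Absolute Scaled Error: forecast MAE divided by the in-sample
    MAE of a naive forecast that repeats the value from m steps earlier.
    Lower is better; below 1.0 beats the naive baseline."""
    y_true, y_pred, y_train = map(np.asarray, (y_true, y_pred, y_train))
    mae_forecast = np.mean(np.abs(y_true - y_pred))
    mae_naive = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return mae_forecast / mae_naive
```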

The findings revealed that drift-based retraining using FEDD led to substantial savings in retraining frequency, reducing the number of required updates by 50% to 67% across various time series. In most cases, this significant reduction in retraining overhead did not come at the cost of forecasting accuracy; the models achieved comparable, and sometimes even improved, performance compared to monthly retraining.
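The ‘retraining savings’ figure itself is simple arithmetic: one minus the ratio of drift-triggered retrains to periodic retrains. The counts in this example are illustrative, not the paper's per-series numbers:

```python
def retraining_savings(periodic_retrains: int, drift_retrains: int) -> float:
    """Fraction of retraining runs avoided by the drift-based policy."""
    return 1.0 - drift_retrains / periodic_retrains

# e.g., 9 monthly retrains vs. 3 drift-triggered retrains over nine months
print(f"{retraining_savings(9, 3):.0%}")  # -> 67%
```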

However, the study also identified a crucial exception. For time series exhibiting very short and sudden changes in data patterns, such as those observed for ‘Machine 3’ in the study, periodic retraining still proved to be more effective. This is because FEDD, after detecting a drift, enters a “cool-down” period to avoid false alarms, during which it might miss subsequent rapid changes. In these specific scenarios, retraining more frequently was essential to maintain maximum forecasting accuracy.
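A toy simulation makes this failure mode visible; the signal pattern and five-step cool-down below are invented purely for illustration:

```python
def drifts_after_cooldown(drift_signals, cooldown: int):
    """Suppress detections for `cooldown` steps after each accepted drift,
    mimicking the false-alarm guard described above."""
    accepted, silent_until = [], -1
    for t, is_drift in enumerate(drift_signals):
        if is_drift and t > silent_until:
            accepted.append(t)
            silent_until = t + cooldown
    return accepted

# Two rapid changes at t=5 and t=8: with a 5-step cool-down, the second
# drift is swallowed, which is why such series favored periodic retraining.
signals = [False] * 5 + [True] + [False] * 2 + [True] + [False] * 3
print(drifts_after_cooldown(signals, cooldown=5))  # -> [5]
```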

The research highlights practical implications for designing AIOps forecasting systems. While FEDD offers a scalable and efficient way to manage model updates, practitioners need to consider how to handle missing data, as FEDD is not inherently designed for it. The study suggests that for time series with frequent, sudden changes, a hybrid approach—combining periodic retraining for volatile data with drift-based retraining for more stable data—could offer the best balance of efficiency and accuracy.
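One plausible way to wire up such a hybrid is a per-series routing rule; the volatility measure and threshold here are assumptions for illustration, not prescriptions from the study:

```python
def choose_policy(changes_per_month: float, threshold: float = 1.0) -> str:
    """Route volatile series to periodic retraining and stabler series
    to drift-based retraining (threshold is an assumed tuning knob)."""
    return "periodic" if changes_per_month > threshold else "drift-based"

# e.g., a series like 'Machine 3' with frequent sudden shifts stays periodic
print(choose_policy(2.5))  # -> periodic
print(choose_policy(0.3))  # -> drift-based
```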


This work provides valuable insights for software teams aiming to enhance their forecasting systems, demonstrating how intelligent retraining strategies can reduce operational costs while maintaining robust performance in dynamic IT environments. For more detailed information, you can refer to the full research paper here.

