
Enhancing Anomaly Detection: The Synergy of Active and Transfer Learning

TLDR: This research paper investigates combining active learning and transfer learning for anomaly detection in time-series data, particularly in cloud systems. It finds that active learning performs best without data clustering when integrated with transfer learning. While active learning consistently improves model performance, the rate of improvement per sample is generally linear and modest. A key finding is that this combined approach can outperform models trained solely on in-domain data, sometimes requiring less labeled target data, although performance eventually deteriorates with excessive sampling.

In the rapidly expanding world of cloud services, ensuring high availability is paramount. This requires the swift and accurate detection of anomalies, which are often indicators of underlying system issues. However, modern cloud systems generate vast amounts of monitoring data, making manual analysis impossible. Supervised machine learning offers a solution, but it relies on large, labeled datasets, which are expensive and labor-intensive to create. This challenge is particularly acute when dealing with new systems or diverse data streams.

A recent research paper, “Active Learning and Transfer Learning for Anomaly Detection in Time-Series Data”, explores a promising solution: combining active learning and transfer learning. This approach aims to reduce the need for extensive manual labeling while still achieving high model performance in detecting anomalies across different time-series datasets.

Understanding Active Learning

Active learning operates on the principle that a small, carefully selected dataset can achieve comparable model performance to a much larger one. Its primary goal is to minimize the cost of data labeling by intelligently choosing the most informative data points to label. The process is iterative: a model is trained on a small amount of labeled data, its performance is used to identify new, useful data points, these points are then labeled, and the model is retrained with the expanded dataset. Essentially, the model guides the selection of data used to improve itself.

There are three main scenarios for active learning: Stream-Based Selective Sampling (examining data points one by one), Pool-Based Sampling (selecting instances from a pool based on an ‘informativeness score’), and Membership Query Synthesis (generating new instances). This research focuses on Pool-Based Sampling, where an ‘acquisition function’ selects which data points to label. This function balances two key considerations: maximizing the information gained from selected points and minimizing redundant information from similar data points.

The acquisition function used in this paper, inspired by previous work, integrates two components. First, it measures a model’s uncertainty about a data point; points with high uncertainty are considered more informative. This is calculated by ranking data points based on the absolute difference between the predicted probability of being normal versus anomalous. Second, it incorporates a measure of context diversity. This prevents the selection of data points that are too similar or occur too close together in a time series, ensuring a broader range of information is added to the training set.
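The acquisition step described above can be sketched as follows. This is a minimal illustration, not the paper’s implementation: the function name, the `min_gap` diversity rule, and the toy inputs are assumptions, but it captures the two components, uncertainty ranking via |P(normal) − P(anomalous)| and time-based context diversity.

```python
import numpy as np

def acquisition_scores(probs, timestamps, min_gap=10):
    """Rank unlabeled points by uncertainty, enforcing time diversity.

    probs: (n, 2) predicted probabilities [P(normal), P(anomalous)]
    timestamps: (n,) positions of each point in the time series
    min_gap: hypothetical minimum time separation between selected points
    """
    # Uncertainty: a small |P(normal) - P(anomalous)| gap means the model
    # is unsure about the point, so it is more informative to label.
    uncertainty = 1.0 - np.abs(probs[:, 0] - probs[:, 1])
    order = np.argsort(-uncertainty)  # most uncertain first

    selected = []
    for i in order:
        # Context diversity: skip points too close in time to ones already chosen
        if all(abs(timestamps[i] - timestamps[j]) >= min_gap for j in selected):
            selected.append(int(i))
    return selected
```

With four candidate points, the two most uncertain ones are kept only if they are at least `min_gap` steps apart; near-duplicates of an already-selected point are skipped.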

The Combined Approach

The framework combines active learning with transfer learning. Transfer learning involves training a base model on a labeled dataset from a domain similar to the target domain where the model will be deployed. This pre-trained model then forms the foundation for the active learning process, which also requires a sample of unlabeled data from the target domain. Over a set number of iterations, the base model scores the unlabeled target data, ranks points by uncertainty, and selects a pre-set number of the most uncertain and diverse points. These points are then labeled, added to the training dataset, and the base model is retrained. The cycle repeats until the predetermined number of iterations is complete, resulting in a refined model.

A crucial difference in this study’s experimental design compared to some previous work is the use of separate data pools for active learning sampling and model evaluation. This means the test set remains constant throughout the active learning process, providing a more valid measure of the model’s ability to generalize to new, unseen data, even if it results in a slower observed rate of performance improvement.
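Under stated assumptions (synthetic Gaussian data standing in for the source and target domains, a scikit-learn Random Forest as the base model, and uncertainty-only sampling without the diversity term), the combined workflow with a constant held-out test set might look like this sketch:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a labeled source-domain set, an unlabeled target
# pool for active learning, and a separate fixed target test set.
def make_domain(n, shift):
    X = rng.normal(shift, 1.0, size=(n, 4))
    y = (X[:, 0] > shift + 1.2).astype(int)  # rare "anomalies"
    return X, y

X_src, y_src = make_domain(500, 0.0)    # source domain (transfer learning)
X_pool, y_pool = make_domain(300, 0.5)  # target pool (labels revealed on query)
X_test, y_test = make_domain(300, 0.5)  # constant test set, never sampled from

# Base model pre-trained on the source domain
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_src, y_src)

X_train, y_train = X_src.copy(), y_src.copy()
remaining = list(range(len(X_pool)))
batch = 10

for _ in range(5):  # fixed number of active-learning iterations
    probs = model.predict_proba(X_pool[remaining])
    # Most uncertain points: smallest |P(normal) - P(anomalous)|
    uncertainty_order = np.argsort(np.abs(probs[:, 0] - probs[:, 1]))
    picked = [remaining[i] for i in uncertainty_order[:batch]]
    remaining = [i for i in remaining if i not in picked]
    # "Label" the queried points and retrain the base model
    X_train = np.vstack([X_train, X_pool[picked]])
    y_train = np.concatenate([y_train, y_pool[picked]])
    model.fit(X_train, y_train)

# Evaluation always uses the same untouched test set
final_f1 = f1_score(y_test, model.predict(X_test))
print(f"F1 on the constant test set: {final_f1:.3f}")
```

Keeping `X_test` outside the sampling pool mirrors the study’s design choice: the score reflects generalization to unseen data rather than performance on points the loop has already consumed.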

Key Experiments and Findings

The researchers conducted experiments using six diverse datasets: NAB (AWS and Twitter), Yahoo (Real and Artificial), IOPS KPI, and Huawei. They used a Random Forest model with a specific feature set and evaluated performance using F1 score, Precision, and Recall, with F1 being the primary focus due to its balance between precision and recall.
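As a quick refresher on the metrics used (the numbers here are toy values, not the paper’s results):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy ground truth and predictions for an anomaly detector (1 = anomaly)
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 1, 1, 0, 0, 1, 0, 0]

p = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2/3
r = recall_score(y_true, y_pred)     # TP / (TP + FN) = 2/3
f1 = f1_score(y_true, y_pred)        # harmonic mean of p and r = 2/3
```

F1 is a natural primary metric here because anomalies are rare: plain accuracy would reward a model that never flags anything, while F1 penalizes both missed anomalies and false alarms.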

Experiment 1: Clustering’s Role

Previous research suggested that clustering target data into sub-domains could benefit transfer learning for anomaly detection. This experiment investigated how clustering interacts with active learning. The findings revealed a clear trend: active learning performs best when a single cluster (effectively, no clustering) is used. While clustering might be beneficial before active learning, by the time a significant number of points are added via active learning, the best performance is consistently achieved without clustering. This suggests that active learning’s selection process is more effective when data is not fragmented across multiple clusters, which can dilute the impact of new samples and exacerbate label imbalances in smaller clusters.

Experiment 2: How Fast Does Performance Improve?

This experiment focused on the rate of model performance improvement as active learning adds more samples, specifically restricting the analysis to the single-cluster scenario based on Experiment 1’s results. The study found that model performance generally improves as active learning progresses across all datasets. However, excluding the Huawei dataset, the improvement was relatively linear and quite small per point added. The Huawei dataset was an outlier, showing significant positive impact over the first 100 samples. The researchers noted that direct comparison with some prior work is difficult due to differences in experimental design, particularly their use of a consistent test set, which provides a more robust measure of generalization.

Experiment 3: Pushing the Limits of Active Learning

The final experiment explored how model performance changes when a very large number of samples are added through active learning, even up to 100% of the target training set. It also compared this combined approach to training a model purely on in-domain data. The results showed that system performance continues to improve with active learning up to a significant number of samples, but eventually, it begins to deteriorate. A potential explanation for this ‘tail-off’ is that as nearly all available samples are added, the active learning process is forced to include less useful points, potentially shifting the model’s learned distribution back towards the source domain and away from the target’s unique characteristics.

Crucially, the study found that transfer learning combined with active learning can outperform systems trained solely on in-domain data. For all datasets, the best performance was achieved with the combined approach. This is particularly interesting because, in some cases, the combined method achieved superior performance using less target domain training data than the purely in-domain approach.

Conclusion

The research concludes that while transfer learning alone might benefit from multiple clusters, combining it with active learning yields the best results when clustering is removed from the process. The per-sample improvement from active learning is generally stable but small, with the Huawei dataset being a notable exception. Most importantly, the study demonstrates that the synergy of transfer learning and active learning can surpass the performance of models trained exclusively on target-domain data, sometimes even with less labeled target data. This highlights the potential of these combined techniques to make anomaly detection in complex cloud systems more efficient and effective by reducing the reliance on extensive manual data labeling.

Nikhil Patel (https://blogs.edgentiq.com)
Nikhil Patel is a tech analyst and AI news reporter who brings a practitioner's perspective to every article. With prior experience at an AI startup, he decodes the business mechanics behind product innovations, funding trends, and partnerships in the GenAI space. Nikhil's insights are sharp, forward-looking, and trusted by insiders and newcomers alike. You can reach him at: [email protected]
