TLDR: A new research paper introduces UniBench300, a unified benchmark for multi-modal visual object tracking (MMVOT) that combines RGBT, RGBD, and RGBE data. It addresses the inconsistency between current training and testing paradigms and reduces inference time by approximately 27% compared with evaluating on separate benchmarks. The paper also proposes a “serial unification” approach that integrates new tasks progressively and uses continual learning (CL) to mitigate the performance degradation caused by knowledge forgetting. The study finds that degradation is linked to network capacity and modality discrepancy: larger networks, and modalities that differ less from RGB, degrade less.
Visual object tracking, which involves continuously predicting an object’s location and scale in a video, is increasingly relying on multiple data sources, known as multi-modal visual object tracking (MMVOT). Different modalities like thermal infrared (T), depth (D), and event (E) data offer unique advantages over traditional visible light (RGB) alone, enhancing robustness in various challenging environments. This has led to a growing interest in combining these strengths into a single, unified tracking system.
However, current approaches to unifying these MMVOT tasks share a structural weakness. Existing methods typically mix all types of data, such as RGBT (RGB+Thermal), RGBD (RGB+Depth), and RGBE (RGB+Event), into a single training process, which is referred to as a “parallel” training paradigm. While this aims for a comprehensive model, it creates an inconsistency: the model is trained on a mix of data but then evaluated separately on individual benchmarks for each modality. This mismatch between training and testing often leads to a noticeable drop in performance.
To address these critical issues, a recent research paper, “Serial Over Parallel: Learning Continual Unification for Multi-Modal Visual Object Tracking and Benchmarking”, introduces two key advancements. The first is a new unified benchmark called UniBench300. This benchmark is designed to bridge the gap between training and testing by incorporating RGBT, RGBD, and RGBE data simultaneously. UniBench300 consists of 300 video sequences, with 100 sequences for each of the RGBT, RGBD, and RGBE tasks, totaling 368.1K frames. By providing a single platform for evaluation, UniBench300 not only resolves the inconsistency but also significantly improves efficiency, reducing the inference time by approximately 27% compared to evaluating on separate benchmarks.
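As a rough illustration of what evaluating on a single mixed benchmark could look like, the sketch below builds a manifest with 100 sequences per task and scores it in one pass while still reporting per-task results. Everything here is hypothetical: the `TrackingSequence` type, the `build_manifest` and `evaluate` helpers, and the scoring callback are not part of UniBench300's actual tooling, and the uniform sequence length of 1,227 frames is only a placeholder chosen so the total matches the reported 368.1K frames.

```python
from collections import defaultdict
from dataclasses import dataclass

TASKS = ("RGBT", "RGBD", "RGBE")


@dataclass
class TrackingSequence:
    """Hypothetical record for one benchmark video."""
    name: str
    task: str        # one of TASKS
    num_frames: int


def build_manifest(per_task=100, frames_per_seq=1227):
    """Build a mixed manifest with `per_task` sequences for each task.

    frames_per_seq is a placeholder; real sequences vary in length.
    """
    return [
        TrackingSequence(f"{task.lower()}_{i:03d}", task, frames_per_seq)
        for task in TASKS
        for i in range(per_task)
    ]


def evaluate(manifest, score_fn):
    """Score every sequence in one pass; return per-task and overall means.

    score_fn is a hypothetical callback that runs a tracker on one
    sequence and returns a single number (for example a success rate).
    """
    per_task = defaultdict(list)
    for seq in manifest:
        per_task[seq.task].append(score_fn(seq))

    report = {task: sum(s) / len(s) for task, s in per_task.items()}
    all_scores = [s for scores in per_task.values() for s in scores]
    report["overall"] = sum(all_scores) / len(all_scores)
    return report


if __name__ == "__main__":
    manifest = build_manifest()
    print(len(manifest), "sequences,",
          sum(seq.num_frames for seq in manifest), "frames")
    # Dummy scorer standing in for a real tracker run.
    print(evaluate(manifest, lambda seq: 0.5))
```

The point of the design is that the tracker is loaded and run once over all three tasks, rather than once per modality-specific benchmark, while the per-task breakdown remains available.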
The second major advancement is the reformulation of the unification process itself. Instead of the traditional parallel approach, the researchers propose a “serial” unification method. This involves progressively integrating new tasks into the model. This serial approach naturally aligns with the concept of continual learning (CL), a field focused on enabling models to learn new information without forgetting previously acquired knowledge. By applying CL techniques, the performance degradation, which can now be understood as knowledge forgetting of prior tasks, is significantly mitigated. Experiments conducted on two baseline tracking methods, ViPT and SymTrack, across multiple benchmarks, demonstrate that incorporating continual learning leads to a more stable and superior unification process.
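The difference between the two paradigms can be sketched as a difference in how training samples are scheduled. The example below is a minimal sketch, not the paper's method: the article does not say which continual learning technique is used, so this uses simple experience replay as a stand-in, and the function names, the `replay_ratio` parameter, and the task order are all assumptions made for illustration.

```python
import random


def parallel_schedule(task_data, steps, seed=0):
    """Parallel paradigm: every step draws from the mix of all tasks."""
    rng = random.Random(seed)
    pool = [(task, sample)
            for task, samples in task_data.items()
            for sample in samples]
    return [rng.choice(pool) for _ in range(steps)]


def serial_schedule(task_data, order, steps_per_task,
                    replay_ratio=0.25, seed=0):
    """Serial paradigm: tasks are introduced one stage at a time.

    Experience replay (an assumed stand-in for the paper's CL technique):
    with probability `replay_ratio`, a step revisits a sample from an
    earlier task instead of the current one, to limit forgetting.
    Returns one list of (task, sample) pairs per stage.
    """
    rng = random.Random(seed)
    replay_buffer = []   # samples from tasks that are already learned
    stages = []
    for task in order:
        stage = []
        for _ in range(steps_per_task):
            if replay_buffer and rng.random() < replay_ratio:
                stage.append(rng.choice(replay_buffer))
            else:
                stage.append((task, rng.choice(task_data[task])))
        stages.append(stage)
        # Once a stage ends, its task becomes available for replay.
        replay_buffer.extend((task, s) for s in task_data[task])
    return stages


if __name__ == "__main__":
    # Toy data: sample identifiers only, no real frames.
    data = {t: [f"{t}_{i}" for i in range(10)]
            for t in ("RGBT", "RGBD", "RGBE")}
    # The task order here is arbitrary, not taken from the paper.
    for i, stage in enumerate(serial_schedule(data, ["RGBD", "RGBE", "RGBT"], 8)):
        print(f"stage {i}:", sorted({task for task, _ in stage}))
```

In the serial schedule, the first stage sees only its own task and each later stage sees the new task plus replayed samples from earlier ones, which is what lets degradation be treated as forgetting of prior tasks.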
The study also provides insight into why performance degrades after unification. One key finding is that degradation is negatively correlated with the network’s capacity: larger networks tend to degrade less. This suggests that models with more parameters are better equipped to handle the complexity of integrating diverse multi-modal knowledge. The researchers also found that the level of degradation varies across tasks because the auxiliary modalities differ from RGB to different degrees. For instance, RGBT tracking, which combines RGB with thermal data, experiences greater degradation than RGBD (RGB+Depth) or RGBE (RGB+Event) tracking. The authors attribute this to thermal data being more dissimilar to RGB than depth or event data are. Both findings offer guidance for future multi-modal vision research.
In summary, this work presents UniBench300 as a vital tool for consistent and efficient evaluation in multi-modal tracking, and it champions a serial unification paradigm powered by continual learning to overcome performance degradation. These contributions pave the way for more robust and adaptable multi-modal visual object tracking systems.


