TLDR: COXNet is a new framework for detecting tiny objects in RGBT (visible and thermal) drone imagery. It uses a Cross-Layer Fusion Module to combine high-level visible and low-level thermal features, a Dynamic Alignment and Scale Refinement module to correct misalignments and handle varying object sizes, and a GeoShape-based label assignment for precise localization. This approach significantly improves detection accuracy for small, occluded objects while maintaining efficiency, making it suitable for real-time drone applications.
Detecting small objects in images captured by drones, especially when combining visible light (RGB) and thermal infrared data (RGBT), presents a significant challenge in computer vision. These tiny objects are hard to spot: they occupy only a handful of pixels, blend into cluttered backgrounds, and are further obscured by spatial misalignment between the visible and thermal cameras, low-light conditions, and occlusion. Traditional methods often struggle to effectively combine the complementary information from these two different types of imagery.
Addressing these critical issues, researchers have introduced a novel framework called COXNet. This new system is specifically designed for RGBT tiny object detection and brings three key innovations to the forefront, aiming to improve accuracy and efficiency in complex environments like those encountered by drones.
Cross-Layer Fusion for Enhanced Detail
The first innovation is the Cross-Layer Fusion Module (CLFM). Unlike conventional approaches that merge features from similar processing stages, CLFM intelligently combines high-level visible features with low-level thermal features. This unique fusion strategy enhances both semantic understanding (what the object is) and spatial accuracy (where the object is). It achieves this by using a wavelet-based alignment technique, which effectively separates and combines different frequency components of the images. This allows COXNet to precisely align and fuse complementary information from visible and thermal modalities, preserving fine details crucial for tiny objects while reducing computational complexity.
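The frequency-separation idea behind CLFM can be sketched with a one-level Haar wavelet transform. The hard band swap below (low-frequency band from the thermal map, detail bands from the visible map) is purely illustrative; COXNet learns its fusion rather than hard-swapping sub-bands, and all function names here are hypothetical.

```python
def haar2d(x):
    """One-level 2D Haar transform of an even-sized grid (list of lists).
    Returns (LL, LH, HL, HH) sub-bands at half resolution."""
    h, w = len(x), len(x[0])
    LL, LH, HL, HH = [], [], [], []
    for i in range(0, h, 2):
        ll, lh, hl, hh = [], [], [], []
        for j in range(0, w, 2):
            a, b = x[i][j], x[i][j + 1]
            c, d = x[i + 1][j], x[i + 1][j + 1]
            ll.append((a + b + c + d) / 4)  # low-frequency average
            lh.append((a - b + c - d) / 4)  # horizontal detail
            hl.append((a + b - c - d) / 4)  # vertical detail
            hh.append((a - b - c + d) / 4)  # diagonal detail
        LL.append(ll); LH.append(lh); HL.append(hl); HH.append(hh)
    return LL, LH, HL, HH

def inverse_haar2d(LL, LH, HL, HH):
    """Invert the transform above, restoring full resolution."""
    h, w = len(LL), len(LL[0])
    out = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            ll, lh, hl, hh = LL[i][j], LH[i][j], HL[i][j], HH[i][j]
            out[2 * i][2 * j]         = ll + lh + hl + hh
            out[2 * i][2 * j + 1]     = ll - lh + hl - hh
            out[2 * i + 1][2 * j]     = ll + lh - hl - hh
            out[2 * i + 1][2 * j + 1] = ll - lh - hl + hh
    return out

def fuse(visible, thermal):
    """Toy cross-modal fusion: low-frequency content from the thermal map,
    high-frequency detail from the visible map. Illustrative only."""
    _, v_lh, v_hl, v_hh = haar2d(visible)
    t_ll, _, _, _ = haar2d(thermal)
    return inverse_haar2d(t_ll, v_lh, v_hl, v_hh)
```

Because the transform is exactly invertible, separating and recombining sub-bands loses no information, which is what lets a wavelet-based fusion preserve the fine detail that tiny objects depend on.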
Dynamic Alignment and Scale Refinement
The second core component is the Dynamic Alignment and Scale Refinement (DASR) module. This module is crucial for correcting spatial misalignments between the visible and thermal data and for handling objects of varying sizes. DASR consists of two parts: the Adaptive Alignment Module (AAM) and the Dynamic Scale Refinement (DSR) mechanism. AAM dynamically adjusts the positions of visible and thermal features at a pixel level, ensuring they correspond accurately. DSR, on the other hand, uses different-sized convolution kernels and dynamic weighting to adjust feature scales, effectively capturing both fine-grained details and broader contextual information. This dual approach ensures precise alignment and robust multi-scale feature adjustment, which is particularly beneficial for drone-based detection in challenging conditions.
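The two DASR ideas, per-pixel alignment and multi-scale blending, can be sketched in one dimension. In COXNet the alignment offsets and branch weights are learned; here the offsets are given and the weighting heuristic (a softmax over branch "energy") is an assumption made for illustration. All names are hypothetical.

```python
import math

def align(x, offsets):
    """Toy AAM: resample x at positions i + offsets[i] (nearest neighbour).
    COXNet predicts such offsets per pixel; here they are supplied."""
    n = len(x)
    return [x[min(n - 1, max(0, round(i + o)))]
            for i, o in zip(range(n), offsets)]

def smooth_1d(x, k):
    """Moving average with window size k (odd): a stand-in for a
    convolution branch with a k-sized kernel."""
    r = k // 2
    out = []
    for i in range(len(x)):
        window = x[max(0, i - r):min(len(x), i + r + 1)]
        out.append(sum(window) / len(window))
    return out

def dynamic_scale_refine(x, kernel_sizes=(1, 3, 5)):
    """Toy DSR: run parallel branches with different receptive fields,
    then blend them with weights derived from the signal itself."""
    branches = [smooth_1d(x, k) for k in kernel_sizes]
    energies = [sum(abs(v) for v in b) for b in branches]
    exps = [math.exp(e - max(energies)) for e in energies]
    weights = [e / sum(exps) for e in exps]
    return [sum(w * b[i] for w, b in zip(weights, branches))
            for i in range(len(x))]
```

The small-kernel branch preserves fine detail while the larger kernels gather context; the dynamic weights decide, per input, how much of each to keep.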
Optimized Label Assignment for Precision
Finally, COXNet introduces an optimized label assignment strategy that utilizes the GeoShape Similarity Measure. Traditional methods often rely on Intersection over Union (IoU) for assigning labels, which can be overly sensitive to small shifts, especially for tiny objects. The GeoShape Similarity Measure is more robust because it considers not only the overlap but also the spatial and shape characteristics of the bounding boxes. This ensures a more accurate and adaptive assignment process, significantly improving the localization accuracy of tiny objects even under difficult circumstances.
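The paper's exact GeoShape formula is not reproduced here, but the intuition can be shown by contrasting plain IoU with an illustrative similarity that mixes center distance and shape agreement. The `geoshape_like` function and its equal weighting are assumptions for demonstration only.

```python
import math

def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def geoshape_like(a, b):
    """Illustrative similarity combining center distance and
    width/height agreement; not the paper's exact formula."""
    # Center-distance term, normalized by the enclosing box diagonal.
    cax, cay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    cbx, cby = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    ex1, ey1 = min(a[0], b[0]), min(a[1], b[1])
    ex2, ey2 = max(a[2], b[2]), max(a[3], b[3])
    diag = math.hypot(ex2 - ex1, ey2 - ey1)
    dist = 1.0 - math.hypot(cbx - cax, cby - cay) / diag
    # Shape term: how closely the widths and heights agree.
    wa, ha = a[2] - a[0], a[3] - a[1]
    wb, hb = b[2] - b[0], b[3] - b[1]
    shape = (min(wa, wb) / max(wa, wb)) * (min(ha, hb) / max(ha, hb))
    return 0.5 * dist + 0.5 * shape

# A 2-pixel shift of a 6x6 box collapses IoU, but barely moves
# a measure built on distance and shape:
gt, pred = (10, 10, 16, 16), (12, 12, 18, 18)
print(round(iou(gt, pred), 3))            # 0.286
print(round(geoshape_like(gt, pred), 3))  # 0.875
```

For a 6-pixel box, a 2-pixel shift is tiny in absolute terms yet drops IoU below 0.3, which is why overlap-only assignment penalizes tiny objects so harshly.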
Performance and Efficiency
Extensive experiments were conducted on several challenging datasets, including RGBTDronePerson, VTUAV-det, and NII-CU. The results consistently show that COXNet significantly outperforms existing state-of-the-art methods. For instance, on the RGBTDronePerson dataset, COXNet achieved a notable 3.32% improvement in mAP50 (mean Average Precision at an IoU threshold of 0.5) over previous leading models. Crucially, despite its enhanced accuracy, COXNet maintains competitive efficiency, making it suitable for real-time applications in resource-constrained environments like drone-based surveillance. This balance of high detection accuracy and modest computational demand positions COXNet as a leading solution for RGBT tiny object detection.
The effectiveness of COXNet’s architectural innovations, including the DASR and CLFM modules, is evident in its ability to enhance detection capabilities while managing resource requirements. This makes it a strong choice for real-world scenarios where both precision and real-time performance are paramount. For more technical details, refer to the full research paper.


