TLDR: This research introduces the SOREC dataset, a new benchmark with 100,000 pairs of referring expressions and bounding boxes for extremely small objects in driving scenarios. It also proposes PIZA, a progressive-iterative zooming adapter for parameter-efficient fine-tuning. PIZA lets a model progressively zoom in on and localize small objects, which significantly improves accuracy on the SOREC dataset while training far fewer parameters than full model fine-tuning.
Referring Expression Comprehension (REC) is a fascinating area of artificial intelligence that teaches computers to find specific objects in an image based on a natural language description. Imagine telling a self-driving car, “Find the small red traffic cone near the broken white line,” and it accurately points it out. While REC models have become quite good at this, they still face a significant hurdle: accurately identifying and localizing extremely small objects.
This challenge is particularly critical in real-world applications like autonomous driving, where missing a small obstacle or traffic sign could have serious consequences. Current REC datasets and methods often struggle with these tiny objects, leading to a gap in performance.
A new research paper, “Referring Expression Comprehension for Small Objects,” by Kanoko Goto, Takumi Hirose, Mahiro Ukai, Shuhei Kurita, and Nakamasa Inoue, tackles this very problem. The researchers introduce two major contributions to advance the field: a new dataset specifically designed for small objects and a novel method for training models to find them more effectively.
Introducing the SOREC Dataset
The first contribution is the Small Object REC (SOREC) dataset. This extensive dataset comprises 100,000 pairs of natural language descriptions and corresponding bounding boxes for extremely small objects found in various driving environments, including roads, highways, rural areas, and off-road scenes. What makes SOREC unique is the size of these objects; a typical bounding box in the dataset occupies only about 0.05% of the entire image area. This is significantly smaller than objects in existing popular REC datasets like RefCOCO, where objects are generally much larger.
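To get a feel for how small 0.05% of an image is, the short calculation below computes the fraction of an image covered by a bounding box. The image resolution and box coordinates are hypothetical values chosen for illustration; they are not taken from the SOREC dataset.

```python
def box_area_fraction(box, image_width, image_height):
    """Return the fraction of the image area covered by an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    # Clamp to zero so a degenerate box never yields a negative area.
    box_area = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    return box_area / (image_width * image_height)


# Hypothetical example: a 43 x 24 pixel box in a 1920 x 1080 image.
fraction = box_area_fraction((100, 200, 143, 224), 1920, 1080)
print(f"{fraction:.4%}")  # roughly 0.05% of the image
```

Under these assumed numbers, an object covering about 0.05% of the frame is only a few dozen pixels on each side, which is why a single forward pass over the whole image can easily miss it.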
To accurately describe these tiny targets, the referring expressions in SOREC are also much longer and more detailed, averaging 25.5 words compared to about 3.5 words in other datasets. This requires models to understand more complex language and spatial relationships to pinpoint the correct object. The dataset was created using a semi-automatic process, combining advanced segmentation tools with human crowdsourcing for quality control, ensuring high-quality annotations for these challenging small objects.
The Progressive-Iterative Zooming Adapter (PIZA)
The second key innovation is the Progressive-Iterative Zooming Adapter (PIZA), an adapter module designed for parameter-efficient fine-tuning. In simple terms, PIZA allows existing large vision-language models, like GroundingDINO, to learn how to “zoom in” on small objects progressively and iteratively without updating all of the model's weights. Only the small adapter is trained, which makes the fine-tuning process much more efficient.
PIZA works by modeling the object localization as a search process, where the model predicts a sequence of increasingly tighter bounding boxes, effectively zooming into the target. It learns “zooming-step embeddings” that guide this process, deciding whether to continue zooming or to stop when the object is localized. This autoregressive approach means the model uses its previous “zoom” to inform the next, leading to precise localization of tiny objects.
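The control flow of such a zooming search can be sketched as follows. This is a minimal illustration, not the authors' implementation: `predict_step` is a hypothetical stand-in for the adapted model, assumed to return a box in coordinates relative to the current view plus a stop flag. In PIZA itself this behavior is learned inside the model through the zooming-step embeddings.

```python
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in image pixels


def to_absolute(view: Box, rel_box: Box) -> Box:
    """Map a box given in [0, 1] coordinates of `view` back to image pixels."""
    vx1, vy1, vx2, vy2 = view
    w, h = vx2 - vx1, vy2 - vy1
    rx1, ry1, rx2, ry2 = rel_box
    return (vx1 + rx1 * w, vy1 + ry1 * h, vx1 + rx2 * w, vy1 + ry2 * h)


def progressive_zoom(
    predict_step: Callable[[Box, int], Tuple[Box, bool]],
    image_size: Tuple[int, int],
    max_steps: int = 3,
) -> Box:
    """Repeatedly narrow the view until the predictor signals stop."""
    width, height = image_size
    view: Box = (0.0, 0.0, float(width), float(height))  # start with the full image
    for step in range(max_steps):
        # Hypothetical model call: each prediction depends on the previous view.
        rel_box, stop = predict_step(view, step)
        view = to_absolute(view, rel_box)
        if stop:
            break
    return view


def toy_predictor(view: Box, step: int) -> Tuple[Box, bool]:
    """Toy stand-in: always zoom into the central half, stop on the third step."""
    return (0.25, 0.25, 0.75, 0.75), step == 2


result = progressive_zoom(toy_predictor, image_size=(1000, 800))
print(result)  # (437.5, 350.0, 562.5, 450.0)
```

Each step halves the view in both dimensions here, so three steps shrink the searched region to 1/64 of the original image area. A real predictor would choose the region from the image content and the referring expression rather than from a fixed rule.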
Experimental Success and Efficiency
The researchers applied PIZA to GroundingDINO, a powerful pre-trained model, and conducted a series of experiments on the SOREC dataset. The results were impressive: PIZA significantly improved accuracy across all tested parameter-efficient fine-tuning methods (CoOp, LoRA, and Adapter+). Notably, PIZA-Adapter+ achieved the best performance, even surpassing full fine-tuning while using drastically fewer learnable parameters (3.5 million compared to 173 million).
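The size of that gap is easier to appreciate as a ratio. The snippet below uses only the two parameter counts quoted above; the function and variable names are illustrative.

```python
def trainable_fraction(trainable_params, total_params):
    """Return the share of parameters that are actually updated."""
    return trainable_params / total_params


piza_adapter_params = 3.5e6    # learnable parameters reported for PIZA-Adapter+
full_finetune_params = 173e6   # learnable parameters reported for full fine-tuning

fraction = trainable_fraction(piza_adapter_params, full_finetune_params)
print(f"{fraction:.1%} of the parameters, about {1 / fraction:.0f}x fewer")
```

In other words, PIZA-Adapter+ updates roughly 2% of the parameters that full fine-tuning does.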
This demonstrates that PIZA not only makes models better at finding small objects but also does so with remarkable efficiency, making it practical for real-world deployment. The study also showed that two to three zooming steps were typically sufficient for accurate localization. Furthermore, PIZA outperformed greedy approaches like sliding windows and tile-grid methods, which are computationally expensive and prone to false positives.
Conclusion
The SOREC dataset and the PIZA method represent a significant step forward for Referring Expression Comprehension, particularly for small objects in critical applications like autonomous driving. By providing a dedicated dataset and an efficient, effective fine-tuning approach, this research paves the way for more robust and reliable object localization in complex environments.


