TLDR: STER-VLM is a new computationally efficient framework that enhances Vision-Language Models (VLMs) for automated traffic analysis. It improves performance by decomposing captions into spatial and temporal information, intelligently selecting video frames, using reference-driven understanding, and optimizing prompts. Validated on WTS and BDD datasets, STER-VLM achieves strong results in semantic richness and traffic scene interpretation, demonstrating its effectiveness for resource-efficient and accurate real-world applications.
Vision-language models, or VLMs, are becoming increasingly important for automated traffic analysis. These AI systems combine visual perception with language understanding, enabling tasks like image captioning and visual question answering. However, current VLMs face two significant challenges: they demand substantial computational resources, and they struggle to capture fine-grained spatio-temporal details in complex traffic scenarios.
A new research paper introduces STER-VLM, a computationally efficient framework designed to overcome these limitations. Developed by researchers from various Vietnamese universities, STER-VLM aims to enhance VLM performance for traffic analysis without requiring excessive computing power. The framework achieved a test score of 55.655 in the AI City Challenge 2025 Track 2, supporting its effectiveness for resource-efficient and accurate traffic analysis in real-world applications.
How STER-VLM Works: A Closer Look at its Innovations
STER-VLM incorporates several key innovations to achieve its enhanced performance and efficiency:
Caption Decomposition: The framework tackles the complexity of traffic scene descriptions by breaking down ground-truth captions into two distinct parts: spatial-invariant information (details that remain constant, like environment or object attributes) and temporal-variant information (details that change over time, such as actions or positions). This decomposition helps the model process information more effectively and align with structured annotations.
Temporal Frame Selection and Best-View Filtering: Videos, especially in multi-camera datasets, contain a vast amount of visual data, so STER-VLM selects only the most informative frames. It uniformly samples a limited number of frames from each video phase to balance representational richness with computational efficiency. Additionally, a best-view filtering method favors frames that contain the most relevant visual information, such as the largest bounding boxes for pedestrians or vehicles, to capture critical details.
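To make the frame-selection idea concrete, here is a minimal Python sketch, not the authors' implementation. It assumes each video phase is given as an inclusive range of frame indices and that "best view" means the camera whose bounding box for the target has the largest area at a given frame. The names `Box`, `uniform_sample_indices`, and `best_view` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Box:
    """Axis-aligned bounding box in pixels (hypothetical helper type)."""
    x: float
    y: float
    width: float
    height: float

    @property
    def area(self) -> float:
        # Clamp negative sizes to zero so malformed boxes never win.
        return max(self.width, 0.0) * max(self.height, 0.0)


def uniform_sample_indices(start: int, end: int, k: int) -> List[int]:
    """Return up to k evenly spaced frame indices from the inclusive range [start, end]."""
    if k <= 0 or end < start:
        return []
    n = end - start + 1
    if k >= n:
        # The phase is shorter than the budget: keep every frame.
        return list(range(start, end + 1))
    if k == 1:
        # With a budget of one frame, take the middle of the phase.
        return [start + (n - 1) // 2]
    step = (n - 1) / (k - 1)
    return [start + round(i * step) for i in range(k)]


def best_view(boxes_by_view: Dict[str, Dict[int, Box]], frame: int) -> Optional[str]:
    """Return the view whose box at `frame` has the largest area, or None if no view has a box."""
    best_name: Optional[str] = None
    best_area = 0.0
    for name, boxes in boxes_by_view.items():
        box = boxes.get(frame)
        if box is not None and box.area > best_area:
            best_name, best_area = name, box.area
    return best_name
```

For example, `uniform_sample_indices(0, 9, 4)` spreads four frames across a ten-frame phase, and `best_view` can then be called per sampled frame to decide which camera's image to pass to the model. The actual frame budget and selection criteria used by STER-VLM may differ.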
Reference-Driven Understanding: To improve the model’s ability to understand subtle temporal dynamics and fine-grained differences between frames, STER-VLM leverages a large pre-trained VLM (Qwen2.5-VL-72B) to generate ‘reference captions’ for each frame. These references act as helpful hints, guiding the model’s focus on key details and boosting its confidence, especially when dealing with challenging objects or complex interactions.
Instruction Optimization: The framework employs sophisticated visual and textual prompting techniques. Visual prompts, like colored bounding boxes or gaze lines, direct the model’s attention to specific regions or objects. Textual prompts are augmented with attribute hints (e.g., gender, age, clothing for spatial details; action, attention, location for temporal details) to guide the model towards generating more accurate and contextually appropriate descriptions.
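As a rough illustration of the textual side of this prompting, the sketch below assembles a prompt that mentions a visual marker and lists attribute hints. The hint names (gender, age, clothing; action, attention, location) come from the description above, but the prompt wording, the `build_prompt` function, and its parameters are assumptions made for illustration, not the prompts used in the paper.

```python
# Attribute hints as listed in the article; the grouping into constants is illustrative.
SPATIAL_HINTS = ("gender", "age", "clothing")
TEMPORAL_HINTS = ("action", "attention", "location")


def build_prompt(component: str, subject: str = "pedestrian", box_color: str = "green") -> str:
    """Build a hypothetical text prompt for the spatial or temporal captioning model."""
    if component == "spatial":
        hints = SPATIAL_HINTS
    elif component == "temporal":
        hints = TEMPORAL_HINTS
    else:
        raise ValueError("component must be 'spatial' or 'temporal'")
    return (
        f"The {subject} is marked with a {box_color} bounding box. "
        f"Describe the {subject}, covering: {', '.join(hints)}."
    )


if __name__ == "__main__":
    print(build_prompt("spatial"))
    print(build_prompt("temporal", subject="vehicle", box_color="blue"))
```

Keeping the hint lists separate per component mirrors the split between spatial-invariant and temporal-variant captions, so each specialized model is only asked about the attributes it is expected to describe.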
Two-Stage Training Strategy: STER-VLM utilizes a two-stage fine-tuning approach. Initially, two separate Qwen2.5-VL-7B models are fine-tuned using LoRA (Low-Rank Adaptation) to specialize in processing either spatial-invariant or temporal-variant components. A composition model then refines these outputs into comprehensive captions. This sequential adaptation ensures robust captioning capabilities before adapting to downstream tasks like Visual Question Answering (VQA), leading to more accurate and stable performance.
Impact and Validation
The researchers conducted extensive experiments on the WTS and BDD datasets, which consist of thousands of pedestrian-centric traffic videos with detailed descriptions and question-answer pairs. The results demonstrate substantial gains in semantic richness and traffic scene interpretation. The framework’s ability to balance performance with computational efficiency makes it a promising solution for real-world applications in intelligent transportation systems, contributing to safer roads and more intelligent traffic analysis.
For more in-depth technical details, you can refer to the full research paper: STER-VLM: Spatio-Temporal With Enhanced Reference Vision-Language Models.


