TLDR: A new semantic communication framework for real-time urban traffic surveillance significantly reduces data transmission from edge cameras to the cloud. It uses YOLOv11 to detect vehicles, Vision Transformers (ViT) to convert cropped images into tiny embedding vectors, and then transmits these compact embeddings. On the cloud, an image decoder reconstructs the images, which a multimodal LLM (LLaVA 1.5 7B) then uses to generate traffic descriptions. This method achieves a 99.9% data reduction with only a minor drop in LLM accuracy (from 93% to 89%), making real-time monitoring more efficient over mobile networks.
Real-time urban traffic surveillance is a cornerstone of modern Intelligent Transportation Systems (ITS), crucial for maintaining road safety, optimizing traffic flow, tracking vehicle movements, and preventing collisions in our increasingly smart cities. While deploying cameras across urban environments is a standard practice for monitoring road conditions, integrating these visual feeds with advanced intelligent models presents significant challenges.
One major hurdle is the computational demand of powerful multimodal Large Language Models (LLMs). These models are excellent at interpreting traffic images and generating informative responses, but their sheer size makes them impractical for direct deployment on resource-constrained edge devices like surveillance cameras. This necessitates transmitting visual data from the edge to the cloud for LLM inference, a process often hampered by limited bandwidth, leading to potential delays that compromise real-time performance.
To address this challenge, researchers have proposed a semantic communication framework designed to drastically reduce data transmission overhead. The method involves the following steps:
First, Regions of Interest (RoIs), primarily vehicles, are detected within the camera images using a YOLOv11 model. Once identified, these relevant image segments are cropped. Instead of transmitting these cropped images directly, they are converted into compact embedding vectors using a Vision Transformer (ViT). These highly compressed embeddings are then transmitted to the cloud.
Upon reaching the cloud, an image decoder reconstructs the cropped images from these embedding vectors. Finally, the reconstructed images are processed by a multimodal LLM, specifically LLaVA 1.5 7B, to generate detailed descriptions of traffic conditions.
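The edge-side data flow described above (crop a detected region, encode it into a short vector, send only that vector) can be sketched in a few lines of Python. This is a rough illustration, not the researchers' implementation: the `Box` type, the `crop`, `serialize`, and `deserialize` helpers are hypothetical names, and `toy_encode` is a deliberately simple stand-in (block averaging) for the ViT encoder, used only so the example runs without any model weights.

```python
import struct
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical bounding box, as a detector such as YOLO might report it."""
    x: int
    y: int
    w: int
    h: int


def crop(image, box):
    """Cut the region of interest out of a 2D grayscale image (list of rows)."""
    return [row[box.x:box.x + box.w] for row in image[box.y:box.y + box.h]]


def toy_encode(patch, grid=2):
    """Stand-in for the ViT encoder (assumption): average each grid cell.

    Returns a fixed-length vector of grid * grid floats regardless of patch size.
    """
    h, w = len(patch), len(patch[0])
    vec = []
    for gy in range(grid):
        for gx in range(grid):
            cells = [patch[r][c]
                     for r in range(gy * h // grid, (gy + 1) * h // grid)
                     for c in range(gx * w // grid, (gx + 1) * w // grid)]
            vec.append(sum(cells) / len(cells))
    return vec


def serialize(vec):
    """Pack the embedding as big-endian float32 values for transmission."""
    return struct.pack(f">{len(vec)}f", *vec)


def deserialize(payload):
    """Cloud side: recover the embedding before it is fed to the image decoder."""
    return list(struct.unpack(f">{len(payload) // 4}f", payload))


# Synthetic 8x8 frame and one hypothetical detection.
frame = [[r * 8 + c for c in range(8)] for r in range(8)]
patch = crop(frame, Box(x=2, y=2, w=4, h=4))
payload = serialize(toy_encode(patch))
```

The point of the sketch is the shape of the pipeline: whatever the crop size, only a fixed-length vector crosses the network, and the cloud side rebuilds everything else from it.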
This approach reduces the transmitted data size by 99.9%. The compression costs some accuracy: LLM response accuracy falls from 93% with the original cropped images to 89% with the reconstructed ones, a trade-off that keeps the system practical for real-time traffic surveillance. The framework highlights the efficiency and practicality of ViT and LLM-assisted edge–cloud semantic communication.
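To get a feel for what a 99.9% reduction means in bytes, the arithmetic can be worked through with made-up sizes. The article does not state the crop resolution or the embedding length, so the numbers below (an uncompressed 224x224 RGB crop and a 32-value float32 embedding) are assumptions chosen purely for illustration.

```python
def reduction_percent(raw_bytes, payload_bytes):
    """Percentage of bytes saved by sending the payload instead of the raw data."""
    return 100.0 * (1.0 - payload_bytes / raw_bytes)


def max_payload_bytes(raw_bytes, target_percent):
    """Largest payload that still meets a target reduction percentage."""
    return raw_bytes * (1.0 - target_percent / 100.0)


# Assumed sizes, not taken from the paper.
raw_crop = 224 * 224 * 3   # uncompressed RGB crop: 150,528 bytes
embedding = 32 * 4         # 32 float32 values: 128 bytes

saving = reduction_percent(raw_crop, embedding)
budget = max_payload_bytes(raw_crop, 99.9)
```

Under these assumptions the embedding saves a little over 99.9% of the bytes, and any payload up to roughly 150 bytes per crop would meet the same target.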
The end-to-end pipeline integrates YOLOv11 for vehicle detection, ViT for generating embeddings, a custom image decoder for reconstruction, and LLaVA for language-based traffic descriptions. This system significantly enhances both communication efficiency and AI-driven interpretability for smart city surveillance applications.
Further analysis of the transmission process showed that quantizing the embedding vectors before transmission is more robust to bit errors than standard IEEE 754 floating-point encoding and also shrinks the transmitted data further. For instance, 8-bit quantization achieves good perceptual quality at lower signal-to-noise ratios (SNRs) than IEEE 754 encoding does.
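One way to see why quantized codes tolerate bit errors better than IEEE 754 floats is to flip a single bit in each representation and compare the damage. The sketch below uses plain uniform 8-bit quantization over an assumed value range of [-1, 1]; the paper's exact quantization scheme and range are not given in this article, and all function names are illustrative.

```python
import struct


def quantize(values, lo=-1.0, hi=1.0, bits=8):
    """Map each value in [lo, hi] to an integer code in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    codes = []
    for v in values:
        clipped = min(max(v, lo), hi)
        codes.append(round((clipped - lo) / (hi - lo) * levels))
    return codes


def dequantize(codes, lo=-1.0, hi=1.0, bits=8):
    """Map integer codes back to approximate values in [lo, hi]."""
    levels = (1 << bits) - 1
    return [lo + c / levels * (hi - lo) for c in codes]


def flip_bit_float32(x, bit):
    """Flip one bit (0 = least significant) in the float32 encoding of x."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    (y,) = struct.unpack(">f", struct.pack(">I", raw ^ (1 << bit)))
    return y


def flip_bit_code(code, bit):
    """Flip one bit in an integer quantization code."""
    return code ^ (1 << bit)


embedding = [0.5, -0.25, 0.0, 0.9]   # made-up embedding values
codes = quantize(embedding)
restored = dequantize(codes)

# Worst single-bit error for the first element in each encoding.
bad_code = dequantize([flip_bit_code(codes[0], 7)])[0]   # flip the top bit
bad_float = flip_bit_float32(embedding[0], 30)           # flip the top exponent bit
```

With 8-bit codes, a single flipped bit can move a value by at most about half of the assumed range, whereas a flipped exponent bit in a float32 can turn 0.5 into a number on the order of 10^38. The quantized payload is also one byte per value instead of four.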
In terms of LLM performance, LLaVA 1.5 7B was chosen for its efficiency and ability to generate concise, timely responses, which are crucial for real-time applications. While other models like LLaMA 3.2-11B Vision-Instruct can provide more detailed descriptions, their higher computational requirements and longer inference times make them less suitable for this specific real-time traffic monitoring scenario. The LLaVA model was fine-tuned using a technique called LoRA (Low-Rank Adaptation), which efficiently adapts large models to specific tasks without extensive retraining.
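The core idea of LoRA can be shown without any deep learning library: the pretrained weight matrix W stays frozen, and only a low-rank update B·A, scaled by alpha / r, is trained and added to it. The sketch below uses plain Python lists and hypothetical dimensions; it illustrates the general technique and says nothing about the specific rank or layers used to fine-tune LLaVA in this work.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_weight(W, A, B, alpha, r):
    """Effective weight W + (alpha / r) * B @ A, with W frozen.

    W is d x k, B is d x r, A is r x k. Only A and B are trainable.
    """
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


def trainable_params(d, k, r):
    """Number of trainable values in the low-rank pair (B: d x r, A: r x k)."""
    return r * (d + k)


W = [[1.0, 2.0], [3.0, 4.0]]   # frozen pretrained weights (toy values)
A = [[0.5, 0.5]]               # r x k, with r = 1
B_init = [[0.0], [0.0]]        # B starts at zero, so the model is unchanged at first
B_trained = [[1.0], [0.0]]     # pretend training moved B

unchanged = lora_weight(W, A, B_init, alpha=2, r=1)
adapted = lora_weight(W, A, B_trained, alpha=2, r=1)
```

For a hypothetical 4096 x 4096 layer with rank 8, `trainable_params(4096, 4096, 8)` gives 65,536 trainable values against 16,777,216 in the full matrix, which is why the adaptation is cheap compared with retraining the whole model.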
The data for training and evaluation was collected using the Quanser Interactive Lab, a simulation platform that acts as a digital twin of urban environments and supports diverse, high-quality traffic scenarios. The resulting dataset contains 2,400 cropped images, each paired with a textual caption, which were used to train the image decoder and fine-tune the LLaVA model.
This research provides a strong foundation for integrating Vision Transformers and Large Language Models into semantic communication systems for traffic surveillance. Future work aims to explore a broader range of text-based queries, investigate the framework’s predictive capabilities for anticipating future events, and evaluate other vision-grounded LLMs to further enhance performance. Additionally, incorporating context awareness could further improve inference performance in real-world deployments. For more in-depth information, you can refer to the original research paper.


