Streamlining Urban Traffic Surveillance: A Semantic Communication Approach with Vision Transformers and Language Models

TLDR: A semantic communication framework for real-time urban traffic surveillance sharply reduces the data sent from edge cameras to the cloud. YOLOv11 detects vehicles, a Vision Transformer (ViT) converts each cropped image into a compact embedding vector, and only those embeddings are transmitted. In the cloud, an image decoder reconstructs the cropped images, and a multimodal LLM (LLaVA 1.5 7B) uses them to generate traffic descriptions. The method cuts transmitted data by 99.9% while LLM accuracy drops from 93% to 89%, which makes real-time monitoring over mobile networks more practical.

Real-time urban traffic surveillance is a cornerstone of modern Intelligent Transportation Systems (ITS), crucial for maintaining road safety, optimizing traffic flow, tracking vehicle movements, and preventing collisions in our increasingly smart cities. While deploying cameras across urban environments is a standard practice for monitoring road conditions, integrating these visual feeds with advanced intelligent models presents significant challenges.

One major hurdle is the computational demand of powerful multimodal Large Language Models (LLMs). These models are excellent at interpreting traffic images and generating informative responses, but their sheer size makes them impractical for direct deployment on resource-constrained edge devices like surveillance cameras. This necessitates transmitting visual data from the edge to the cloud for LLM inference, a process often hampered by limited bandwidth, leading to potential delays that compromise real-time performance.

To address this bottleneck, researchers have proposed a semantic communication framework that transmits compact representations of the relevant image content instead of the images themselves. The method works in several steps:

First, Regions of Interest (RoIs), primarily vehicles, are detected within the camera images using a YOLOv11 model. Once identified, these relevant image segments are cropped. Instead of transmitting these cropped images directly, they are converted into compact embedding vectors using a Vision Transformer (ViT). These highly compressed embeddings are then transmitted to the cloud.

Upon reaching the cloud, an image decoder reconstructs the cropped images from these embedding vectors. Finally, the reconstructed images are processed by a multimodal LLM, specifically LLaVA 1.5 7B, to generate detailed descriptions of traffic conditions.
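The edge/cloud split described above can be sketched as follows. This is a minimal illustration and not the authors' code: the four stages are passed in as plain callables, and every name here (`edge_encode`, `cloud_describe`, the `toy_*` stand-ins, the box format) is hypothetical. In a real system the callables would wrap YOLOv11, the ViT encoder, the trained image decoder, and LLaVA.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1); hypothetical box format
Image = List[List[int]]           # toy grayscale image as nested lists
Embedding = List[float]


def crop(image: Image, box: Box) -> Image:
    """Cut one region of interest out of the frame."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def edge_encode(
    frame: Image,
    detect: Callable[[Image], Sequence[Box]],
    embed: Callable[[Image], Embedding],
) -> List[Embedding]:
    """Edge side: detect RoIs, crop them, and return one embedding per crop.

    Only these embeddings would be transmitted, never the pixels.
    """
    return [embed(crop(frame, box)) for box in detect(frame)]


def cloud_describe(
    embeddings: Sequence[Embedding],
    decode: Callable[[Embedding], Image],
    describe: Callable[[Image], str],
) -> List[str]:
    """Cloud side: reconstruct each crop, then describe the reconstruction."""
    return [describe(decode(e)) for e in embeddings]


# Toy stand-ins so the sketch runs; none of these are real models.
def toy_detect(image: Image) -> List[Box]:
    return [(0, 0, 4, 4), (4, 4, 8, 8)]


def toy_embed(region: Image) -> Embedding:
    pixels = [p for row in region for p in row]
    return [sum(pixels) / len(pixels)]        # one number: mean intensity


def toy_decode(embedding: Embedding) -> Image:
    return [[round(embedding[0])] * 4 for _ in range(4)]


def toy_describe(image: Image) -> str:
    return f"region with mean intensity {image[0][0]}"


frame = [[x + y for x in range(8)] for y in range(8)]
payload = edge_encode(frame, toy_detect, toy_embed)   # what the edge would send
captions = cloud_describe(payload, toy_decode, toy_describe)
print(payload)
print(captions)
```

Keeping the stages as injected callables makes the transmission boundary explicit: everything returned by `edge_encode` crosses the network, and nothing else does.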

The approach reduces the transmitted data size by 99.9%. The compression costs some accuracy: LLM response accuracy falls from 93% with the original cropped images to 89% with the reconstructed images, a drop of four percentage points. The authors present this as a practical trade-off for real-time traffic surveillance, and as evidence that ViT and LLM-assisted edge-cloud semantic communication is both efficient and workable.
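To get a feel for what a 99.9% reduction means in bytes, the arithmetic can be worked through as below. The crop size and the payload size are hypothetical values chosen for illustration; the article does not state the actual image dimensions or embedding length, and the baseline in the study may be a compressed image rather than raw pixels.

```python
def reduction_ratio(original_bytes: float, transmitted_bytes: float) -> float:
    """Fraction of data saved: 1 - transmitted / original."""
    if original_bytes <= 0:
        raise ValueError("original_bytes must be positive")
    return 1.0 - transmitted_bytes / original_bytes


def byte_budget(original_bytes: float, reduction: float) -> float:
    """Largest payload that still achieves the given reduction."""
    return original_bytes * (1.0 - reduction)


# Hypothetical crop: 224 x 224 pixels, 3 channels, 1 byte each (assumption).
crop_bytes = 224 * 224 * 3
budget = byte_budget(crop_bytes, 0.999)

# Hypothetical payload: 128 embedding values at 1 byte each (assumption).
payload_bytes = 128
achieved = reduction_ratio(crop_bytes, payload_bytes)

print(f"raw crop: {crop_bytes} bytes, budget at 99.9%: {budget:.1f} bytes")
print(f"a {payload_bytes}-byte payload saves {achieved:.4%}")

# Accuracy trade-off reported in the article, in percentage points.
accuracy_drop_points = 93 - 89
print(f"accuracy drop: {accuracy_drop_points} percentage points")
```

Under these assumptions, a raw crop of about 150 KB leaves a budget of roughly 150 bytes per vehicle, which is why the embedding has to be both short and cheaply encoded.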

End to end, the pipeline combines YOLOv11 for vehicle detection, a ViT for generating embeddings, a custom image decoder for reconstruction, and LLaVA for language-based traffic descriptions. The result is a system that sends far less data over the network while still producing human-readable descriptions of the scene for smart city surveillance applications.

Further analysis of the transmission process showed that quantizing the embedding vectors before encoding is more robust to bit errors than standard IEEE 754 floating-point encoding, and it also reduces the transmitted data size further. For instance, 8-bit quantization can achieve good perceptual quality at lower signal-to-noise ratios (SNRs) than IEEE 754 encoding.
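One intuition for this robustness is that a single flipped bit can change a float32 value by many orders of magnitude (if it lands in the exponent), while the same flip in an 8-bit quantized value can never move the decoded number outside the quantization range. The sketch below illustrates this with a uniform quantizer. It is an assumption-laden toy, not the scheme from the paper: the range `[LO, HI]`, the uniform quantizer, and all function names are hypothetical.

```python
import math
import struct
from typing import Callable, List

LO, HI = -1.0, 1.0  # assumed embedding value range (hypothetical)


def encode_float32(values: List[float]) -> bytes:
    """IEEE 754 single precision, 4 bytes per value."""
    return struct.pack(f"<{len(values)}f", *values)


def decode_float32(data: bytes) -> List[float]:
    return list(struct.unpack(f"<{len(data) // 4}f", data))


def encode_uint8(values: List[float]) -> bytes:
    """Uniform 8-bit quantization of values in [LO, HI], 1 byte per value."""
    scale = 255.0 / (HI - LO)
    return bytes(min(255, max(0, round((v - LO) * scale))) for v in values)


def decode_uint8(data: bytes) -> List[float]:
    step = (HI - LO) / 255.0
    return [LO + q * step for q in data]


def worst_single_bit_error(
    encode: Callable[[List[float]], bytes],
    decode: Callable[[bytes], List[float]],
    value: float,
) -> float:
    """Largest decoding error caused by flipping any one bit of the encoding."""
    clean = encode([value])
    reference = decode(clean)[0]
    worst = 0.0
    for bit in range(len(clean) * 8):
        corrupted = bytearray(clean)
        corrupted[bit // 8] ^= 1 << (bit % 8)
        received = decode(bytes(corrupted))[0]
        if math.isnan(received):
            return math.inf  # treat NaN as unbounded error
        worst = max(worst, abs(received - reference))
    return worst


err_f32 = worst_single_bit_error(encode_float32, decode_float32, 0.5)
err_u8 = worst_single_bit_error(encode_uint8, decode_uint8, 0.5)
print(f"float32 worst single-bit error: {err_f32:.3e}")
print(f"uint8   worst single-bit error: {err_u8:.3f}")
```

With these assumptions the quantized encoding is also four times smaller (1 byte per value instead of 4), which matches the direction of the size saving described above.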

For the language model, LLaVA 1.5 7B was chosen for its efficiency and its ability to generate concise, timely responses, both of which matter for real-time applications. Other models such as LLaMA 3.2-11B Vision-Instruct can provide more detailed descriptions, but their higher computational requirements and longer inference times make them less suitable for this real-time traffic monitoring scenario. The LLaVA model was fine-tuned with LoRA (Low-Rank Adaptation), which adapts a large model to a specific task by training small low-rank update matrices while the original weights stay frozen, avoiding a full retraining.
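The idea behind LoRA can be sketched in a few lines: a frozen weight matrix W is augmented with a trainable low-rank product, so the adapted layer computes y = W x + (alpha / r) · B (A x). The code below is a plain-Python illustration of that formula, not the fine-tuning setup used in the study; the rank, `alpha`, and the 4096 × 4096 layer size are hypothetical values, since the article does not report the configuration.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Matrix-vector product for nested-list matrices."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def lora_forward(x: Vector, W: Matrix, A: Matrix, B: Matrix, alpha: float) -> Vector:
    """y = W x + (alpha / r) * B (A x), where r is the number of rows of A.

    W (d_out x d_in) stays frozen; only A (r x d_in) and B (d_out x r) train.
    """
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters in the low-rank pair A and B."""
    return rank * (d_out + d_in)


# Tiny example: 2 x 2 identity weight, rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
x = [1.0, 2.0]

# B starts at zero, so the adapted layer initially equals the frozen layer.
y_init = lora_forward(x, W, A, [[0.0], [0.0]], alpha=2.0)

# After "training", a nonzero B shifts the output.
y_tuned = lora_forward(x, W, A, [[0.5], [0.0]], alpha=2.0)

# Hypothetical layer size and rank, for a sense of scale.
full = 4096 * 4096
low_rank = lora_param_count(4096, 4096, rank=8)
print(y_init, y_tuned)
print(f"trainable fraction: {low_rank / full:.4%}")
```

For the hypothetical layer above, the adapter trains 65,536 parameters instead of about 16.8 million, which is the kind of saving that makes fine-tuning a 7B model on a small dataset feasible.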

The data for training and evaluation was collected with the Quanser Interactive Lab, a simulation platform that acts as a digital twin of urban environments and supports diverse, high-quality traffic scenarios. From it, the researchers gathered 2400 cropped images, each paired with a textual caption, to train the image decoder and fine-tune the LLaVA model.

This research provides a foundation for integrating Vision Transformers and Large Language Models into semantic communication systems for traffic surveillance. Future work aims to explore a broader range of text-based queries, investigate the framework’s ability to anticipate future events, and evaluate other vision-grounded LLMs. Incorporating context awareness could also improve inference performance in real-world deployments. For more in-depth information, you can refer to the original research paper.
