TL;DR: This research compares two federated learning approaches for violence detection: LoRA-tuned Vision-Language Models (VLMs) and personalized 3D Convolutional Neural Networks (CNN3D). Both achieve over 90% accuracy, but CNN3D uses significantly less energy while slightly outperforming VLMs on some metrics. VLMs are better for complex contextual reasoning. The study proposes a hybrid model using efficient CNNs for routine tasks and selective VLM activation for complex scenarios, emphasizing sustainable and privacy-aware AI in video surveillance.
Video surveillance systems are increasingly relying on advanced artificial intelligence (AI) to detect and analyze violent incidents in public spaces. However, traditional centralized approaches, where all video data is sent to a central server, raise significant privacy concerns. Additionally, the computational and environmental costs of deploying large AI models at scale are becoming a major point of scrutiny for researchers and regulators alike.
A recent research paper, “Frugal Federated Learning for Violence Detection: A Comparison of LoRA-Tuned VLMs and Personalized CNNs,” explores innovative solutions to these challenges. The study, conducted by Sébastien Thuau, Siba Haidar, Ayush Bajracharya, and Rachid Chelouah, delves into federated learning, a promising approach that allows AI models to be trained across multiple local devices without sharing sensitive raw data. This method enhances privacy and can reduce network load.
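To make the federated idea concrete, here is a minimal sketch of federated averaging (FedAvg), the canonical aggregation step in federated learning: each surveillance site trains locally on its own video, and only model parameters, never raw footage, are sent to the server for weighted averaging. The function names and flat parameter lists are illustrative, not taken from the paper.

```python
def fed_avg(client_weights, client_sizes):
    """Average model parameters across clients, weighting each client
    by the size of its local dataset (FedAvg-style aggregation).

    client_weights: list of per-client parameter lists (same length each)
    client_sizes:   number of local training samples per client
    """
    total = sum(client_sizes)
    num_params = len(client_weights[0])
    return [
        # Weighted sum of parameter i across all clients.
        sum(w[i] * (n / total) for w, n in zip(client_weights, client_sizes))
        for i in range(num_params)
    ]
```

Only these averaged parameters leave each site, which is what gives federated training its privacy and bandwidth advantages over shipping video to a central server.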
The core of the research compares two distinct strategies for violence detection within a federated learning framework: Vision-Language Models (VLMs) fine-tuned with a technique called Low-Rank Adaptation (LoRA), and personalized 3D Convolutional Neural Networks (CNN3D). VLMs are powerful models that can process both visual and textual information, while CNNs are a type of neural network particularly effective for image and video analysis.
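The LoRA idea can be sketched in a few lines: rather than updating a large frozen weight matrix W, training touches only two small low-rank matrices whose product is added to W's output, which is why fine-tuning a multi-billion-parameter VLM becomes tractable on edge-scale hardware. This is a generic illustration of the technique, with assumed names and initialization, not the paper's implementation.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer with a trainable low-rank (LoRA) update.

    Effective weight is W + scale * (A @ B), where A and B are small:
    A is (d_out, r), B is (r, d_in), with rank r << min(d_out, d_in).
    """

    def __init__(self, W, r, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                        # frozen pretrained weight
        d_out, d_in = W.shape
        self.A = np.zeros((d_out, r))     # zero init: no change at start
        self.B = rng.normal(size=(r, d_in)) * 0.01
        self.scale = alpha / r

    def forward(self, x):
        # Base (frozen) path plus the low-rank adapter path.
        return self.W @ x + self.scale * (self.A @ (self.B @ x))
```

Because A starts at zero, the adapted model initially behaves exactly like the pretrained one; only A and B are trained and exchanged, shrinking both compute and communication in the federated setting.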
The researchers used LLaVA-7B, a VLM with roughly seven billion parameters, and a more compact CNN3D model with 65.8 million parameters as their representative cases. They evaluated these models not only on their accuracy in detecting violence but also on their calibration (how well their predicted probabilities match true outcomes) and, crucially, their energy consumption and carbon dioxide (CO2) emissions. The experiments were designed to simulate realistic conditions where data is not uniformly distributed across different surveillance locations (known as non-IID settings).
The findings were compelling. Both the LoRA-tuned VLMs and the personalized CNN3Ds achieved high accuracy, exceeding 90% in violence detection. Interestingly, the more compact CNN3D model slightly outperformed the LoRA-tuned VLMs in terms of ROC AUC (a measure of a model's ability to distinguish between classes) and log loss (a measure of how closely predicted probabilities match the actual outcomes), all while consuming significantly less energy. For instance, the CNN3D training consumed only 240 Watt-hours (Wh) and emitted 10.1 grams of CO2 equivalent, which is less than half the energy and CO2 footprint of the LoRA fine-tuning process (570 Wh and 24 grams CO2e).
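For readers unfamiliar with the two reported metrics, both can be computed from scratch in a few lines. The toy labels and scores below are made up for illustration; only the metric definitions are standard.

```python
import math

def log_loss(y_true, p, eps=1e-12):
    """Mean negative log-likelihood of binary labels under probabilities p.
    Lower is better; confident wrong predictions are penalized heavily."""
    return -sum(
        y * math.log(max(pi, eps)) + (1 - y) * math.log(max(1 - pi, eps))
        for y, pi in zip(y_true, p)
    ) / len(y_true)

def roc_auc(y_true, scores):
    """Probability that a randomly chosen positive example is scored
    above a randomly chosen negative one (ties count as 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

ROC AUC rewards correct ranking of violent versus non-violent clips, while log loss additionally rewards well-calibrated confidence, which is why the paper reports both.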
However, the study also highlighted the unique strengths of VLMs. While more resource-intensive, VLMs remain highly favorable for tasks requiring contextual reasoning and multimodal inference—meaning they can understand complex scenarios by combining visual cues with descriptive prompts. This makes them valuable for situations that demand nuanced understanding beyond simple classification.
The authors propose a “hybrid model” as an optimal solution. This framework suggests using lightweight CNNs for routine violence classification tasks due to their efficiency and strong performance. For more complex or descriptive scenarios, where deeper contextual understanding is needed, VLMs could be selectively activated. This approach offers a balanced way to achieve responsible, resource-aware AI in video surveillance, with potential extensions for real-time, multimodal, and environmentally conscious systems.
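The routing logic behind such a hybrid system can be sketched simply: run the cheap CNN3D on every clip and escalate to the costly VLM only when the CNN's confidence falls in an ambiguous band. The thresholds and function names below are assumptions for illustration, not values from the paper.

```python
def classify_clip(clip, cnn_predict, vlm_describe, low=0.2, high=0.8):
    """Hybrid routing sketch: accept confident CNN scores directly,
    and selectively activate the VLM only for uncertain clips.

    cnn_predict(clip) -> probability of violence (cheap, always run)
    vlm_describe(clip) -> contextual label/description (expensive, rare)
    """
    p_violence = cnn_predict(clip)
    if p_violence >= high:
        return ("violent", p_violence, "cnn3d")
    if p_violence <= low:
        return ("non-violent", p_violence, "cnn3d")
    # Uncertain band: escalate to the VLM for contextual reasoning.
    return (vlm_describe(clip), p_violence, "vlm")
```

Because most surveillance footage is routine, the expensive VLM path fires only on the small fraction of ambiguous clips, which is the source of the energy savings the authors anticipate.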
This research is a significant step forward, being the first comparative study of its kind to emphasize energy efficiency and environmental metrics in federated violence detection using these two distinct AI model types. It provides a reproducible baseline for future work in sustainable and privacy-preserving AI for video surveillance. You can read the full paper here.