TLDR: A new framework called CapRecover can directly extract sensitive information like image labels and captions from the internal data (intermediate features) of Vision-Language Models (VLMs), even without reconstructing the original image. This poses a significant privacy risk, especially in cloud-edge AI deployments where data is split between devices and the cloud. The research also proposes a simple noise-based defense mechanism that can effectively prevent this information leakage without additional training costs.
As artificial intelligence, particularly Vision-Language Models (VLMs), becomes more integrated into our daily lives through various applications, a common deployment strategy involves splitting these complex models. This means part of the model, like the visual encoder, runs on your personal device, while only intermediate data, or ‘features,’ are sent to the cloud for further processing. While this setup helps reduce the amount of data sent over the internet, it introduces a significant privacy concern: these intermediate features can contain sensitive information.
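To make the split concrete, here is a minimal sketch of that deployment pattern. Everything in it is a hypothetical stand-in: `edge_encoder` and `cloud_head` are toy functions rather than a real VLM, and the JSON payload is only an assumed wire format. The point is simply that the features are the only thing that leaves the device, so they are exactly what an eavesdropper would get.

```python
import json

def edge_encoder(image):
    """Toy stand-in for the on-device visual encoder: one average per pixel row."""
    return [sum(row) / len(row) for row in image]

def cloud_head(features):
    """Toy stand-in for the cloud-side part of the model: a simple brightness rule."""
    return "bright" if sum(features) / len(features) > 0.5 else "dark"

def run_split_inference(image):
    features = edge_encoder(image)                 # computed locally, image stays on device
    payload = json.dumps({"features": features})   # the only data sent over the network
    received = json.loads(payload)["features"]     # what the cloud (or an interceptor) sees
    return payload, cloud_head(received)

# A tiny 2x3 "image" with pixel intensities in [0, 1].
image = [[0.9, 0.8, 0.7], [0.6, 0.9, 1.0]]
payload, prediction = run_split_inference(image)
print("transmitted:", payload)
print("cloud result:", prediction)
```

The raw pixels never appear in `payload`, yet the transmitted numbers still summarize the image, which is the kind of exposure the rest of this article is about.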
Previous attempts to understand what these features reveal mostly focused on reconstructing the input image, and the reconstructions were often too blurry to show the original content clearly. A crucial question remained largely unanswered: could an attacker directly recover high-level semantic content, such as image labels or captions, without needing to reconstruct the image at all?
Introducing CapRecover: A New Approach to Data Inversion
A new research paper titled “CapRecover: A Cross-Modality Feature Inversion Attack Framework on Vision Language Models” by Kedong Xiu and Saiqian Zhang addresses this gap. The authors propose CapRecover, a framework that decodes semantic information directly from these intermediate features. It can infer what an image is about, or even generate a caption for it, without reconstructing a single pixel of the original image.
CapRecover isn’t just limited to Vision-Language Models; it can also be used to reverse-engineer traditional neural networks commonly used for computer vision tasks, such as ResNet and ViT models.
How Effective is CapRecover?
The researchers evaluated CapRecover across several widely used datasets and models, and it recovered both image labels and captions accurately. On the CIFAR-10 dataset, it achieved 92.71% Top-1 accuracy for label recovery. When generating captions from ResNet50’s intermediate features on the COCO2017 dataset, it produced fluent and relevant captions, with ROUGE-L scores reaching up to 0.52.
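For readers unfamiliar with the two metrics, the sketch below shows how they are typically computed. It is an illustration only: the article does not say which ROUGE-L variant or tokenizer the paper used, so the F1 form with lowercase whitespace tokenization is an assumption, and the labels and captions are made-up examples.

```python
def top1_accuracy(predicted, actual):
    """Fraction of samples whose single best predicted label matches the true label."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """ROUGE-L as the F1 of LCS-based precision and recall (assumed variant)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical recovered labels versus ground truth: 3 of 4 are correct.
print(top1_accuracy(["cat", "dog", "ship", "dog"], ["cat", "dog", "ship", "frog"]))

# Hypothetical recovered caption versus reference caption.
print(round(rouge_l_f1("a dog runs on the beach", "a dog is running on the beach"), 3))
```

In the caption example the longest common subsequence is "a dog on the beach" (5 tokens), so a recovered caption can score well on ROUGE-L while still differing in wording from the reference.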
An interesting finding from their analysis of ResNet-based models is that deeper layers encode significantly more semantic information, while shallower layers contribute minimally to this kind of leakage. In other words, the further an image has been processed through the network, the more ‘understandable’ the resulting features become to an attacker.
Understanding the Threat
The threat model considered by the researchers assumes an attacker can intercept or obtain these intermediate visual features. This could happen if data is transmitted from your device to a cloud server, or if a malicious program on your device extracts these features from memory. The attacker doesn’t need access to the original image or the final output of the AI model; just these intermediate steps are enough to infer sensitive content.
A Simple Yet Effective Defense
Recognizing the privacy implications, the researchers also propose a straightforward protection approach. It adds random noise to the intermediate image features at each intermediate layer of the model and then removes that same noise in the subsequent layer. Their experiments show that this technique effectively prevents information leakage without any additional training cost for the model. Because the noise is handled locally and is neither stored nor transmitted, the approach is practical for edge-cloud systems.
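The add-then-remove idea can be sketched on a toy model. This is not the paper's implementation: the three layers, the Gaussian noise, and its scale are all assumptions chosen for illustration, and the article does not describe how the paper samples the noise or keeps it private. The sketch only shows why the final output is unchanged while the features visible between layers are perturbed.

```python
import math
import random

# Toy "layers" standing in for real network layers (hypothetical).
def layer_a(x):
    return [2.0 * v + 1.0 for v in x]

def layer_b(x):
    return [v * v for v in x]

def layer_c(x):
    return [sum(x)]

LAYERS = [layer_a, layer_b, layer_c]

def forward_plain(x):
    """Unprotected forward pass."""
    for layer in LAYERS:
        x = layer(x)
    return x

def forward_with_noise(x, rng, scale=5.0):
    """Add noise after each intermediate layer and remove it before the next one."""
    exposed = []   # the features an observer between layers would see
    noise = None   # kept in a local variable only; never stored or returned
    for i, layer in enumerate(LAYERS):
        if noise is not None:
            x = [v - n for v, n in zip(x, noise)]   # undo the previous layer's noise
        x = layer(x)
        if i < len(LAYERS) - 1:                      # intermediate layers only
            noise = [rng.gauss(0.0, scale) for _ in x]
            x = [v + n for v, n in zip(x, noise)]
            exposed.append(list(x))
        else:
            noise = None
    return x, exposed

inputs = [0.5, -1.0, 2.0]
plain = forward_plain(inputs)
protected, exposed = forward_with_noise(inputs, random.Random(0))

print("plain output:    ", plain)
print("protected output:", protected)
print("exposed features:", exposed)
print("outputs match:", math.isclose(plain[0], protected[0], rel_tol=1e-9))
```

The two outputs agree up to floating-point rounding, since subtracting the noise exactly cancels what was added, and no retraining is involved because the layers themselves are untouched.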
The paper also briefly discusses the potential of Homomorphic Encryption (HE) as a more advanced cryptographic approach to further secure these intermediate features, allowing computations on encrypted data without revealing the plaintext.
Conclusion
CapRecover highlights a critical vulnerability in how Vision-Language Models are deployed, demonstrating that sensitive information can be directly extracted from their internal workings. The research not only exposes this new form of feature inversion attack but also offers a practical, low-cost defense mechanism, paving the way for more secure AI applications in the future.


