Unmasking Watermark Vulnerabilities: A New Method for Forging Digital Signatures on Images

TLDR: Researchers have developed a new “black-box” method for forging digital watermarks on images that requires only a single watermarked image and no knowledge of the original watermarking model. The technique uses a preference model, trained on synthetically altered images, to identify and replicate watermark patterns. The results highlight significant security flaws in current post-hoc watermarking systems and make the case for more robust, content-aware detection mechanisms.

Digital watermarking has become increasingly important in recent years, especially with the rise of AI-generated content. It helps ensure content authenticity and attribution by embedding imperceptible signals into images. While much research has focused on removing watermarks, the act of “watermark forging”—stealing a watermark from genuine content and applying it to malicious content—has remained largely unexplored.

A new research paper, titled “Transferable Black-Box One-Shot Forging of Watermarks via Image Preference Models,” delves into this critical security vulnerability. Authored by Tomáš Souček, Sylvestre-Alvise Rebuffi, Pierre Fernandez, Nikola Jovanović, Hady Elsahar, Valeriu Lacatusu, Tuan Tran, and Alexandre Mourachko from Meta FAIR and ETH Zurich, this work introduces a novel approach to investigate and demonstrate watermark forging in post-hoc image watermarking.

The core of their contribution lies in a new method that makes watermark forging simpler and more practical than previous attempts. Unlike other attacks that require extensive access to watermarked data or the watermarking model itself, this new technique needs only a single watermarked image and no prior knowledge of how the watermark was created. This makes it a “black-box” attack, mimicking real-world adversarial scenarios.

Understanding the Attack Mechanism

The researchers developed a “preference model” to determine if an image is watermarked. This model is trained using a ranking loss on purely procedurally generated images, meaning it doesn’t need actual watermarked content for its training. It learns to identify subtle “unnatural” artifacts that indicate the presence of a watermark.
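To make the training signal concrete, the sketch below computes a pairwise ranking loss for a clean image and a procedurally perturbed copy. It is a toy under stated assumptions, not the authors’ model: images are flat lists of grayscale values in [0, 1], the hypothetical `preference_score` is a hand-written smoothness measure standing in for a trained neural network, `perturb` adds a made-up ± pattern in place of the paper’s synthetic perturbations, and the logistic form of `ranking_loss` is one common ranking loss, not necessarily the one used in the paper.

```python
import math
import random

def preference_score(pixels):
    # Hypothetical stand-in for a learned model: higher (less negative) for
    # images with less pixel-to-pixel high-frequency energy.
    return -sum((b - a) ** 2 for a, b in zip(pixels, pixels[1:]))

def perturb(pixels, strength, rng):
    # Hypothetical procedural perturbation: a faint pseudo-random +/- pattern,
    # clipped to the valid pixel range.
    return [min(1.0, max(0.0, p + strength * rng.choice((-1.0, 1.0))))
            for p in pixels]

def ranking_loss(score_preferred, score_other):
    # Pairwise logistic ranking loss: small when the preferred image
    # outscores the other one, large when the ordering is reversed.
    return math.log1p(math.exp(-(score_preferred - score_other)))

rng = random.Random(0)
clean = [0.5 + 0.1 * math.sin(i / 10) for i in range(256)]
perturbed = perturb(clean, 0.05, rng)

loss_right = ranking_loss(preference_score(clean), preference_score(perturbed))
loss_wrong = ranking_loss(preference_score(perturbed), preference_score(clean))
print(f"correct order: {loss_right:.4f}, reversed order: {loss_wrong:.4f}")
```

Training a real preference model would minimize this loss over many such pairs by updating the network’s weights; here the scorer is fixed, so the example only shows that the loss is small when the clean image is ranked above its perturbed copy and large otherwise.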

Once trained, this preference model becomes a powerful tool. Because the model is differentiable, gradients of its preference score can be backpropagated all the way to the input pixels, so the image itself can be optimized to either remove an existing watermark or forge one onto a new image. Maximizing the preference score pushes the image toward looking “clean” (non-watermarked), which strips the watermark; the difference between the original and the cleaned image then serves as an estimate of the watermark, which is the starting point for forging.
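The optimization step can be illustrated in isolation, assuming a differentiable score. The sketch is hypothetical throughout: `preference_score` is a toy smoothness measure rather than a learned model, its gradient is written by hand instead of being obtained through backpropagation, and the `fidelity` term that keeps the result close to the input is an added assumption reflecting the goal of imperceptible changes. The loop performs plain gradient ascent on the pixels.

```python
import math
import random

def preference_score(x):
    # Toy differentiable score (hypothetical): smoother images score higher.
    return -sum((b - a) ** 2 for a, b in zip(x, x[1:]))

def preference_grad(x):
    # Closed-form gradient of preference_score for every pixel; a neural
    # preference model would obtain this through backpropagation instead.
    grad = [0.0] * len(x)
    for i in range(len(x) - 1):
        d = x[i + 1] - x[i]
        grad[i] += 2 * d      # d(-d^2)/dx[i]   = +2d
        grad[i + 1] -= 2 * d  # d(-d^2)/dx[i+1] = -2d
    return grad

def optimize_pixels(x0, steps=200, lr=0.05, fidelity=0.1):
    x = list(x0)
    for _ in range(steps):
        g = preference_grad(x)
        # Ascend the preference score while the (assumed) fidelity term pulls
        # pixels back toward the input; clip to the valid pixel range.
        x = [min(1.0, max(0.0, xi + lr * (gi - 2 * fidelity * (xi - ai))))
             for xi, gi, ai in zip(x, g, x0)]
    return x

rng = random.Random(1)
clean = [0.5 + 0.1 * math.sin(i / 10) for i in range(256)]
watermarked = [p + 0.03 * rng.choice((-1.0, 1.0)) for p in clean]
restored = optimize_pixels(watermarked)

err_before = sum((a - b) ** 2 for a, b in zip(watermarked, clean))
err_after = sum((a - b) ** 2 for a, b in zip(restored, clean))
print(preference_score(restored) > preference_score(watermarked))
print(f"squared error to clean image: {err_before:.4f} -> {err_after:.4f}")
```

Because the toy score only rewards smoothness, this “removal” amounts to low-pass filtering; a learned preference model would instead push the pixels away from whatever artifacts it learned to associate with watermarks.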

The attack pipeline is quite elegant: given a single watermarked image, the system first estimates the embedded watermark. This estimated watermark can then be applied to any new image, making it appear genuinely watermarked to detection systems. The goal is to make these modifications imperceptible, ensuring the forged image looks realistic.
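Under similar toy assumptions, the one-shot pipeline can be sketched in two steps: remove the watermark from the single watermarked image, treat the residual as the watermark estimate, and add that residual to an unrelated target image. The names `remove_watermark`, `estimate_watermark`, and `forge`, the ±0.03 synthetic watermark, and the smoothness-based remover are all hypothetical stand-ins; the residual-based estimate is an illustrative assumption, not a transcription of the paper’s exact procedure.

```python
import math
import random

def clip(v):
    return min(1.0, max(0.0, v))

def remove_watermark(x0, steps=200, lr=0.05, fidelity=0.1):
    # Hypothetical remover: gradient ascent on a smoothness "preference" score
    # plus a fidelity term, standing in for optimization against a learned model.
    x = list(x0)
    for _ in range(steps):
        g = [0.0] * len(x)
        for i in range(len(x) - 1):
            d = x[i + 1] - x[i]
            g[i] += 2 * d
            g[i + 1] -= 2 * d
        x = [clip(xi + lr * (gi - 2 * fidelity * (xi - ai)))
             for xi, gi, ai in zip(x, g, x0)]
    return x

def estimate_watermark(watermarked):
    # Step 1: the residual between the single watermarked image and its
    # cleaned version is taken as the watermark estimate.
    cleaned = remove_watermark(watermarked)
    return [w - c for w, c in zip(watermarked, cleaned)]

def forge(target, watermark_estimate):
    # Step 2: paste the estimate onto an unrelated target image.
    return [clip(t + w) for t, w in zip(target, watermark_estimate)]

rng = random.Random(2)
true_wm = [0.03 * rng.choice((-1.0, 1.0)) for _ in range(256)]
source = [0.5 + 0.1 * math.sin(i / 10) for i in range(256)]
target = [0.4 + 0.1 * math.cos(i / 7) for i in range(256)]

w_hat = estimate_watermark([clip(s + w) for s, w in zip(source, true_wm)])
forged = forge(target, w_hat)

# Share of the true watermark pattern that ended up on the target.
transfer = (sum((f - t) * w for f, t, w in zip(forged, target, true_wm))
            / sum(w * w for w in true_wm))
print(f"watermark transfer ratio: {transfer:.2f}")
```

The printed ratio measures how much of the true watermark pattern was transplanted; a value close to 1 means the target now carries nearly the same signal a detector would look for.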

Key Contributions and Implications

The paper highlights three main contributions:

A novel image preference model trained on synthetically perturbed images, eliminating the need for real watermarked data.
A gradient-based attack procedure that uses this preference model to remove or forge watermarks through direct image pixel optimization, without needing to know the original watermarking scheme.
Comprehensive evaluations across various post-hoc image watermarking models, demonstrating the effectiveness of their forging approach and providing insights into which watermarking methods are more robust.

The findings are significant because they question the security of many current post-hoc watermarking approaches. While some content-aware watermarking methods show resistance, others can be easily exploited. The researchers emphasize that their method provides a more realistic assessment of vulnerabilities in the wild, as it operates under practical, low-resource, black-box conditions.

Comparison with Existing Methods

The new method outperforms many prior works in watermark forging, especially for watermarking schemes where the watermark is highly dependent on the image content, like Video Seal. Traditional methods like “image averaging” might work for static watermarks but fail when watermarks are dynamic and content-aware. For watermark removal, their approach is competitive, producing high-quality images with effectively removed watermark information, without the “hallucination” of details seen in some diffusion-based methods.
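The contrast between static and content-aware watermarks can be illustrated with a toy averaging attack. Everything here is hypothetical: images are independent random pixels with a known mean of 0.5, the static watermark is a fixed ±0.03 pattern, and `content_aware_wm` is a made-up scheme whose sign follows each pixel’s content. Averaging many watermarked images cancels the varying content, so a fixed pattern survives, while a pattern that changes with every image averages away. Note that, unlike the one-shot attack, averaging needs many watermarked images.

```python
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def averaging_estimate(images, mean_content):
    # The per-pixel mean of many watermarked images converges to the mean
    # content plus whatever watermark component never changes.
    n = len(images)
    avg = [sum(column) / n for column in zip(*images)]
    return [a - mean_content for a in avg]

rng = random.Random(3)
PIXELS, N = 256, 500
static_wm = [0.03 * rng.choice((-1.0, 1.0)) for _ in range(PIXELS)]

def random_content():
    # Hypothetical content model: i.i.d. pixels with known mean 0.5.
    return [rng.uniform(0.25, 0.75) for _ in range(PIXELS)]

def content_aware_wm(x):
    # Hypothetical content-aware scheme: each watermark sample's sign
    # depends on the underlying pixel value.
    return [0.03 if p > 0.5 else -0.03 for p in x]

static_imgs = [[p + w for p, w in zip(random_content(), static_wm)] for _ in range(N)]
contents = [random_content() for _ in range(N)]
aware_imgs = [[p + w for p, w in zip(x, content_aware_wm(x))] for x in contents]

static_est = averaging_estimate(static_imgs, 0.5)
aware_est = averaging_estimate(aware_imgs, 0.5)
print(f"static: {cosine(static_est, static_wm):.2f}")
print(f"content-aware: {cosine(aware_est, content_aware_wm(contents[0])):.2f}")
```

The static estimate aligns closely with the true pattern, while the content-aware estimate bears little resemblance to the watermark of any individual image, which is the failure mode the article describes for averaging-based forgery.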


Limitations and Future Directions

The attack primarily targets post-hoc watermarking methods, not semantic watermarking techniques that alter objects or their locations in AI-generated images. The method may also cause some blurring in areas with natural high-frequency textures, though this could be mitigated with improved training. The authors recommend that watermarking developers ensure their decoders are truly content-aware and explicitly trained to reject watermarks from different source images to strengthen future techniques. You can read the full paper here: Transferable Black-Box One-Shot Forging of Watermarks via Image Preference Models.
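The recommendation that decoders be content-aware can be illustrated with a deliberately simplified, hypothetical scheme that is not any real watermarking system: the watermark pattern is derived from a hash of the image’s coarse content (quantized block means), and the detector recomputes the pattern it expects from the image it receives, so a pattern lifted from another image no longer matches. The toy omits a secret key (so anyone could compute the pattern) and ignores robustness to compression or cropping, both of which a real scheme would need.

```python
import hashlib
import random

BLOCK, EPS = 16, 0.02

def block_means(x):
    return [sum(x[i:i + BLOCK]) / BLOCK for i in range(0, len(x), BLOCK)]

def content_pattern(x):
    # Hypothetical content binding: seed the pattern with a hash of quantized
    # block means, which the small zero-sum watermark leaves unchanged.
    signature = tuple(round(m * 8) for m in block_means(x))
    digest = hashlib.sha256(repr(signature).encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    pattern = []
    for _ in range(len(x) // BLOCK):
        block = [1.0] * (BLOCK // 2) + [-1.0] * (BLOCK // 2)
        rng.shuffle(block)  # zero sum per block keeps the block mean intact
        pattern.extend(block)
    return pattern

def embed(x):
    return [p + EPS * s for p, s in zip(x, content_pattern(x))]

def detect(y):
    # Correlate the residual around each block mean with the pattern
    # expected for THIS image's own content.
    means = block_means(y)
    pattern = content_pattern(y)
    residual = [p - means[i // BLOCK] for i, p in enumerate(y)]
    return sum(r * s for r, s in zip(residual, pattern)) / len(y)

source = [(1 + (i // BLOCK) % 6) / 8 for i in range(256)]
target = [(1 + (i // BLOCK * 3 + 1) % 6) / 8 for i in range(256)]

genuine = embed(source)
stolen = [g - s for g, s in zip(genuine, source)]  # attacker's best case
forged = [t + w for t, w in zip(target, stolen)]
print(f"genuine: {detect(genuine):.4f}, forged: {detect(forged):.4f}")
```

Even when the attacker copies the exact embedded signal, the detector looks for the pattern tied to the target’s own content, so the transplanted watermark scores near zero while the genuine image scores at the full embedding strength.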

This research serves as a crucial warning and a call to action for the digital watermarking community, pushing for more robust and secure solutions in an era increasingly dominated by AI-generated content.
