TLDR: Researchers have developed a new “black-box” method to forge digital watermarks on images, requiring only a single watermarked image and no knowledge of the original watermarking model. The technique uses a preference model, trained on synthetically altered images, to identify and replicate watermark patterns. It exposes significant security flaws in current post-hoc watermarking systems and underscores the need for more robust, content-aware detection mechanisms.
Digital watermarking has become increasingly important in recent years, especially with the rise of AI-generated content. It helps ensure content authenticity and attribution by embedding imperceptible signals into images. While much research has focused on removing watermarks, the act of “watermark forging”—stealing a watermark from genuine content and applying it to malicious content—has remained largely unexplored.
A new research paper, titled “Transferable Black-Box One-Shot Forging of Watermarks via Image Preference Models,” delves into this critical security vulnerability. Authored by Tomáš Souček, Sylvestre-Alvise Rebuffi, Pierre Fernandez, Nikola Jovanović, Hady Elsahar, Valeriu Lacatusu, Tuan Tran, and Alexandre Mourachko from Meta FAIR and ETH Zurich, this work introduces a novel approach to investigate and demonstrate watermark forging in post-hoc image watermarking.
The core of their contribution lies in a new method that makes watermark forging simpler and more practical than previous attempts. Unlike other attacks that require extensive access to watermarked data or the watermarking model itself, this new technique needs only a single watermarked image and no prior knowledge of how the watermark was created. This makes it a “black-box” attack, mimicking real-world adversarial scenarios.
Understanding the Attack Mechanism
The researchers developed a “preference model” to determine if an image is watermarked. This model is trained using a ranking loss on purely procedurally generated images, meaning it doesn’t need actual watermarked content for its training. It learns to identify subtle “unnatural” artifacts that indicate the presence of a watermark.
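To make the training idea concrete, here is a minimal sketch of a procedurally generated image pair and a pairwise ranking loss. Everything in it is an illustrative assumption rather than the paper's actual recipe: the helper names `make_training_pair` and `pairwise_ranking_loss` are hypothetical, the sinusoidal perturbation is just one possible synthetic artifact, and the logistic (Bradley-Terry style) form is only one common choice of ranking loss.

```python
import math
import random


def make_training_pair(size=16, amplitude=0.02, seed=0):
    """Return (clean, perturbed) toy grayscale images as nested lists.

    Assumption: a smooth gradient stands in for a procedurally generated
    image, and a faint periodic pattern stands in for a synthetic artifact.
    """
    rng = random.Random(seed)
    slope = rng.uniform(0.0, 0.5) / size
    offset = rng.uniform(0.2, 0.4)
    clean = [[offset + slope * (x + y) for x in range(size)]
             for y in range(size)]
    freq = rng.uniform(1.0, 3.0)
    perturbed = [
        [clean[y][x] + amplitude * math.sin(freq * x) * math.cos(freq * y)
         for x in range(size)]
        for y in range(size)
    ]
    return clean, perturbed


def pairwise_ranking_loss(score_preferred, score_rejected):
    """Logistic ranking loss: -log(sigmoid(score_preferred - score_rejected)).

    The loss is small when the preferred (clean) image scores higher than
    the rejected (perturbed) one. Computed in a numerically stable way.
    """
    diff = score_preferred - score_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))
```

In a real setup the two scores would come from a neural network evaluated on the clean and perturbed images, and the loss would be minimized over many such pairs.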
Once trained, this preference model becomes a powerful tool. Because the model is differentiable, its score can be backpropagated all the way to the input pixels, and the image itself is then updated by gradient steps. This lets the model either remove an existing watermark or forge one onto a new image. The optimization is driven by the model's preference score, which steers the modification so that the image appears either “clean” (non-watermarked) or “watermarked.”
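The pixel-level optimization can be illustrated with a toy example. The sketch below is not the paper's model: a hand-written smoothness score (`toy_preference_score`, a hypothetical stand-in) replaces the neural preference model, and its gradient is derived by hand instead of being obtained through backpropagation. The `budget` parameter, which keeps every pixel close to its original value, is likewise an assumption used here to mimic an imperceptibility constraint.

```python
def toy_preference_score(x):
    """Stand-in score on a 1-D signal: higher means smoother ("cleaner")."""
    return -sum((x[i + 1] - x[i]) ** 2 for i in range(len(x) - 1))


def toy_score_gradient(x):
    """Analytic gradient of toy_preference_score with respect to each pixel."""
    grad = [0.0] * len(x)
    for i in range(len(x) - 1):
        d = x[i + 1] - x[i]
        grad[i] += 2.0 * d       # derivative of -(d**2) w.r.t. x[i]
        grad[i + 1] -= 2.0 * d   # derivative of -(d**2) w.r.t. x[i+1]
    return grad


def optimize_pixels(image, steps=50, lr=0.05, budget=0.1):
    """Projected gradient ascent on the pixels to raise the score.

    After every step each pixel is clamped to stay within `budget` of its
    original value and inside the valid range [0, 1].
    """
    x = list(image)
    for _ in range(steps):
        grad = toy_score_gradient(x)
        x = [xi + lr * gi for xi, gi in zip(x, grad)]
        x = [min(max(xi, max(0.0, oi - budget)), min(1.0, oi + budget))
             for xi, oi in zip(x, image)]
    return x


# Usage: a ramp with a faint alternating pattern added on top.
noisy = [0.3 + 0.01 * i + (0.05 if i % 2 else -0.05) for i in range(20)]
smoothed = optimize_pixels(noisy)
```

With a learned preference model, the only structural change would be that the gradient comes from automatic differentiation through the network rather than from a closed-form expression.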
The attack pipeline is quite elegant: given a single watermarked image, the system first estimates the embedded watermark. This estimated watermark can then be applied to any new image, making it appear genuinely watermarked to detection systems. The goal is to make these modifications imperceptible, ensuring the forged image looks realistic.
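The estimate-then-apply step can be sketched in a few lines. This assumes, for illustration only, that the watermark estimate is the pixel-wise difference between the watermarked image and a cleaned version of it, and that forging simply adds this residual to the new image. The function names and the `strength` parameter are hypothetical, and images are flattened to lists of values in [0, 1].

```python
def estimate_watermark(watermarked, cleaned):
    """Estimate the watermark as the residual: watermarked minus cleaned."""
    return [w - c for w, c in zip(watermarked, cleaned)]


def apply_watermark(target, residual, strength=1.0):
    """Add the estimated residual to a new image, clipping to [0, 1]."""
    return [min(1.0, max(0.0, t + strength * r))
            for t, r in zip(target, residual)]


# Usage with tiny made-up pixel values.
watermarked = [0.50, 0.60, 0.70]
cleaned = [0.48, 0.61, 0.70]      # e.g. the output of a removal step
residual = estimate_watermark(watermarked, cleaned)
forged = apply_watermark([0.99, 0.00, 0.30], residual)
```

Whether a forgery built this way is accepted depends on the detector: if the decoder ties the watermark to the image content, a residual copied from another image is less likely to be accepted.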
Key Contributions and Implications
The paper highlights three main contributions:
- A novel image preference model trained on synthetically perturbed images, eliminating the need for real watermarked data.
- A gradient-based attack procedure that uses this preference model to remove or forge watermarks through direct image pixel optimization, without needing to know the original watermarking scheme.
- Comprehensive evaluations across various post-hoc image watermarking models, demonstrating the effectiveness of their forging approach and providing insights into which watermarking methods are more robust.
The findings are significant because they question the security of many current post-hoc watermarking approaches. While some content-aware watermarking methods show resistance, others can be easily exploited. The researchers emphasize that their method provides a more realistic assessment of vulnerabilities in the wild, as it operates under practical, low-resource, black-box conditions.
Comparison with Existing Methods
The new method outperforms many prior works in watermark forging, especially for watermarking schemes where the watermark is highly dependent on the image content, like Video Seal. Traditional methods like “image averaging” might work for static watermarks but fail when watermarks are dynamic and content-aware. For watermark removal, their approach is competitive, producing high-quality images with effectively removed watermark information, without the “hallucination” of details seen in some diffusion-based methods.
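A toy simulation shows why averaging works for static watermarks but not for content-dependent ones. Both embedding functions below are hypothetical stand-ins, not real watermarking schemes: the static one adds a fixed pattern, while the "content-aware" one flips the pattern's sign based on the image. The estimate also assumes the attacker knows the mean of the image content (0.5 for uniformly random pixels).

```python
import random

# Hypothetical watermark pattern on an 8-pixel "image".
PATTERN = [0.05, -0.05, 0.05, 0.05, -0.05, -0.05, 0.05, -0.05]


def static_embed(content):
    """Static scheme: the same pattern is added to every image."""
    return [c + p for c, p in zip(content, PATTERN)]


def content_aware_embed(content):
    """Toy content-dependent scheme: the pattern's sign depends on the image."""
    sign = 1.0 if content[0] > 0.5 else -1.0
    return [c + sign * p for c, p in zip(content, PATTERN)]


def averaging_estimate(embed, n_images=5000, seed=0):
    """Average many watermarked images and subtract the mean content (0.5)."""
    rng = random.Random(seed)
    totals = [0.0] * len(PATTERN)
    for _ in range(n_images):
        content = [rng.random() for _ in PATTERN]
        for i, value in enumerate(embed(content)):
            totals[i] += value
    return [t / n_images - 0.5 for t in totals]


static_estimate = averaging_estimate(static_embed)          # close to PATTERN
dynamic_estimate = averaging_estimate(content_aware_embed)  # close to zero
```

In the static case the random content averages out and the pattern survives. In the content-dependent case the pattern averages out along with the content, so the attacker is left with nothing to paste onto a new image.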
Limitations and Future Directions
The attack primarily targets post-hoc watermarking methods, not semantic watermarking techniques that alter objects or their locations in AI-generated images. The method may also cause some blurring in areas with natural high-frequency textures, though this could be mitigated with improved training. The authors recommend that watermarking developers ensure their decoders are truly content-aware and explicitly trained to reject watermarks from different source images to strengthen future techniques. You can read the full paper here: Transferable Black-Box One-Shot Forging of Watermarks via Image Preference Models.
This research serves as a crucial warning and a call to action for the digital watermarking community, pushing for more robust and secure solutions in an era increasingly dominated by AI-generated content.


