TLDR: This research introduces a multi-key watermarking defense that protects generative AI providers from “watermark stealing” attacks. Instead of using a single secret key, the provider randomly selects one of several keys to embed the watermark in each output. During verification, content is accepted as authentic only if exactly one key is detected; if no key is detected, the content is treated as not generated by the provider, and if multiple keys are detected, it is flagged as a forgery. This method significantly reduces the success rate of attackers trying to falsely attribute harmful content to AI providers, even against sophisticated attacks, and works across different content types like text and images.
In the rapidly evolving landscape of artificial intelligence, generative models (GenAI) are producing increasingly realistic content, from text to images. While this offers immense creative potential, it also raises critical questions about content authenticity and accountability. How can we be sure that a piece of content truly originated from a specific AI provider? Watermarking has emerged as a promising solution, embedding a hidden signal into generated content that can later be verified using a secret key.
However, a significant threat looms: watermark stealing attacks. These attacks involve malicious users attempting to forge a provider’s watermark into content that was not generated by their models, often with the intent to falsely accuse the provider of creating harmful material. Imagine a scenario where a user generates hate speech or misinformation independently, then manipulates it to appear as if it came from a reputable AI service, damaging the provider’s reputation and potentially exposing them to legal liabilities.
The Challenge of Forgery
Traditional watermarking schemes, which often rely on a single secret key, have proven vulnerable to these stealing attacks. Attackers can collect numerous watermarked samples from a provider’s model, analyze the hidden patterns, and then learn to mimic these patterns to embed a forged watermark into their own content. Existing defenses, such as storing all generated content in a database for verification or using complex statistical tests, often come with trade-offs like privacy concerns, high computational costs, or limited effectiveness against sophisticated adversaries.
Introducing Multi-Key Watermarking
A new defense, detailed in the research paper Mitigating Watermark Stealing Attacks in Generative Models via Multi-Key Watermarking, proposes multi-key watermarking. Instead of using just one secret key, the provider holds a set of r watermarking keys. Each time the provider generates content in response to a user query, it selects one of these r keys at random and embeds the watermark with that key.
The detection process is equally innovative. When verifying content, the system tests for the presence of each of the ‘r’ keys. The content is deemed authentic only if exactly one key is detected with statistical significance. If no keys are detected, it means the content was not generated by the provider’s model. Crucially, if multiple keys are detected, it signals a forgery attempt. This mechanism makes it significantly harder for attackers because they cannot easily learn a single, consistent watermark pattern from the provider’s outputs, as each sample might have been watermarked with a different, unknown key.
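The generate-then-verify logic described above can be sketched as follows. This is a minimal Python illustration, not the paper's implementation: the keyed green-list test `is_green`, the constants `GAMMA`, `Z_THRESHOLD` and `VOCAB_SIZE`, and the toy generator that emits only "green" tokens are all assumptions introduced for the example, loosely modeled on KGW-style text watermarks.

```python
import hashlib
import math
import random

GAMMA = 0.5        # assumed fraction of tokens that are "green" under any key
Z_THRESHOLD = 4.0  # assumed per-key detection threshold on the z-score
VOCAB_SIZE = 1000  # toy vocabulary of integer token ids


def is_green(key: bytes, prev_token: int, token: int) -> bool:
    """Hypothetical keyed test: hash (key, previous token, token) to one bit."""
    payload = key + prev_token.to_bytes(4, "big") + token.to_bytes(4, "big")
    return hashlib.sha256(payload).digest()[0] < 256 * GAMMA


def generate(keys, length, rng):
    """Pick one key at random and emit only tokens that are green under it."""
    key = rng.choice(keys)
    tokens = [rng.randrange(VOCAB_SIZE)]
    while len(tokens) < length:
        candidate = rng.randrange(VOCAB_SIZE)
        if is_green(key, tokens[-1], candidate):
            tokens.append(candidate)
    return tokens


def z_score(key, tokens):
    """One-proportion z-test on the count of green tokens under `key`."""
    n = len(tokens) - 1
    green = sum(is_green(key, prev, tok) for prev, tok in zip(tokens, tokens[1:]))
    return (green - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))


def verify(keys, tokens):
    """Test every key and apply the exactly-one-key rule."""
    detected = sum(z_score(key, tokens) > Z_THRESHOLD for key in keys)
    if detected == 1:
        return "authentic"
    if detected == 0:
        return "not watermarked"
    return "forgery"
```

In this sketch, `verify` runs one statistical test per key, so its cost grows with the number of keys. A sequence that carries the signal of two or more keys, as a forger who averaged patterns over many provider outputs might produce, is rejected even though each individual key test passes.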
Proven Effectiveness Across Modalities
The research demonstrates the effectiveness of this multi-key defense across various scenarios and content types:
- Textual Content: While single-key watermarking schemes showed high vulnerability, with attackers achieving up to 79% success rates in forging harmful content, the multi-key approach drastically reduced these rates. For instance, with four keys, spoofing success rates dropped from over 70% to as low as 15-26% across different watermarking algorithms like KGW-SelfHash, Unigram, KGW-Soft, and KGW-Hard.
- Adaptive Attacks: Even against highly sophisticated attackers who had access to labeled training data (knowing which key generated which content), the multi-key defense proved resilient. Despite attackers achieving high accuracy in identifying key-specific patterns, their spoofing success rate plateaued at 65%, demonstrating the method’s strong protection.
- Mixed Multi-Key Strategy: Further enhancing security, the researchers explored combining multiple watermarking methods with multiple keys. This ‘mixed multi-key’ approach achieved even lower spoofing success rates, dropping to as low as 9-13%.
- Image Watermarks: The defense is not limited to text. When applied to image forgery detection using Tree-Ring watermarking, the multi-key approach reduced spoofing success rates from a staggering 100% (for single-key) down to just 2% when using four keys.
Balancing Security and Utility
A critical aspect of any security measure is its impact on legitimate use. The multi-key watermarking scheme maintains high detection accuracy for genuine watermarked content, with false negative rates remaining low (between 0% and 3%). This indicates that the enhanced security does not come at the cost of rejecting authentic content.
Looking Ahead
While multi-key watermarking represents a significant leap forward in securing generative AI, the research acknowledges ongoing challenges. The computational overhead increases linearly with the number of keys, though it remains manageable for current applications. Future work will also need to address even more sophisticated attack vectors, such as those that require only a single watermarked sample to forge content.
Ultimately, this multi-key strategy offers a practical and deployable defense that can be applied to existing watermarking systems without requiring extensive retraining of generative models. By bolstering the integrity of content attribution, it paves the way for more trustworthy and accountable AI systems in an increasingly complex digital world.


