
Imitating AI Signatures: A New Threat to LLM Watermarking

TLDR: A new research paper introduces DITTO, a framework that allows a malicious LLM to generate text with the authentic-looking watermark of a trusted victim model. This “spoofing attack” exploits “watermark radioactivity” through knowledge distillation, enabling misattribution of harmful content. DITTO is effective against various watermarking schemes and can increase spoofing strength without degrading text quality, revealing a critical security flaw in current LLM authorship verification.

Large Language Models (LLMs) are becoming integral to our daily lives, from powering industrial applications to assisting in education and personal tasks. Their ability to generate human-like text at scale is a powerful tool, but it also brings significant concerns about authenticity and trust. As these models become more integrated, the need to detect and verify LLM-generated text, and even identify which specific model generated it, has become paramount. This is where LLM watermarking comes into play.

Watermarking for LLMs involves embedding an imperceptible, yet machine-detectable, signal into the model’s outputs. This signal is intended to convey authorship information, helping to establish the provenance of AI-generated content. Major industry players like Meta, OpenAI, and Google DeepMind have been exploring watermarking as a practical solution for this very purpose, aiming to ensure accountability and manage the risks associated with different models.
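To make the idea concrete, here is a toy sketch of a KGW-style “green-list” watermark, in which the previous token seeds a pseudorandom split of the vocabulary and generation favors the “green” half. This is an illustrative assumption, not any vendor’s actual scheme; the function names and the SHA-256 seeding are ours:

```python
import hashlib
import random

def greenlist(prev_token: str, vocab: list[str], frac: float = 0.5) -> set[str]:
    """Seed a PRNG with the previous token and mark a fraction of the
    vocabulary as 'green' (favored) tokens -- the KGW-style idea."""
    seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)
    return set(rng.sample(vocab, int(len(vocab) * frac)))

def detect(tokens: list[str], vocab: list[str]) -> float:
    """Fraction of tokens that fall in the green list keyed by their
    predecessor; unwatermarked text should hover near the base rate (0.5)."""
    hits = sum(t in greenlist(p, vocab) for p, t in zip(tokens, tokens[1:]))
    return hits / max(len(tokens) - 1, 1)
```

A detector only needs the seeding rule, not the model itself, which is what makes such signals cheap to verify and, as DITTO shows, worth imitating.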

However, a new research paper introduces a significant challenge to the core assumption of LLM watermarking: that a specific watermark reliably proves authorship by a specific model. The paper, titled DITTO: A Spoofing Attack Framework on Watermarked LLMs via Knowledge Distillation, demonstrates a sophisticated attack called “watermark spoofing.”

Understanding Watermark Spoofing

Unlike “scrubbing attacks,” which aim to remove embedded watermarks to evade detection, spoofing attacks are far more malicious. They enable a malicious model to generate text that appears to carry the authentic watermark of a trusted, victim model. This allows for the seamless misattribution of harmful content, such as disinformation or propaganda, to reputable sources, creating a misleading sense of certainty for detectors.

The DITTO framework operates under a practical “black-box” setting, meaning the attacker doesn’t need to know the target model’s internal workings or its specific watermarking scheme. The key to this attack lies in repurposing a phenomenon known as “watermark radioactivity.”

Watermark Radioactivity: From Detection to Attack

Watermark radioactivity is the unintended inheritance of watermark signals by a student LLM when it is fine-tuned on outputs generated by a watermarked teacher model. Until now, this phenomenon was treated as a defensive tool: a way to audit provenance and detect unauthorized distillation. DITTO, however, turns it into a potent attack vector.

The DITTO framework, which stands for “Distilled watermark Imitation of a Targeted Teacher’s Outputs,” uses a three-stage knowledge distillation process:

1. Watermark Inheritance: First, a large dataset of text is generated using the watermarked target model. An open-source student model is then trained on this watermarked data through a process called supervised fine-tuning (SFT). During this training, the student model “inherits” the statistical patterns of the teacher’s watermark.

2. Watermark Extraction: Next, the inherited watermark pattern is isolated as a quantitative signal, called the Extracted Watermark Signal (EWS). This is done by comparing the output logits of the student model before and after fine-tuning, which exposes the systematic bias the watermark introduces, both globally and for specific text prefixes.

3. Spoofing Attack: Finally, during the text generation process (inference), this extracted EWS is added directly to the logits of the attacker’s unwatermarked model. This injection guides the attacker’s model to produce text that contains the unique watermarking signal of the victim model, successfully completing the spoofing attack.
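Stages 2 and 3 amount to a logit-space operation: difference out the fine-tuning shift, then add it back at inference. A minimal numerical sketch, assuming logits are plain per-token score lists (the function names are ours, and the real framework also conditions the signal on specific text prefixes):

```python
def extract_ews(logits_before, logits_after):
    """Stage 2 (sketch): the Extracted Watermark Signal is the average
    per-token logit shift between the student model before and after
    fine-tuning on watermarked text; averaging over many contexts
    isolates the watermark's systematic bias."""
    n = len(logits_before)
    vocab_size = len(logits_before[0])
    return [
        sum(after[v] - before[v] for before, after in zip(logits_before, logits_after)) / n
        for v in range(vocab_size)
    ]

def spoofed_logits(attacker_logits, ews, strength=1.0):
    """Stage 3 (sketch): inject the EWS into an unwatermarked model's
    logits at inference, steering sampling toward the victim
    watermark's token preferences."""
    return [l + strength * e for l, e in zip(attacker_logits, ews)]
```

The `strength` knob here corresponds to the spoofing intensity discussed below: scaling the injected signal strengthens the imitated watermark without retraining anything.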

Breaking the Trade-Off: Strength Without Quality Loss

A crucial finding of the DITTO research is its ability to overcome a conventional trade-off between attack strength and text quality. Typically, increasing the intensity of an attack might degrade the quality of the generated text, making it easier to detect. However, DITTO demonstrates that its spoofing intensity can be significantly increased without a discernible degradation in text quality. The text quality metric (perplexity) fluctuates unpredictably rather than consistently increasing with attack strength, making the attack highly evasive and stealthy.
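For context, the quality metric mentioned above, perplexity, is simply the exponentiated average negative log-likelihood of the text under a scoring model. A minimal computation:

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood; lower means the text
    looks more fluent to the scoring model. DITTO's finding is that
    this stays roughly flat as spoofing strength grows, so quality
    checks give no warning."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```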

Furthermore, DITTO proves to be highly adaptable. It is effective not only against n-gram-based watermarks (like KGW) but also against fundamentally different sampling-based schemes (like SynthID). This versatility suggests that DITTO doesn’t just exploit scheme-specific features but learns the more fundamental patterns of distributional distortion caused by any watermark.

Implications and Future Directions

This work reveals a critical security gap in text authorship verification, highlighting a systemic vulnerability with direct consequences for provenance, content moderation, licensing compliance, and incident response. The authors argue for a paradigm shift from merely detecting the presence of a watermark to actively verifying its authenticity, prioritizing adversarial resilience in the design of future watermarking technologies.

While DITTO successfully demonstrates the feasibility of watermark spoofing, the researchers also acknowledge limitations. The effectiveness of the attack depends on how faithfully the student model inherits the watermark. Future research could explore optimizing this “transfer” process and developing defense mechanisms to resist such sophisticated spoofing attacks.

The DITTO framework serves as a crucial “red-teaming” exercise for the AI safety community, proactively identifying and exposing a significant vulnerability. By understanding these threats, the goal is to foster more secure AI ecosystems capable of distinguishing authentic watermarks from expertly imitated ones.

Ananya Rao
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
