Unmasking Malicious Clients in Federated Learning with Watermarks

TLDR: A new defense called Coward protects federated learning from backdoor attacks. Rather than inspecting client updates after the fact, Coward proactively injects a “watermark” into the global model. Malicious clients, while planting their own backdoors, inadvertently erase this watermark because of a “collision effect,” whereas honest clients retain it. This lets the server reliably identify and exclude attackers, even with diverse (non-i.i.d.) client data and sophisticated attacks, without being misled by the out-of-distribution (OOD) bias that hampered earlier proactive defenses.

Federated Learning (FL) has emerged as a powerful approach for collaborative machine learning, allowing multiple devices or organizations to train a shared model without directly sharing their raw data. This privacy-preserving nature makes FL highly valuable in sensitive domains like healthcare and finance. However, this very strength also creates a blind spot for the central server: it cannot directly observe client-side behavior, opening the door to insidious threats known as backdoor attacks.

In a backdoor attack, malicious clients upload poisoned updates that embed hidden behaviors into the global model. This means the model will function normally on most inputs but will produce attacker-desired outcomes when exposed to specific, predefined trigger patterns. Such attacks undermine the reliability and trustworthiness of FL deployments.
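To make “trigger pattern” concrete, here is one common way a malicious client might build poisoned local training data: a BadNets-style square patch stamped into a corner of each poisoned image, with the label switched to an attacker-chosen target class. This is an illustrative assumption, not the attack configuration used in the Coward paper. The image format (grayscale nested lists), the patch size, and the helper names `stamp_trigger`, `poison_sample`, and `poison_dataset` are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical image format: grayscale image as a list of rows, pixel values in [0, 1].
Image = List[List[float]]


def stamp_trigger(image: Image, patch_size: int = 3, value: float = 1.0) -> Image:
    """Return a copy of `image` with a solid square patch in the bottom-right corner."""
    height, width = len(image), len(image[0])
    if patch_size > min(height, width):
        raise ValueError("patch does not fit inside the image")
    poisoned = [row[:] for row in image]  # copy rows so the clean sample is left untouched
    for r in range(height - patch_size, height):
        for c in range(width - patch_size, width):
            poisoned[r][c] = value
    return poisoned


def poison_sample(image: Image, target_label: int) -> Tuple[Image, int]:
    """Stamp the trigger and replace the true label with the attacker's target class."""
    return stamp_trigger(image), target_label


def poison_dataset(samples: List[Tuple[Image, int]], target_label: int,
                   fraction: float) -> List[Tuple[Image, int]]:
    """Poison the first `fraction` of a local dataset; the remaining samples stay clean,
    which helps the model keep behaving normally on untriggered inputs."""
    n_poison = int(len(samples) * fraction)
    return [poison_sample(img, target_label) if i < n_poison else (img, label)
            for i, (img, label) in enumerate(samples)]


if __name__ == "__main__":
    clean = [[0.0] * 5 for _ in range(5)]
    local_data = [(clean, label) for label in range(4)]
    poisoned = poison_dataset(local_data, target_label=9, fraction=0.5)
    print([label for _, label in poisoned])  # -> [9, 9, 2, 3]
```

A model trained on such a mix learns to associate the patch with class 9 while still fitting the clean samples, which is exactly the “normal on most inputs, attacker-chosen on triggered inputs” behavior described above.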

Current defenses against these attacks generally fall into two categories: passive and proactive. Passive defenses try to detect anomalies in client updates after they’ve been submitted. However, these methods often struggle with the real-world complexities of FL, such as non-uniform data distributions across clients (non-i.i.d. data) and the random participation of clients in training rounds. These factors can make benign updates look suspicious, leading to many false alarms.

Proactive defenses, on the other hand, have the server actively modify the global model to provoke different reactions from malicious and benign clients. BackdoorIndicator, a pioneering proactive method, showed promise but faced a significant challenge: Out-of-Distribution (OOD) bias. Deep neural networks tend to make overconfident, biased predictions on inputs drawn from outside their training distribution (OOD data). Because proactive defenses often rely on OOD data for pattern injection or evaluation, this bias can cause honest clients to be mistakenly flagged as malicious, producing a high false positive rate.

To address these critical limitations, researchers have introduced a novel proactive defense mechanism called Coward. This method is inspired by a new discovery: the “multi-backdoor collision effect.” This effect reveals that when distinct backdoors are planted consecutively, the newer ones can significantly suppress or erase earlier ones. Coward leverages this phenomenon by having the server inject a conflicting “global watermark” into the model.

The core idea of Coward is elegantly inverted compared to previous proactive methods. Instead of detecting attackers by looking for the *retention* of a planted pattern, Coward identifies attackers by evaluating whether the server-injected, conflicting global watermark is *erased* during local training. Benign clients, focused on their legitimate training tasks, will largely retain this watermark. Malicious clients, however, when attempting to implant their own backdoors, will inadvertently cause a collision with the server’s watermark, leading to its suppression or erasure. Clients whose watermark accuracy falls below a certain threshold are then identified as malicious.
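A minimal sketch of this inverted detection rule follows. It assumes each returned local model has already been run on a server-held set of OOD probe inputs stamped with the watermark trigger, and that “watermark accuracy” means the fraction of those probes mapped to the watermark’s target label. The function names, the watermark label, and the threshold of 0.5 are hypothetical; the paper’s exact scoring and threshold choice may differ.

```python
from typing import Dict, List, Sequence


def watermark_accuracy(predictions: Sequence[int], watermark_label: int) -> float:
    """Share of watermark-triggered OOD probes that a local model assigns to the watermark label."""
    if len(predictions) == 0:
        raise ValueError("need at least one probe prediction")
    hits = sum(1 for p in predictions if p == watermark_label)
    return hits / len(predictions)


def flag_malicious(acc_by_client: Dict[str, float], threshold: float) -> List[str]:
    """Inverted rule: flag clients whose watermark accuracy fell BELOW the threshold (watermark erased)."""
    return sorted(cid for cid, acc in acc_by_client.items() if acc < threshold)


if __name__ == "__main__":
    WATERMARK_LABEL = 5  # hypothetical target class of the server's watermark
    probe_preds = {
        "client_a": [5, 5, 5, 5],  # watermark fully retained
        "client_b": [5, 5, 2, 5],  # mostly retained
        "client_c": [1, 2, 5, 3],  # largely erased
    }
    acc = {cid: watermark_accuracy(p, WATERMARK_LABEL) for cid, p in probe_preds.items()}
    print(flag_malicious(acc, threshold=0.5))  # -> ['client_c']
```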

This approach offers several key advantages. It preserves the benefits of proactive defenses in handling data heterogeneity, meaning it’s robust even when client data distributions vary widely. Crucially, by treating high watermark accuracy as a sign of benign behavior (rather than malicious), Coward naturally mitigates the adverse impact of OOD bias. The high-confidence predictions that OOD bias often induces in benign clients now work *in favor* of detection, rather than against it.
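The toy comparison below illustrates why the inversion matters. It assumes, purely for illustration, that OOD bias shows up as a fixed upward shift in benign clients’ accuracy on the server’s planted pattern. Under a retention-style rule (high accuracy means suspicious), that shift pushes benign clients over the threshold; under the erasure-style rule it only moves them further away from it. All accuracy values, the 0.35 shift, the 0.6 threshold, and the function names are made up.

```python
from typing import Dict, List, Set


def retention_rule(acc: Dict[str, float], threshold: float) -> List[str]:
    """Retention-style detection: a client that KEEPS the planted pattern (high accuracy) is flagged."""
    return sorted(c for c, a in acc.items() if a >= threshold)


def erasure_rule(acc: Dict[str, float], threshold: float) -> List[str]:
    """Erasure-style (Coward-like) detection: a client that LOSES the watermark (low accuracy) is flagged."""
    return sorted(c for c, a in acc.items() if a < threshold)


def apply_ood_bias(acc: Dict[str, float], benign: Set[str], shift: float) -> Dict[str, float]:
    """Toy model of OOD bias: benign clients' accuracy on the OOD pattern is inflated by `shift` (capped at 1)."""
    return {c: min(1.0, a + shift) if c in benign else a for c, a in acc.items()}


if __name__ == "__main__":
    benign = {"benign_1", "benign_2"}
    threshold, shift = 0.6, 0.35  # made-up values

    # Retention setting: benign clients are expected to score LOW on the planted pattern.
    retention_acc = {"benign_1": 0.30, "benign_2": 0.40, "attacker": 0.90}
    print(retention_rule(retention_acc, threshold))  # -> ['attacker']
    print(retention_rule(apply_ood_bias(retention_acc, benign, shift), threshold))
    # -> ['attacker', 'benign_1', 'benign_2']  (the bias creates false positives)

    # Erasure setting: benign clients are expected to score HIGH on the watermark.
    erasure_acc = {"benign_1": 0.70, "benign_2": 0.80, "attacker": 0.10}
    print(erasure_rule(apply_ood_bias(erasure_acc, benign, shift), threshold))  # -> ['attacker']
```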

The Coward method involves three main stages: watermark injection, watermark interaction, and watermark detection. During injection, the server carefully embeds a backdoor-based OOD watermark into the global model using a regulated base OOD mapping and a targeted watermark mapping. This process is designed to be robust and not distort the model’s primary task. When clients perform local training (watermark interaction), benign clients simply train on their data, while malicious clients inject their backdoors. The collision effect ensures that the malicious clients’ training interferes with the server’s watermark. Finally, the server performs watermark detection by inspecting the strength of the watermark in the updated local models, excluding those that show significant watermark degradation.
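The detection stage of a round can be sketched as a filter placed in front of ordinary federated averaging. This sketch assumes model updates are flat lists of floats, that watermark accuracies have already been measured per client, and that the surviving updates are combined with unweighted FedAvg. The helper names and the threshold are hypothetical, and the injection stage (the regulated base OOD mapping and targeted watermark mapping) is not modeled.

```python
from typing import Dict, List, Tuple

Vector = List[float]  # hypothetical flat representation of a model update


def fedavg(updates: List[Vector]) -> Vector:
    """Unweighted coordinate-wise mean of the given updates."""
    if not updates:
        raise ValueError("every client was excluded; nothing to aggregate")
    dim = len(updates[0])
    return [sum(u[i] for u in updates) / len(updates) for i in range(dim)]


def detect_and_aggregate(updates: Dict[str, Vector], wm_acc: Dict[str, float],
                         threshold: float) -> Tuple[Vector, List[str]]:
    """Drop clients whose watermark accuracy degraded below `threshold`, then average the rest."""
    kept = [u for cid, u in updates.items() if wm_acc[cid] >= threshold]
    excluded = sorted(cid for cid in updates if wm_acc[cid] < threshold)
    return fedavg(kept), excluded


if __name__ == "__main__":
    updates = {"a": [1.0, 2.0], "b": [3.0, 4.0], "m": [100.0, -100.0]}
    wm_acc = {"a": 0.92, "b": 0.88, "m": 0.05}  # made-up watermark accuracies
    new_global, dropped = detect_and_aggregate(updates, wm_acc, threshold=0.5)
    print(new_global, dropped)  # -> [2.0, 3.0] ['m']
```

Because only the surviving updates are averaged, an excluded client contributes nothing to the next global model; if every client were excluded, `fedavg` raises rather than silently returning an empty model.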

Extensive experiments on benchmark datasets like EMNIST, CIFAR-10, and CIFAR-100 confirm Coward’s effectiveness. It consistently outperforms existing passive and proactive defenses, demonstrating strong resistance to varying degrees of data heterogeneity, advanced stealthy attacks (like PGD, Neurotoxin, and Chameleon), and scenarios involving multiple attackers. The method also proves robust to different choices of OOD datasets, trigger types for both the watermark and the attack, and various detection thresholds. Furthermore, Coward shows resilience against potential adaptive attacks where attackers try to guess and mimic the watermark, as such attempts often lead to a “local collision contradiction” that compromises their own attack objectives.

In essence, Coward provides a practical and robust solution for securing federated learning against backdoor attacks. By leveraging the multi-backdoor collision effect and an inverted detection paradigm, it offers a new perspective for understanding and defending against these threats, paving the way for more secure and trustworthy decentralized AI. You can find more details about this research in the full paper available at https://arxiv.org/pdf/2508.02115.