TLDR: A new defense called Coward protects federated learning from backdoor attacks. The server proactively injects a “watermark” into the global model and then checks whether it survives local training. Malicious clients, attempting to plant their own backdoors, will inadvertently erase this watermark due to a “collision effect,” while honest clients will retain it. This allows the server to identify and exclude attackers, even with heterogeneous client data and sophisticated attacks, without being misled by the out-of-distribution (OOD) bias that causes false alarms in earlier proactive defenses.
Federated Learning (FL) has emerged as a powerful approach for collaborative machine learning, allowing multiple devices or organizations to train a shared model without directly sharing their raw data. This privacy-preserving nature makes FL highly valuable in sensitive domains like healthcare and finance. However, this very strength also creates a blind spot for the central server: it cannot directly observe client-side behavior, opening the door to insidious threats known as backdoor attacks.
In a backdoor attack, malicious clients upload poisoned updates that embed hidden behaviors into the global model. This means the model will function normally on most inputs but will produce attacker-desired outcomes when exposed to specific, predefined trigger patterns. Such attacks undermine the reliability and trustworthiness of FL deployments.
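To make the idea of a trigger pattern concrete, here is a minimal sketch of how a malicious client might poison part of its local data. It assumes a simple patch-style trigger (a bright square in the image corner) and a fixed target label; the function names, the patch size, and the poisoning rate are all illustrative assumptions, not details taken from the paper.

```python
from typing import List, Tuple

Image = List[List[float]]  # grayscale image as rows of pixel intensities in [0, 1]


def stamp_trigger(image: Image, patch_size: int = 2, value: float = 1.0) -> Image:
    """Return a copy of `image` with a bright square in the bottom-right corner."""
    poisoned = [row[:] for row in image]  # copy so the clean sample is untouched
    height, width = len(poisoned), len(poisoned[0])
    for r in range(height - patch_size, height):
        for c in range(width - patch_size, width):
            poisoned[r][c] = value
    return poisoned


def poison_dataset(
    samples: List[Tuple[Image, int]], target_label: int, poison_every: int = 2
) -> List[Tuple[Image, int]]:
    """Stamp the trigger on every `poison_every`-th sample and relabel it."""
    out = []
    for i, (image, label) in enumerate(samples):
        if i % poison_every == 0:
            out.append((stamp_trigger(image), target_label))
        else:
            out.append((image, label))
    return out


# Four blank 4x4 images with arbitrary clean labels (toy data).
clean = [([[0.0] * 4 for _ in range(4)], label) for label in (3, 5, 8, 1)]
poisoned = poison_dataset(clean, target_label=7)
print([label for _, label in poisoned])  # [7, 5, 7, 1]
```

A model trained on such data learns the clean task from the untouched samples and, in parallel, learns to associate the corner patch with label 7.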
Current defenses against these attacks generally fall into two categories: passive and proactive. Passive defenses try to detect anomalies in client updates after they’ve been submitted. However, these methods often struggle with the real-world complexities of FL, such as non-uniform data distributions across clients (non-i.i.d. data) and the random participation of clients in training rounds. These factors can make benign updates look suspicious, leading to many false alarms.
Proactive defenses, on the other hand, involve the server actively modifying the global model to provoke different reactions from malicious and benign clients. While a pioneering proactive method, BackdoorIndicator, showed promise, it faced a significant challenge: Out-of-Distribution (OOD) bias. Deep neural networks tend to make overconfident and biased predictions on data they haven’t been trained on (OOD data). Since proactive defenses often rely on OOD data for pattern injection or evaluation, this bias could cause honest clients to be mistakenly flagged as malicious, leading to a high false positive rate.
To address these critical limitations, researchers have introduced a novel proactive defense mechanism called Coward. This method is inspired by a new discovery: the “multi-backdoor collision effect.” This effect reveals that when distinct backdoors are planted consecutively, the newer ones can significantly suppress or erase earlier ones. Coward leverages this phenomenon by having the server inject a conflicting “global watermark” into the model.
The core idea of Coward inverts the logic of previous proactive methods. Instead of detecting attackers by looking for the *retention* of a planted pattern, Coward identifies attackers by evaluating whether the server-injected, conflicting global watermark is *erased* during local training. Benign clients, focused on their legitimate training tasks, will largely retain this watermark. Malicious clients, however, when attempting to implant their own backdoors, will inadvertently cause a collision with the server’s watermark, leading to its suppression or erasure. The server then measures each returned model’s accuracy on the watermark samples, and clients whose watermark accuracy falls below a chosen threshold are identified as malicious.
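The erasure check can be sketched in a few lines. The snippet below is a hedged illustration rather than the authors' implementation: local models are stood in for by plain Python functions, and the names `watermark_accuracy` and `flag_malicious`, the toy inputs, and the threshold of 0.5 are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Sequence

Predictor = Callable[[Sequence[float]], int]  # maps an input to a class label


def watermark_accuracy(
    predict: Predictor, watermark_inputs: List[Sequence[float]], watermark_label: int
) -> float:
    """Fraction of watermark inputs the model still maps to the watermark label."""
    hits = sum(1 for x in watermark_inputs if predict(x) == watermark_label)
    return hits / len(watermark_inputs)


def flag_malicious(
    local_models: Dict[str, Predictor],
    watermark_inputs: List[Sequence[float]],
    watermark_label: int,
    threshold: float = 0.5,  # assumed value; the article only says "a threshold"
) -> List[str]:
    """Flag clients whose local model has largely lost the server's watermark."""
    flagged = []
    for client_id, predict in local_models.items():
        acc = watermark_accuracy(predict, watermark_inputs, watermark_label)
        if acc < threshold:  # low accuracy means the watermark was erased
            flagged.append(client_id)
    return flagged


# Toy stand-ins: the first feature plays the role of the watermark trigger.
WATERMARK_LABEL = 9
watermark_inputs = [[1.0, 0.0], [1.0, 0.2], [1.0, 0.4], [1.0, 0.6]]


def benign_model(x: Sequence[float]) -> int:
    return 9 if x[0] == 1.0 else 0  # watermark behavior retained


def attacker_model(x: Sequence[float]) -> int:
    return 9 if x[1] > 0.5 else 2  # watermark behavior mostly overwritten


models = {"client_a": benign_model, "client_b": attacker_model}
print(flag_malicious(models, watermark_inputs, WATERMARK_LABEL))  # ['client_b']
```

Note the direction of the comparison: a high watermark accuracy counts as evidence of benign behavior, which is the inversion the paragraph above describes.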
This approach offers several key advantages. It preserves the benefits of proactive defenses in handling data heterogeneity, meaning it’s robust even when client data distributions vary widely. Crucially, by treating high watermark accuracy as a sign of benign behavior (rather than malicious), Coward naturally mitigates the adverse impact of OOD bias. The high confidence predictions that OOD bias often induces in benign clients now work *in favor* of detection, rather than against it.
The Coward method involves three main stages: watermark injection, watermark interaction, and watermark detection. During injection, the server carefully embeds a backdoor-based OOD watermark into the global model using a regulated base OOD mapping and a targeted watermark mapping. This process is designed to be robust and not distort the model’s primary task. When clients perform local training (watermark interaction), benign clients simply train on their data, while malicious clients inject their backdoors. The collision effect ensures that the malicious clients’ training interferes with the server’s watermark. Finally, the server performs watermark detection by inspecting the strength of the watermark in the updated local models, excluding those that show significant watermark degradation.
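The final stage, in which the server excludes clients and combines the rest, might look roughly like the sketch below. It assumes that each client's watermark accuracy has already been measured, that model updates are flat lists of floats, and that the retained updates are combined by an unweighted average; the article does not specify the aggregation rule, so that choice and the name `aggregate_retained` are assumptions.

```python
from typing import Dict, List


def aggregate_retained(
    updates: Dict[str, List[float]],
    watermark_acc: Dict[str, float],
    threshold: float,
) -> List[float]:
    """Average only the updates from clients that kept the watermark."""
    kept = [u for cid, u in updates.items() if watermark_acc[cid] >= threshold]
    if not kept:
        raise ValueError("no client passed the watermark check")
    # Coordinate-wise mean over the retained updates.
    return [sum(values) / len(kept) for values in zip(*kept)]


# Toy round: client "c" has erased the watermark and submits an outlier update.
updates = {"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [100.0, -100.0]}
watermark_acc = {"a": 0.95, "b": 0.90, "c": 0.10}
print(aggregate_retained(updates, watermark_acc, threshold=0.5))  # [2.0, 3.0]
```

In this toy round the outlier update from client "c" never reaches the global model, because the filter runs before aggregation.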
Extensive experiments on benchmark datasets like EMNIST, CIFAR-10, and CIFAR-100 confirm Coward’s effectiveness. It consistently outperforms existing passive and proactive defenses, demonstrating strong resistance to varying degrees of data heterogeneity, advanced stealthy attacks (like PGD, Neurotoxin, and Chameleon), and scenarios involving multiple attackers. The method also proves robust to different choices of OOD datasets, trigger types for both the watermark and the attack, and various detection thresholds. Furthermore, Coward shows resilience against potential adaptive attacks where attackers try to guess and mimic the watermark, as such attempts often lead to a “local collision contradiction” that compromises their own attack objectives.
In essence, Coward provides a practical and robust solution for securing federated learning against backdoor attacks. By leveraging the multi-backdoor collision effect and an inverted detection paradigm, it offers a new perspective for understanding and defending against these threats, paving the way for more secure and trustworthy decentralized AI. You can find more details about this research in the full paper available at https://arxiv.org/pdf/2508.02115.


