TL;DR: HISPASpoof is the first large-scale Spanish dataset for detecting and attributing synthetic speech. It addresses the gap in speech forensics, which has largely focused on English and Chinese. The research shows that detectors trained on English fail to generalize to Spanish, while training on HISPASpoof significantly improves performance. It also demonstrates the feasibility of attributing synthetic speech to its generator, even with challenges in open-set scenarios.
The rapid advance of artificial intelligence has produced strikingly realistic synthetic speech, often referred to as deepfakes. Technologies such as zero-shot voice cloning (VC) and text-to-speech (TTS) can now generate voices that are almost indistinguishable from human speech, mimicking its spectral, prosodic, and linguistic characteristics. While these innovations have beneficial applications in areas like virtual assistants and media production, they also raise serious concerns about misuse, including misinformation, impersonation, and fraud.
To combat these threats, the field of speech forensics has developed methods to detect synthetic speech and even attribute it to the specific synthesizer used. However, most of these efforts have historically focused on English and Chinese. This leaves a critical gap for other widely spoken languages, particularly Spanish, which has over 600 million speakers worldwide.
Addressing this crucial need, researchers have introduced HISPASpoof, the first large-scale Spanish dataset specifically designed for synthetic speech detection and attribution. This groundbreaking dataset provides a vital benchmark for developing more reliable and inclusive speech forensics tools for the Spanish-speaking world.
What is HISPASpoof?
HISPASpoof is a comprehensive dataset that includes both real and synthetic Spanish speech. The real speech samples are sourced from public corpora, covering six distinct Spanish accents: Colombian, Argentinian, Chilean, Mexican, Peruvian, and Peninsular. This ensures a broad representation of phonetic characteristics within the Spanish language. For synthetic speech, the dataset incorporates samples generated by six modern zero-shot TTS systems, which are capable of creating synthetic voices from just a few seconds of reference speech without requiring extensive speaker-specific training.
The dataset is structured into two main subsets: a detection subset, aimed at distinguishing between real and synthetic speech, and an attribution subset, designed to identify the specific method or synthesizer used to generate synthetic speech. It features a gender-balanced speaker distribution across the six accents and includes a robust division into training, validation, and test sets. Crucially, the test set includes both unseen speakers and unseen speech generators, allowing for a realistic evaluation of how well detection and attribution methods generalize to new, unknown voices and synthesis techniques.
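The unseen-speaker requirement means splits must be made at the speaker level, not the utterance level. A minimal sketch of such a speaker-disjoint split is below; the utterance/speaker pair format and the 80/20 fraction are assumptions for illustration, not the actual HISPASpoof metadata format or ratios.

```python
import random

def speaker_disjoint_split(utterances, train_frac=0.8, seed=0):
    # utterances: list of (utterance_id, speaker_id) pairs (hypothetical format).
    # Split at the speaker level so every test speaker is unseen in training,
    # mirroring the protocol described for HISPASpoof's test set.
    speakers = sorted({spk for _, spk in utterances})
    random.Random(seed).shuffle(speakers)
    cut = int(len(speakers) * train_frac)
    train_speakers = set(speakers[:cut])
    train = [u for u in utterances if u[1] in train_speakers]
    test = [u for u in utterances if u[1] not in train_speakers]
    return train, test
```

The same idea applies to generators in the attribution subset: holding out entire synthesis systems, rather than individual clips, is what makes the "unseen generator" evaluation meaningful.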
Key Findings from the Research
The research paper evaluates five representative synthetic speech detection methods using HISPASpoof and other existing datasets. The findings highlight several important points:
- English-Trained Detectors Fail on Spanish: When synthetic speech detectors trained exclusively on English datasets (such as ASVspoof2019) were tested on Spanish speech, their performance dropped sharply. This confirms that synthetic speech detection is language-sensitive: models optimized for one language do not transfer easily to another.
- HISPASpoof Improves Spanish Detection: Training these detectors on the HISPASpoof dataset substantially improved their performance on Spanish synthetic speech. This demonstrates the critical importance of having large-scale, language-specific datasets for effective training.
- Multilingual Training Helps, But is Not Enough: While training on multilingual datasets (like ODSS, which includes Spanish, English, and German) showed some improvement in generalization across languages, the performance on Spanish speech still lagged compared to training directly on HISPASpoof. This suggests that existing multilingual corpora might not have sufficient representation of Spanish phonological characteristics.
- Limited Data is a Barrier: Training detectors on a smaller Spanish subset of an existing dataset (ODSS Spanish subset) yielded inconsistent and often poor results, underscoring the need for extensive data for robust model training.
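Detection comparisons like those above are conventionally reported with the equal error rate (EER), the operating point where the false-acceptance and false-rejection rates meet. The paper's exact metrics and numbers are not reproduced here; this is just a minimal, self-contained sketch of how EER is computed from detector scores (higher score = more likely bona fide):

```python
def compute_eer(bona_scores, spoof_scores):
    # Sweep thresholds over all observed scores; EER is the point where the
    # false-acceptance rate (spoof accepted as real) equals the
    # false-rejection rate (real speech flagged as spoof).
    thresholds = sorted(set(bona_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < t for s in bona_scores) / len(bona_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

A detector whose score distributions for real and synthetic speech fully separate achieves an EER of 0; cross-lingual degradation shows up as those distributions overlapping on the new language.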
Attribution Capabilities
Beyond just detecting synthetic speech, the research also explored the more complex task of attribution – identifying which specific synthesizer created a given synthetic voice. The HISPASpoof dataset proved valuable here as well:
- High Accuracy for Known Synthesizers: In a “closed-set” scenario, where all synthesizers in the test set were known during training, the methods achieved near-perfect attribution performance.
- Challenges with Unknown Synthesizers: In an “open-set” scenario, which included synthetic speech from generators not seen during training, performance naturally dropped. However, some methods, particularly PaSST and Spec-ResNet, showed better generalizability to these unseen generators. Interestingly, the study noted that architecturally similar synthesizers, like XTTS-v1 and XTTS-v2, were often confused, highlighting a specific challenge in differentiating closely related generation methods.
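One common way to handle the open-set case is to let a closed-set classifier abstain: if its top softmax probability falls below a threshold, the clip is labeled as coming from an unknown generator. This is only an illustrative baseline, not the method used in the paper, and the generator names and threshold value below are placeholders:

```python
def attribute(probs, known_systems, threshold=0.5):
    # probs: softmax output over the known generators (sums to 1).
    # Reject as "unknown" when the model is not confident enough in any
    # known generator -- a simple open-set rule; stronger approaches use
    # calibrated or distance-based rejection scores.
    best = max(range(len(probs)), key=lambda i: probs[i])
    if probs[best] < threshold:
        return "unknown"
    return known_systems[best]
```

The XTTS-v1/XTTS-v2 confusion noted above shows the limit of this scheme: architecturally similar generators can yield confidently wrong top predictions, which a confidence threshold alone cannot catch.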
Conclusion and Future Directions
The introduction of HISPASpoof marks a significant step forward for speech forensics in Spanish. It provides a much-needed resource to develop and evaluate robust detection and attribution methods for synthetic Spanish speech. The findings clearly demonstrate that language-specific datasets are essential for effective synthetic speech detection. Future work will focus on training detectors in scenarios with scarce language-specific data and addressing the technical challenges of cross-lingual generalization through the systematic development of multilingual datasets and advanced training protocols.