TLDR: This research introduces a comprehensive ElasticSearch-based framework for indexing and analyzing entire large language model (LLM) training datasets, overcoming the computational limitations that restricted earlier studies to small samples. Applied to SwissAI’s FineWeb-2 corpus (1.5TB), it enables fast, real-time search and retrieval of problematic content, offering practical tools for safer and more accountable AI systems. The framework supports various query types, from exact phrase matching to fuzzy and semantic searches, and has been successfully deployed despite significant technical challenges.
Large language models (LLMs) have become highly capable, largely due to their training on massive web-scale datasets like Common Crawl. However, the sheer volume and indiscriminate nature of web crawling mean these datasets inevitably contain undesirable content, including hate speech, sexually explicit material, misinformation, copyrighted data, and personally identifiable information. This problematic content can then propagate into the LLMs’ outputs, posing significant challenges for data quality, safety, and ethics.
Previous research into harmful content in these datasets has been limited. Due to computational constraints, studies often relied on small samples, such as analyzing only 1% of Common Crawl files. While these samples might be statistically representative, they can miss important patterns, rare toxic clusters, or domain-specific biases that only become apparent when analyzing the entire dataset.
To address these critical limitations, a new project introduces a comprehensive framework for indexing and analyzing complete LLM training datasets. This framework, built on ElasticSearch, moves beyond sampling to enable thorough, real-time analysis of vast corpora. It’s designed to ingest large training datasets and create searchable indexes that support various query types, including exact phrase matching, approximate matching with fuzzy search, and semantic similarity search. These capabilities can be combined using complex boolean logic, creating a powerful tool for ensuring LLM user safety and robust dataset governance.
Indexing and Analysis in Action
The framework was applied to SwissAI’s multilingual FineWeb-2 corpus, which totals over 1.5TB across four languages (Italian, German, Swiss German, and French). The indexing pipeline is designed for efficiency, streaming parquet files to prevent memory accumulation and extracting text content along with URL metadata for provenance tracking. It uses a multi-level text processing approach, creating different searchable representations of each document, from heavily normalized text for semantic search to exact matches for preserving original structure.
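The multi-level text processing described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual pipeline: the function names, the field names (`text_exact`, `text_light`, `text_heavy`), and the specific normalization steps are all assumptions chosen for the example.

```python
import re
import unicodedata


def build_document(text: str, url: str) -> dict:
    """Return one index-ready document with several views of the same text.

    Field names and normalization rules are hypothetical, for illustration.
    """
    # Exact view: the original text, untouched, so structure is preserved.
    exact = text
    # Light view: Unicode-normalized with whitespace runs collapsed.
    light = re.sub(r"\s+", " ", unicodedata.normalize("NFKC", text)).strip()
    # Heavy view: lowercased, accents stripped, punctuation removed.
    decomposed = unicodedata.normalize("NFKD", light.lower())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    heavy = re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", no_accents)).strip()
    return {"url": url, "text_exact": exact, "text_light": light, "text_heavy": heavy}


def stream_documents(rows):
    """Lazily convert an iterable of row dicts, one document at a time."""
    for row in rows:
        yield build_document(row["text"], row["url"])
```

Because `stream_documents` is a generator, only one row is held in memory at a time, which mirrors the streaming idea. In a real pipeline the rows would likely come from a streaming Parquet reader rather than an in-memory list, and accent stripping may be too aggressive for languages such as German, where it can change a word's meaning.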
The system employs a distributed architecture with configurable sharding and parallel processing, allowing multiple workers to process different portions of the dataset simultaneously. Performance optimizations, such as bulk indexing and dynamic refresh interval adjustments, maximize throughput during large-scale operations. For instance, the German dataset (634GB) was indexed at 79.25 GB/hour, completing in about 8 hours. The framework successfully indexed over 1.5TB of multilingual web content while maintaining a low memory footprint, demonstrating its viability for large-scale document retrieval in resource-constrained environments.
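As a rough sketch of the two optimizations mentioned here, the Python below groups documents into fixed-size batches for bulk indexing and defines the index settings commonly used to pause and restore refreshes around a large load. The helper names, the batch size, and the `1s` restore value are assumptions; the report does not state the exact values used.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items without materializing the input."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def bulk_actions(docs, index_name):
    """Wrap each document in the action format used for bulk indexing."""
    for doc in docs:
        yield {"_index": index_name, "_source": doc}


# Index settings bodies: "-1" disables periodic refresh during the load,
# and a normal interval is restored afterwards (the value is an assumption).
REFRESH_OFF = {"index": {"refresh_interval": "-1"}}
REFRESH_ON = {"index": {"refresh_interval": "1s"}}


def hours_to_index(size_gb, throughput_gb_per_hour):
    """Estimate wall-clock hours from dataset size and sustained throughput."""
    return size_gb / throughput_gb_per_hour
```

Calling `hours_to_index(634, 79.25)` with the figures reported for the German dataset gives roughly 8 hours, which is a quick sanity check on the quoted throughput.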
Powerful Search Capabilities
The ElasticSearch search component provides a flexible query execution framework. It can take predefined collections of search terms, such as lists of curse words, slurs, or misinformation keywords, and execute queries against the entire indexed dataset. Beyond simply counting hits, the pipeline extracts highlighted text snippets, providing context for how terms are used and how meaning varies across documents. This allows researchers to understand not just if problematic content exists, but also its nature and distribution.
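To make the hit-counting and snippet-extraction step concrete, here is a minimal Python sketch. It builds a search body that asks for highlights and then pulls the hit count and highlighted fragments out of a response dictionary. The field name `text` and both function names are hypothetical, and the response shape assumes a recent ElasticSearch version in which `hits.total` is an object with a `value` key.

```python
def build_highlight_request(term, field="text"):
    """Search body: match `term` and ask for highlighted fragments."""
    return {
        "query": {"match": {field: term}},
        "highlight": {"fields": {field: {}}},
    }


def extract_snippets(response, field="text"):
    """Return (total_hits, [(doc_id, fragment), ...]) from a search response."""
    total = response["hits"]["total"]["value"]
    snippets = []
    for hit in response["hits"]["hits"]:
        # Hits with no highlight for this field simply contribute nothing.
        for fragment in hit.get("highlight", {}).get(field, []):
            snippets.append((hit["_id"], fragment))
    return total, snippets
```

Running one such request per entry in a term list yields both a count and the surrounding context for each term, which is the information needed to judge how a word is actually being used.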
The system supports various query types:
- Match Query: Uses OR logic between terms and applies full-text analysis (stemming, lowercasing) for broad matches.
- Match Phrase Query: Requires exact phrase matching with configurable proximity tolerance, useful for detecting specific phrases.
- Term Query (Exact): Bypasses text analysis to search for exact tokens, ideal for proper nouns or technical terms.
- Fuzzy Query: Handles typographical errors and variations using edit distance calculations.
- Boolean Must Query: Creates structured boolean queries with configurable logic and scoring control.
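The query types listed above correspond to standard ElasticSearch Query DSL clauses, which can be written as plain Python dictionaries. The sketch below is illustrative only: the builder function names and the `scored` flag are assumptions, and the field you pass in depends on how the index was mapped.

```python
def match_query(field, text):
    """Analyzed full-text match; OR logic between the analyzed terms."""
    return {"match": {field: {"query": text, "operator": "or"}}}


def match_phrase_query(field, phrase, slop=0):
    """Phrase match; `slop` is the allowed positional distance between terms."""
    return {"match_phrase": {field: {"query": phrase, "slop": slop}}}


def term_query(field, token):
    """Exact token lookup that bypasses text analysis."""
    return {"term": {field: {"value": token}}}


def fuzzy_query(field, token, fuzziness="AUTO"):
    """Edit-distance match that tolerates typos and spelling variants."""
    return {"fuzzy": {field: {"value": token, "fuzziness": fuzziness}}}


def bool_must_query(*clauses, scored=True):
    """Require every clause; `filter` is used instead of `must` to skip scoring."""
    key = "must" if scored else "filter"
    return {"bool": {key: list(clauses)}}
```

For example, `bool_must_query(match_phrase_query("text", "some phrase", slop=1), fuzzy_query("text", "exmaple"))` combines a near-exact phrase with a typo-tolerant term in a single request body.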
Performance tests showed that even longer phrases (up to 300 words) could be queried with low latency on smaller indexes, which supports using the framework to detect potential memorization of training data passages by LLMs.
Real-World Application and Technical Considerations
The project demonstrated its utility by searching the SwissAI FineWeb-2 dataset with the Weaponized Words dictionary to identify potentially harmful or offensive terms. An initial analysis of a filtered version of the FineWeb-edu-score 2 dataset also searched for obscene words and misinformation keywords.
Deployment on the CSCS Alps Clariden cluster presented several technical challenges, including Docker incompatibility, memory mapping limitations, and network binding issues. These were overcome through custom container image construction, disabling memory mapping (with some performance implications), and explicit network configuration to bypass proxies and bind to localhost. Future directions include dynamic sharding, adaptive chunk sizing, and leveraging SLURM environment variables for true distributed indexing across multiple nodes.
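The SLURM-based direction mentioned above could look something like the following Python sketch, which splits a list of input files across tasks in a round-robin fashion. `SLURM_PROCID` and `SLURM_NTASKS` are standard SLURM environment variables, but the function name and the partitioning scheme are assumptions for illustration, not the project's design.

```python
import os


def files_for_this_task(files, env=None):
    """Return the subset of `files` that the current SLURM task should index.

    Falls back to a single task (rank 0 of 1) when the variables are unset,
    so the same code also runs outside of SLURM.
    """
    env = os.environ if env is None else env
    rank = int(env.get("SLURM_PROCID", "0"))
    world = int(env.get("SLURM_NTASKS", "1"))
    # Sort first so every task sees the same order and the split is disjoint.
    return [f for i, f in enumerate(sorted(files)) if i % world == rank]
```

Each task would then index only its own share of the files, so no coordination between nodes is needed beyond agreeing on the file list.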
Compared to other methods like Bloom filters or Infinigram, ElasticSearch offers significantly more advanced search capabilities, including semantic understanding, complex boolean queries, and the ability to retrieve document IDs, term positions, and frequencies, which are crucial for comprehensive content analysis.
This work represents a significant step towards more transparent, searchable, and auditable data pipelines, positioning Switzerland and its SwissAI model as leaders in responsible AI development. By providing tools to thoroughly understand and audit training data, it helps ensure that AI systems are built on ethically-sourced and legally-compliant datasets, fostering public trust and establishing international standards for responsible AI governance. You can read the full technical report here: Going over Fine Web with a Fine-Tooth Comb.


