Boosting Phishing Detection: Introducing the PhreshPhish Dataset and Benchmarks

TLDR: PhreshPhish is a new, large, high-quality dataset of phishing and benign websites designed to improve machine learning-based phishing detection. It addresses common issues in existing datasets like poor quality, data leakage, and unrealistic phishing rates. The paper also introduces benchmark datasets for realistic model evaluation and provides baseline performance results using various machine learning models, highlighting that model performance significantly decreases at real-world phishing rates.

Phishing attacks continue to be a major threat in the digital world, causing significant financial and reputational damage. While machine learning has shown promise in detecting these attacks in real time, progress has been hampered by a lack of large, high-quality datasets and standardized ways to evaluate models.

Existing datasets often suffer from several issues: poor quality due to challenges in data collection, data leakage (where training and testing data are too similar), and unrealistic base rates (the proportion of phishing sites compared to legitimate ones), which can lead to overly optimistic performance results for detection models.

To address these limitations, researchers have introduced PhreshPhish, a new, large-scale, high-quality dataset of phishing websites. It is significantly larger than other publicly available datasets and of much higher quality, where quality is measured by the estimated rate of invalid or mislabeled data points.

Beyond just the dataset, the paper also proposes a comprehensive set of benchmark datasets. These benchmarks are specifically designed for realistic model evaluation by minimizing data leakage, increasing the difficulty of the detection task, enhancing dataset diversity, and adjusting base rates to reflect what is more likely to be seen in the real world. The availability of PhreshPhish and its benchmarks is expected to enable more realistic and standardized comparisons of phishing detection models, fostering further advancements in the field.

The PhreshPhish dataset was collected over eight months, from July 2024 to March 2025, gathering a wide range of phishing and benign URLs. Phishing URLs were sourced from reputable feeds like PhishTank, the Anti-Phishing Working Group (APWG) eCrime eXchange, and NetCraft. Benign URLs came from anonymized browsing data of millions of Webroot users and Google search results for popular brands, ensuring a realistic sample of legitimate pages.

Collecting this data presented unique challenges because phishing pages are often short-lived, dynamic, and employ techniques like cloaking to avoid detection. To overcome these hurdles, the researchers developed a robust scraping pipeline that uses real browser instances (leveraging Selenium) to capture dynamic content. This approach helps mitigate issues like cloaking (where attackers show different content to scrapers) and ephemerality (pages being taken down quickly).

After collection, the raw data underwent a rigorous cleaning process. This involved both automated and manual steps. URLs were normalized and deduplicated, and pages with titles indicating scraping failures (like “404 Not Found”) were removed. Human annotators then manually inspected representative pages from groups of similar content, ensuring high data quality. Special care was also taken to remove personally identifiable information (PII) from benign pages.
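The automated part of this cleaning step can be illustrated with a minimal sketch. It is not the authors' pipeline: the normalization rules, the `FAILURE_TITLE_MARKERS` list, and the `url`/`title` record fields are all assumptions chosen for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical title fragments that suggest a failed scrape (assumption).
FAILURE_TITLE_MARKERS = ("404 not found", "403 forbidden", "access denied")


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment, strip a trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def clean_pages(pages):
    """Keep the first page per normalized URL; skip scrape-failure titles."""
    seen = set()
    kept = []
    for page in pages:
        title = page.get("title", "").lower()
        if any(marker in title for marker in FAILURE_TITLE_MARKERS):
            continue  # title looks like an error page, not real content
        key = normalize_url(page["url"])
        if key in seen:
            continue  # duplicate of a URL that was already kept
        seen.add(key)
        kept.append({**page, "url": key})
    return kept


if __name__ == "__main__":
    sample = [
        {"url": "https://Example.com/a/", "title": "Sign in"},
        {"url": "https://example.com/a", "title": "Sign in"},
        {"url": "https://example.com/b", "title": "404 Not Found"},
    ]
    print(clean_pages(sample))
```

Exact-match deduplication like this only catches trivially repeated URLs. The manual inspection of groups of similar pages described above addresses near-duplicates that a rule like this would miss.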

The final PhreshPhish dataset comprises approximately 372,000 data points, with 253,000 benign pages and 119,000 phishing pages.

Realistic Model Evaluation

A key contribution of this work is the creation of realistic test and benchmark datasets. Unlike many existing datasets that lead to inflated performance metrics, PhreshPhish’s test set is temporally separated from the training data and minimizes leakage. The benchmark datasets go further by incorporating difficulty filters, diversity enhancements, and, crucially, varying base rates ranging from 0.05% to 5%. This is vital because real-world phishing detectors operate at much lower base rates than typically found in academic datasets, and model performance is highly sensitive to this factor.
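One simple way to build a test set at a chosen base rate is to keep every benign sample and subsample the phishing samples. The sketch below shows only that idea. The function name and interface are hypothetical, and the actual benchmark construction also applies difficulty and diversity filters that are not modeled here.

```python
import random


def resample_to_base_rate(benign, phishing, base_rate, seed=0):
    """Keep all benign samples and subsample phishing so that
    phishing / (phishing + benign) is approximately base_rate."""
    if not 0 < base_rate < 1:
        raise ValueError("base_rate must be strictly between 0 and 1")
    # Solve p / (p + b) = r for p, given b benign samples.
    n_phish = round(base_rate * len(benign) / (1 - base_rate))
    if n_phish > len(phishing):
        raise ValueError("not enough phishing samples for this base rate")
    rng = random.Random(seed)  # fixed seed keeps the subsample reproducible
    return benign, rng.sample(phishing, n_phish)


if __name__ == "__main__":
    benign = [f"benign-{i}" for i in range(9950)]
    phishing = [f"phish-{i}" for i in range(1000)]
    kept_benign, kept_phish = resample_to_base_rate(benign, phishing, 0.005)
    rate = len(kept_phish) / (len(kept_phish) + len(kept_benign))
    print(len(kept_phish), rate)
```

Subsampling the phishing class rather than oversampling the benign class avoids repeating pages in the test set, at the cost of evaluating on fewer phishing examples at the lowest base rates.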

Baseline Performance

To demonstrate the utility of their dataset and benchmarks, the researchers evaluated several common machine learning approaches for real-time phishing detection. These included a linear model, a shallow feedforward neural network (FNN), a BERT-based model fine-tuned on raw HTML and URLs, and a large language model (LLM), gpt-4o-mini, used in a zero-shot prediction setting.

The evaluation showed that while most models performed well on the standard test set (with a higher base rate of 18.3%), their performance significantly dropped as the base rate decreased to more realistic values. The BERT-based model, referred to as GTE, generally performed the best across different base rates. The results underscore that there is significant room for improvement in phishing detection models, especially when evaluated under real-world conditions where false positives (blocking legitimate sites) are highly undesirable.
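The sensitivity to base rate follows from Bayes' rule: for a fixed true positive rate and false positive rate, precision falls as phishing pages become rarer. The short calculation below illustrates this with hypothetical operating-point values (a true positive rate of 0.9 and a false positive rate of 0.01). These numbers are assumptions for illustration and are not results from the paper.

```python
def precision_at_base_rate(tpr: float, fpr: float, base_rate: float) -> float:
    """Precision implied by a fixed TPR/FPR operating point at a given base rate.

    precision = TPR * r / (TPR * r + FPR * (1 - r)), where r is the base rate.
    """
    flagged_phish = tpr * base_rate          # share of traffic correctly flagged
    flagged_benign = fpr * (1 - base_rate)   # share of traffic wrongly flagged
    total_flagged = flagged_phish + flagged_benign
    if total_flagged == 0:
        raise ValueError("the detector flags nothing at this operating point")
    return flagged_phish / total_flagged


if __name__ == "__main__":
    # Hypothetical operating point, not a result reported in the paper.
    tpr, fpr = 0.9, 0.01
    for rate in (0.183, 0.05, 0.005, 0.0005):
        p = precision_at_base_rate(tpr, fpr, rate)
        print(f"base rate {rate:.2%}: precision {p:.3f}")
```

Under these assumed values, precision is roughly 0.95 at the 18.3% base rate of the standard test set but only about 0.04 at a 0.05% base rate, which means most flagged pages would be legitimate sites.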

PhreshPhish represents a significant step forward in providing the cybersecurity community with a robust, high-quality resource for developing and evaluating advanced phishing detection systems. The dataset and benchmarks are publicly available on Hugging Face, and the code used for scraping, processing, and evaluating the baselines is on GitHub. The creators also plan to release periodic updates to keep the dataset current. You can find more details in the full research paper: PhreshPhish: A Real-World, High-Quality, Large-Scale Phishing Website Dataset and Benchmark.

Meera Iyer
https://blogs.edgentiq.com
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist in a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
