Boosting Phishing Detection: Introducing the PhreshPhish Dataset and Benchmarks

TLDR: PhreshPhish is a new, large, high-quality dataset of phishing and benign websites designed to improve machine learning-based phishing detection. It addresses common issues in existing datasets like poor quality, data leakage, and unrealistic phishing rates. The paper also introduces benchmark datasets for realistic model evaluation and provides baseline performance results using various machine learning models, highlighting that model performance significantly decreases at real-world phishing rates.

Phishing attacks continue to be a major threat in the digital world, causing significant financial and reputational damage. While machine learning has shown promise in detecting these attacks in real time, progress has been hampered by a lack of large, high-quality datasets and standardized ways to evaluate models.

Existing datasets often suffer from several issues: poor quality due to challenges in data collection, data leakage (where training and testing data are too similar), and unrealistic base rates (the proportion of phishing sites compared to legitimate ones), which can lead to overly optimistic performance results for detection models.

To address these limitations, researchers have introduced PhreshPhish, a new, large-scale, high-quality dataset of phishing websites. It is significantly larger than other publicly available datasets and of much higher quality, where quality is measured by the estimated rate of invalid or mislabeled data points.

Beyond just the dataset, the paper also proposes a comprehensive set of benchmark datasets. These benchmarks are specifically designed for realistic model evaluation by minimizing data leakage, increasing the difficulty of the detection task, enhancing dataset diversity, and adjusting base rates to reflect what is more likely to be seen in the real world. The availability of PhreshPhish and its benchmarks is expected to enable more realistic and standardized comparisons of phishing detection models, fostering further advancements in the field.

The PhreshPhish dataset was collected over eight months, from July 2024 to March 2025, gathering a wide range of phishing and benign URLs. Phishing URLs were sourced from reputable feeds such as PhishTank, the Anti-Phishing Working Group (APWG) eCrime eXchange, and Netcraft. Benign URLs came from anonymized browsing data of millions of Webroot users and from Google search results for popular brands, to provide a realistic sample of legitimate pages.

Collecting this data presented unique challenges because phishing pages are often short-lived, dynamic, and employ techniques like cloaking to avoid detection. To overcome these hurdles, the researchers developed a robust scraping pipeline that uses real browser instances (leveraging Selenium) to capture dynamic content. This approach helps mitigate issues like cloaking (where attackers show different content to scrapers) and ephemerality (pages being taken down quickly).

After collection, the raw data underwent a rigorous cleaning process. This involved both automated and manual steps. URLs were normalized and deduplicated, and pages with titles indicating scraping failures (like “404 Not Found”) were removed. Human annotators then manually inspected representative pages from groups of similar content, ensuring high data quality. Special care was also taken to remove personally identifiable information (PII) from benign pages.
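The automated part of this cleaning step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the record layout (`url` and `title` keys), the `FAILURE_TITLE_MARKERS` list, and the specific normalization rules are all assumptions made for the example.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical title fragments that signal a failed scrape (assumption).
FAILURE_TITLE_MARKERS = ("404 not found", "403 forbidden", "access denied")


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment, strip trailing slashes."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def clean(records):
    """Keep the first record per normalized URL, skipping failed scrapes."""
    seen = set()
    kept = []
    for record in records:
        title = record.get("title", "").lower()
        # Drop pages whose title indicates the scrape did not reach real content.
        if any(marker in title for marker in FAILURE_TITLE_MARKERS):
            continue
        key = normalize_url(record["url"])
        # Deduplicate on the normalized URL.
        if key in seen:
            continue
        seen.add(key)
        kept.append({**record, "url": key})
    return kept
```

Exact-match deduplication like this only catches URLs that differ trivially. Grouping pages with similar content, as described above, would need a separate clustering step before manual review.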

The final PhreshPhish dataset comprises approximately 372,000 data points, with 253,000 benign pages and 119,000 phishing pages.

Realistic Model Evaluation

A key contribution of this work is the creation of realistic test and benchmark datasets. Unlike many existing datasets that lead to inflated performance metrics, PhreshPhish’s test set is temporally separated from the training data and minimizes leakage. The benchmark datasets go further by incorporating difficulty filters, diversity enhancements, and, crucially, varying base rates ranging from 0.05% to 5%. This is vital because real-world phishing detectors operate at much lower base rates than typically found in academic datasets, and model performance is highly sensitive to this factor.
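One simple way to obtain a test set at a chosen base rate is to keep every benign sample and subsample the phishing samples. The sketch below shows that idea only. The article does not describe how the benchmark base rates were actually produced, so the function name `resample_to_base_rate` and the subsampling strategy are assumptions.

```python
import random


def resample_to_base_rate(benign, phishing, base_rate, seed=0):
    """Keep all benign samples and subsample phishing to reach base_rate.

    base_rate is the target fraction of phishing samples in the result.
    Both inputs are lists; the function is an illustrative assumption,
    not the benchmark's documented construction.
    """
    if not 0 < base_rate < 1:
        raise ValueError("base_rate must be in (0, 1)")
    # Solve n_phish / (n_phish + n_benign) = base_rate for n_phish.
    n_phish = round(base_rate * len(benign) / (1 - base_rate))
    if n_phish > len(phishing):
        raise ValueError("not enough phishing samples for this base rate")
    rng = random.Random(seed)  # fixed seed keeps the subsample reproducible
    return benign + rng.sample(phishing, n_phish)
```

At a 0.5% base rate, for example, 1,990 benign pages would be paired with only 10 phishing pages. This is why low-base-rate benchmarks need a large benign pool to give stable estimates.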

Baseline Performance

To demonstrate the utility of their dataset and benchmarks, the researchers evaluated several common machine learning approaches for real-time phishing detection. These included a linear model, a shallow feedforward neural network (FNN), a BERT-based model fine-tuned on raw HTML and URLs, and a large language model (LLM), gpt-4o-mini, used in a zero-shot prediction setting.

The evaluation showed that while most models performed well on the standard test set (which has a relatively high base rate of 18.3%), their performance dropped significantly as the base rate decreased to more realistic values. The BERT-based model, referred to as GTE, generally performed best across base rates. The results underscore that there is significant room for improvement in phishing detection models, especially under real-world conditions, where false positives (blocking legitimate sites) are highly undesirable.
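The sensitivity to base rate follows directly from Bayes' rule: for a fixed true positive rate and false positive rate, precision falls as phishing pages become rarer. The short sketch below makes this concrete. The operating point used in the usage note (90% true positive rate, 1% false positive rate) is a hypothetical illustration, not a result from the paper.

```python
def precision_at_base_rate(tpr, fpr, base_rate):
    """Precision implied by a fixed TPR and FPR at a given phishing base rate.

    precision = TPR * pi / (TPR * pi + FPR * (1 - pi)), where pi is the base rate.
    """
    flagged_phish = tpr * base_rate          # phishing pages correctly flagged
    flagged_benign = fpr * (1 - base_rate)   # benign pages wrongly flagged
    total_flagged = flagged_phish + flagged_benign
    if total_flagged == 0:
        raise ValueError("precision is undefined when nothing is flagged")
    return flagged_phish / total_flagged
```

With the hypothetical operating point above, precision is roughly 0.95 at an 18.3% base rate but only about 0.04 at a 0.05% base rate. In other words, the same classifier would go from mostly correct alerts to mostly false alarms.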

PhreshPhish represents a significant step forward in providing the cybersecurity community with a robust, high-quality resource for developing and evaluating advanced phishing detection systems. The dataset and benchmarks are publicly available on Hugging Face, and the code used for scraping, processing, and evaluating the baselines is on GitHub. The creators also plan to release periodic updates to keep the dataset current. You can find more details in the full research paper, “PhreshPhish: A Real-World, High-Quality, Large-Scale Phishing Website Dataset and Benchmark.”
