TLDR: This paper compares traditional Machine Learning (ML), Deep Learning (DL), and quantized Large Language Models (LLMs) for phishing detection. It finds that while LLMs currently have lower raw accuracy than ML/DL, they are better at detecting subtle, context-based phishing cues and are more resilient to LLM-rephrased attacks. Models like DeepSeek R1 Distill Qwen 14B achieve over 80% accuracy with reasonable VRAM. ML/DL models are significantly faster for inference, making a hybrid approach (ML/DL for bulk, LLMs for complex cases) the most efficient and effective solution, balancing accuracy, interpretability, and resource consumption.
In the rapidly evolving landscape of cyber threats, phishing attacks have become increasingly sophisticated, leveraging advancements in Artificial Intelligence (AI) to craft highly deceptive messages. This escalation necessitates equally advanced detection systems that are not only accurate but also computationally efficient. A recent research paper delves into this challenge, offering a comprehensive comparison of traditional Machine Learning (ML), Deep Learning (DL), and the emerging field of quantized small-parameter Large Language Models (LLMs) for identifying phishing attempts.
The study highlights that while LLMs currently may not match the raw accuracy of established ML and DL methods, they possess a unique strength: their ability to discern subtle, context-based phishing cues. This capability is crucial in an era where attackers use AI to rephrase emails, making them harder for conventional detectors to spot. The research also explores the impact of different prompting strategies on LLM performance, revealing that AI-rephrased emails can significantly degrade the effectiveness of both ML and LLM-based detectors.
One of the key findings from the benchmarking is the viability of models like DeepSeek R1 Distill Qwen 14B (Q8_0). This model achieved a competitive accuracy of over 80% while using only 17 GB of VRAM, making it a promising candidate for cost-efficient deployment in real-world scenarios. The paper further examines the adversarial robustness of these models and their cost-performance trade-offs. It demonstrates how lightweight LLMs can provide clear, interpretable explanations for their decisions, which is invaluable for real-time decision-making in cybersecurity operations.
The Evolving Threat Landscape
Phishing, a pervasive cyberthreat, manipulates users into revealing sensitive information. The 2020s have seen a surge in its complexity, moving beyond simple deceptive emails to multi-vector attacks across various platforms, including social media and text messages. Attackers are now employing AI-generated content and deepfake techniques, exploiting human vulnerabilities with unprecedented precision. While ML and Natural Language Processing (NLP) have shown promise in detection, the rise of Generative AI (GenAI) and LLMs has further intensified the threat, enabling automated and highly targeted phishing campaigns.
Previous research has indicated that traditional phishing detectors, such as Gmail's spam filter, often see a decline in accuracy when faced with LLM-rephrased emails. Interestingly, LLMs themselves classified these rephrased emails more reliably. However, the computational cost of training and deploying large LLMs is substantial, raising questions about their practical applicability given their energy consumption and resource demands.
Comparing Detection Approaches
The research conducted a detailed comparative analysis across different model types:
- Traditional Machine Learning Models: Models such as Random Forest, Logistic Regression, Naive Bayes, and Support Vector Machines (SVM) were evaluated. Random Forest showed the highest accuracy (98.01%), while Logistic Regression offered an excellent balance of accuracy (97.04%) and the fastest inference of the group, making it highly suitable for real-time applications.
- Deep Learning Models: These models, including variants of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) augmented with LSTM and GRU layers, demonstrated superior capabilities in extracting complex patterns. They achieved very high accuracy, often exceeding 98%. The study found that using a Leaky ReLU activation function further enhanced performance, with the Bi-Directional GRU model achieving an impressive 98.77% accuracy.
- Small Quantized LLMs: The study focused on quantized small-parameter LLMs to address the resource intensity of larger models. Models like DeepSeek R1 Distill Qwen 14B achieved accuracy of around 80%. While lower than the best ML/DL models, these LLMs offer the unique advantage of providing human-readable explanations for their classifications, which can significantly help users understand why an email is flagged as suspicious.
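The classical baseline in the first bullet can be sketched in a few lines: TF-IDF text features feeding a Logistic Regression classifier. This is a minimal illustration, not the paper's pipeline; the toy emails and labels below are placeholders.

```python
# Minimal sketch of a classical ML phishing detector: TF-IDF + Logistic
# Regression. The four toy emails are illustrative, not the study's dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

emails = [
    "Urgent: verify your account now or it will be suspended",
    "Your invoice for last month's subscription is attached",
    "Click this link to claim your free prize immediately",
    "Meeting moved to 3pm, see the updated calendar invite",
]
labels = [1, 0, 1, 0]  # 1 = phishing, 0 = legitimate

model = make_pipeline(TfidfVectorizer(lowercase=True),
                      LogisticRegression(max_iter=1000))
model.fit(emails, labels)

# Score a new message; probs[1] is the phishing probability.
probs = model.predict_proba(["Verify your account to claim a prize"])[0]
print(f"phishing probability: {probs[1]:.2f}")
```

Inference here is a single sparse matrix-vector product, which is why this family of models dominates the speed comparison later in the article.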
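The Leaky ReLU activation credited with the deep-learning accuracy gain is simple to state: unlike plain ReLU, it passes a small slope for negative inputs instead of zeroing them, which helps avoid "dead" units during training. A one-function sketch (the slope `alpha=0.01` is a common default; the paper's exact setting is not stated here):

```python
# Leaky ReLU: identity for positive inputs, small linear slope for
# negative inputs (vs. plain ReLU, which outputs exactly zero there).
def leaky_relu(x: float, alpha: float = 0.01) -> float:
    return x if x > 0 else alpha * x

print(leaky_relu(2.0))   # → 2.0   (positive inputs pass through unchanged)
print(leaky_relu(-2.0))  # → -0.02 (negative inputs are scaled, not zeroed)
```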
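Using a small LLM as a classifier mostly comes down to prompt construction and parsing the free-form answer back into a label. The sketch below shows that plumbing; the prompt wording, the `classify_with_llm` helper, and the `generate` callback are all hypothetical — plug in whatever runtime serves your quantized model (llama.cpp, Ollama, etc.).

```python
# Sketch of the prompt-and-parse plumbing around a local quantized LLM.
# `generate` stands in for a call to a real model runtime.
PROMPT_TEMPLATE = (
    "You are an email security analyst. Classify the email below as "
    "PHISHING or LEGITIMATE, then explain your reasoning in one sentence.\n\n"
    "Email:\n{email}\n\nAnswer:"
)

def parse_verdict(llm_output: str) -> str:
    """Extract a label from free-form model output."""
    text = llm_output.upper()
    if "PHISHING" in text:
        return "phishing"
    if "LEGITIMATE" in text:
        return "legitimate"
    return "uncertain"  # route to human review rather than guessing

def classify_with_llm(email: str, generate) -> str:
    """`generate` is any callable that sends a prompt to the model runtime."""
    return parse_verdict(generate(PROMPT_TEMPLATE.format(email=email)))

# Stub runtime standing in for a real quantized model:
verdict = classify_with_llm(
    "Your account is locked, confirm your password here",
    generate=lambda p: "PHISHING. It pressures the user to reveal credentials.",
)
print(verdict)
```

The one-sentence explanation requested in the prompt is what gives this approach its interpretability advantage: the raw model output can be surfaced to the user alongside the label.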
Adversarial Robustness and Resource Consumption
A critical aspect of the study was evaluating how well models perform against adversarial attacks, particularly LLM-rephrased phishing emails. Traditional ML models experienced a notable drop in accuracy when confronted with such rephrased content. For instance, Naive Bayes and Logistic Regression saw accuracy declines of over 5 percentage points. In contrast, LLMs like GPT-4 demonstrated greater resilience, with a more modest decrease in accuracy, highlighting their potential in countering advanced evasion strategies.
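The robustness check described above reduces to scoring a detector twice, once on the original emails and once on their LLM-rephrased counterparts, and reporting the accuracy decline in percentage points. A minimal sketch with illustrative predictions (not the paper's data):

```python
# Measure the accuracy drop (in percentage points) a detector suffers
# when phishing emails are rephrased by an LLM.
def accuracy(preds, truth):
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

def robustness_drop(preds_original, preds_rephrased, truth):
    """Accuracy decline, in percentage points, caused by rephrasing."""
    return 100 * (accuracy(preds_original, truth)
                  - accuracy(preds_rephrased, truth))

truth           = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
preds_original  = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]  # perfect on originals
preds_rephrased = [1, 0, 1, 0, 0, 0, 0, 1, 1, 0]  # two phishing emails slip through

drop = robustness_drop(preds_original, preds_rephrased, truth)
print(f"accuracy drop: {drop:.1f} pp")
```

On this toy data the drop is 20 percentage points; the study's observed declines (over 5 points for Naive Bayes and Logistic Regression, less for GPT-4) would be computed the same way.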
Resource consumption was a major consideration. The study found that ML models are vastly more efficient for inference, being approximately 1.38 million times faster than LLMs and 1308 times faster than DL models. This stark difference underscores that using LLMs to check every email is not computationally feasible for high-volume scenarios. The environmental impact of large LLMs, with their significant energy and water consumption, also raises sustainability concerns.
A Hybrid Path Forward
The research concludes that while ML and DL models excel in raw accuracy and speed for the majority of phishing cases, LLMs offer unique benefits in handling complex, context-driven attacks and providing interpretable explanations. This suggests that the optimal solution for modern phishing detection systems is a hybrid approach. In such a framework, efficient ML and DL models would handle the bulk of email traffic, while a fine-tuned, small LLM would be deployed for the more challenging or ambiguous cases that require deeper contextual reasoning and human-understandable insights. This strategy would optimize accuracy, cost-efficiency, and environmental impact simultaneously.
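The hybrid routing logic can be sketched as a confidence gate: the fast ML model scores every email, and only low-confidence cases escalate to the slower LLM. The threshold value and both model stubs below are illustrative placeholders for a trained classifier and a quantized LLM.

```python
# Hybrid routing sketch: confident ML decisions take the cheap path,
# ambiguous emails escalate to the (slower, more contextual) LLM.
def route_email(email: str, ml_confidence, llm_classify, threshold: float = 0.9):
    """Return (verdict, path). `ml_confidence` yields (label, probability)."""
    label, prob = ml_confidence(email)
    if prob >= threshold:
        return label, "ml"             # cheap path: confident ML decision
    return llm_classify(email), "llm"  # expensive path: contextual reasoning

# Stubs standing in for trained models:
ml_stub  = lambda e: ("phishing", 0.97) if "password" in e else ("legitimate", 0.55)
llm_stub = lambda e: "phishing" if "gift card" in e else "legitimate"

print(route_email("Reset your password now", ml_stub, llm_stub))
# → ('phishing', 'ml')
print(route_email("Please buy gift cards for the CEO", ml_stub, llm_stub))
# → ('phishing', 'llm')
```

The threshold is the cost-accuracy dial: raising it sends more traffic to the LLM (better handling of subtle cues, higher compute cost), while lowering it keeps more decisions on the fast path.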
Future efforts in this field should concentrate on developing LLMs specifically trained on phishing-centric datasets, enhancing adversarial defense mechanisms, and integrating real-time threat intelligence. The paper, available at Phishing Detection in the Gen-AI Era: Quantized LLMs vs Classical Models, offers a practical roadmap for integrating explainable and efficient AI into contemporary cybersecurity frameworks, paving the way for more robust and adaptable phishing defense systems in the Gen-AI era.