Large Language Models Fall Short in Password Cracking, Study Finds

TLDR: A new study empirically investigates how well pre-trained Large Language Models (LLMs) such as TinyLlama, Falcon-RW-1B, and Flan-T5-Small can crack passwords when given synthetic user profiles. The LLMs performed consistently poorly, achieving less than 1.5% accuracy at Hit@10, far below traditional rule-based methods, which achieved over 30%. The findings suggest that current LLMs lack the domain adaptation and memorization capabilities required for effective password inference unless they are fine-tuned on leaked password datasets.

Large Language Models (LLMs) have shown strong abilities in understanding and generating human language, leading many to wonder about their potential in other fields, including cybersecurity. One area of particular interest is password guessing, a crucial task in digital forensics and penetration testing. However, a recent study titled “When Intelligence Fails: An Empirical Study on Why LLMs Struggle with Password Cracking” by Mohammad Abdul Rehman, Syed Imad Ali Shah, Abbas Anwar, and Noor Islam reveals that current LLMs fall significantly short at this specific task.

Traditionally, password cracking has relied on methods that use predefined rules, statistical patterns, and combinations derived from leaked password datasets. These methods, while seemingly rigid, have proven highly effective because they are built upon real-world human password creation habits and common mutation strategies.

The researchers set out to investigate whether modern, pre-trained LLMs, without any specific training for password guessing, could compete with these established techniques. They evaluated three popular open-source LLMs: TinyLlama, Falcon-RW-1B, and Flan-T5-Small. To test them, they created a synthetic dataset of 20,000 user profiles, each containing detailed attributes like name, birthdate, hobbies, and email address. The LLMs were then prompted to generate ten likely passwords for each profile based on this information.
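To make the setup concrete, here is a minimal sketch of how a profile might be turned into a prompt and how a model's reply might be parsed into a list of guesses. The `Profile` fields, the prompt wording, and both helper functions are hypothetical illustrations; the article does not give the study's actual prompt template or parsing logic.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Profile:
    # Hypothetical schema based on the attributes named in the article.
    name: str
    birthdate: str  # e.g. "1990-04-12"
    hobbies: List[str]
    email: str


def build_prompt(profile: Profile, k: int = 10) -> str:
    """Format a profile as a plain-text prompt asking for k candidate passwords."""
    return (
        f"Name: {profile.name}\n"
        f"Birthdate: {profile.birthdate}\n"
        f"Hobbies: {', '.join(profile.hobbies)}\n"
        f"Email: {profile.email}\n"
        f"List {k} likely passwords for this user, one per line."
    )


# Matches a leading list marker such as "1. ", "2) ", "- " or "* ".
_LIST_MARKER = re.compile(r"^\s*(?:\d+[.)]|[-*])\s+")


def parse_guesses(completion: str, k: int = 10) -> List[str]:
    """Return up to k unique, non-empty guesses from a model completion."""
    guesses: List[str] = []
    for line in completion.splitlines():
        candidate = _LIST_MARKER.sub("", line).strip()
        if candidate and candidate not in guesses:
            guesses.append(candidate)
        if len(guesses) == k:
            break
    return guesses
```

The prompt string would be passed to whichever model is under test, and the parsed list is what a top-ten metric would then score.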

The results were striking. The study found that all the tested LLMs performed very poorly, achieving less than 1.5% accuracy even when all of their top ten guesses were counted (the Hit@10 metric) for plaintext passwords. Their success rate dropped to near zero when attempting to match SHA-256 hashed passwords, a setting meant to simulate real-world hashed password verification. In stark contrast, traditional rule-based and combinator-based cracking methods demonstrated significantly higher success rates, often exceeding 30% Hit@10 accuracy.
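The Hit@10 metric can be sketched as follows: a profile counts as a hit if the true password appears among the first ten guesses. This is an illustrative implementation under that common definition, not the authors' evaluation code; the function names and the unsalted-hash assumption are ours.

```python
import hashlib
from typing import List, Sequence


def sha256_hex(password: str) -> str:
    """Hex digest of the UTF-8 encoded password (unsalted, by assumption)."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


def hit_at_k(guesses: Sequence[str], target: str, k: int = 10, hashed: bool = False) -> bool:
    """True if the target is matched by any of the first k guesses.

    When hashed=True, `target` is a SHA-256 hex digest and each guess is
    hashed before comparison.
    """
    top = guesses[:k]
    if hashed:
        return any(sha256_hex(g) == target for g in top)
    return target in top


def hit_rate(all_guesses: List[Sequence[str]], targets: List[str],
             k: int = 10, hashed: bool = False) -> float:
    """Fraction of profiles whose target is matched within the top k guesses."""
    if not targets:
        return 0.0
    hits = sum(hit_at_k(g, t, k, hashed) for g, t in zip(all_guesses, targets))
    return hits / len(targets)
```

A hash comparison succeeds only on an exact, character-for-character match, so a near miss such as wrong capitalization earns no credit.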

This significant disparity highlights a fundamental limitation of general-purpose LLMs in this specialized domain. The researchers explain that while LLMs are excellent at generating grammatically correct and semantically plausible text, they struggle to infer the precise, often idiosyncratic, patterns humans use to create passwords. They lack the specific domain adaptation and memorization capabilities needed to understand and replicate common password transformations, such as appending birth years, inserting special characters, or capitalizing specific words.
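The transformations named above are easy to express as explicit rules, which helps explain why rule-based tools do well here. The sketch below is a toy generator that combines capitalization, year suffixes, and special characters. The specific suffix and symbol lists are assumptions chosen for illustration and are not taken from the study or from any real cracking tool.

```python
from itertools import product
from typing import List, Optional


def rule_based_candidates(name: str, birth_year: int,
                          limit: Optional[int] = 10) -> List[str]:
    """Generate candidates by combining simple, hand-written mutation rules."""
    bases = [name.lower(), name.capitalize()]  # capitalization variants
    suffixes = [str(birth_year), str(birth_year)[-2:], "123", ""]  # year or digit suffixes
    specials = ["", "!", "@"]  # optional trailing special character
    candidates: List[str] = []
    for base, suffix, special in product(bases, suffixes, specials):
        candidate = f"{base}{suffix}{special}"
        if candidate not in candidates:
            candidates.append(candidate)
        if limit is not None and len(candidates) == limit:
            break
    return candidates
```

For the name "alice" and birth year 1990, the generator yields strings such as "alice1990" and "Alice1990!", the kind of exact pattern the article says general-purpose LLMs fail to reproduce reliably.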

The study concludes that despite their impressive linguistic abilities, current LLMs, when used out-of-the-box, are not suitable replacements for traditional password cracking tools. Their tendency to generate plausible but imprecise guesses, and their failure to model real-world password transformation patterns, limit their practical utility in cybersecurity applications like digital forensics or penetration testing.

However, the research also points to future possibilities. The authors suggest that fine-tuning LLMs on large-scale password datasets (while carefully considering ethical and legal implications) could significantly improve their performance. Additionally, integrating user-specific transformation patterns through advanced prompting or combining LLMs with rule-based filtering mechanisms could lead to more effective hybrid approaches. This study provides crucial insights into the boundaries of LLM generalization and sets the stage for future research into more robust and secure password modeling.
