
Large Language Models Fall Short in Password Cracking, Study Finds

TLDR: A new study empirically investigates how well pre-trained Large Language Models (LLMs) such as TinyLlama, Falcon-RW-1B, and Flan-T5-Small perform at password cracking, using synthetic user profiles. The LLMs performed consistently poorly, achieving less than 1.5% accuracy at Hit@10, far below traditional rule-based methods, which achieved over 30%. The findings suggest that current LLMs lack the domain adaptation and memorization capabilities required for effective password inference unless they are fine-tuned on leaked password datasets.

Large Language Models (LLMs) have shown strong abilities in understanding and generating human language, prompting questions about their potential in other fields, including cybersecurity. One area of particular interest is password guessing, a common task in digital forensics and penetration testing. However, a recent study, “When Intelligence Fails: An Empirical Study on Why LLMs Struggle with Password Cracking” by Mohammad Abdul Rehman, Syed Imad Ali Shah, Abbas Anwar, and Noor Islam, finds that current LLMs fall well short on this task.

Traditionally, password cracking has relied on methods that use predefined rules, statistical patterns, and combinations derived from leaked password datasets. These methods, while seemingly rigid, have proven highly effective because they are built upon real-world human password creation habits and common mutation strategies.

The researchers set out to investigate whether modern, pre-trained LLMs, without any specific training for password guessing, could compete with these established techniques. They evaluated three popular open-source LLMs: TinyLlama, Falcon-RW-1B, and Flan-T5-Small. To test them, they created a synthetic dataset of 20,000 user profiles, each containing detailed attributes like name, birthdate, hobbies, and email address. The LLMs were then prompted to generate ten likely passwords for each profile based on this information.
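To make the setup concrete, here is a minimal sketch of how a profile might be turned into a prompt and how a model's raw text reply might be parsed into a list of guesses. The `Profile` fields mirror the attributes named above, but the prompt wording, the `build_prompt` and `parse_guesses` helpers, and the parsing rules are illustrative assumptions, not the study's actual prompt or code. The call to a model is deliberately left out.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Profile:
    """Hypothetical synthetic profile; field names are assumptions."""
    name: str
    birthdate: str
    hobbies: List[str]
    email: str


def build_prompt(profile: Profile, n: int = 10) -> str:
    """Build an illustrative prompt asking for n password guesses."""
    return (
        f"Name: {profile.name}\n"
        f"Birthdate: {profile.birthdate}\n"
        f"Hobbies: {', '.join(profile.hobbies)}\n"
        f"Email: {profile.email}\n"
        f"List {n} likely passwords for this synthetic user, one per line."
    )


def parse_guesses(raw_output: str, n: int = 10) -> List[str]:
    """Split a model reply into at most n guesses.

    Strips list markers such as "1.", "2)" or "-" and skips blank lines.
    Caveat: a guess that itself starts with such a marker would be mangled.
    """
    guesses = []
    for line in raw_output.splitlines():
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*])\s*", "", line).strip()
        if cleaned:
            guesses.append(cleaned)
    return guesses[:n]
```

Keeping prompt construction and output parsing separate from the model call makes it straightforward to run the same profiles through several models and compare their guess lists.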

The results were striking. All of the tested LLMs performed very poorly, achieving less than 1.5% accuracy on plaintext passwords even when a match anywhere among their top ten guesses counted as a success (the Hit@10 metric). Their success rate dropped to near zero when the guesses were matched against SHA-256 hashes of the passwords, a setting that simulates how real systems verify hashed rather than plaintext credentials. In stark contrast, traditional rule-based and combinator-based cracking methods achieved far higher success rates, often exceeding 30% Hit@10 accuracy.
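The Hit@10 metric can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the study's code: the function names `sha256_hex` and `hit_at_k` are hypothetical, and the hashed variant assumes unsalted SHA-256 over the UTF-8 bytes of each guess.

```python
import hashlib
from typing import List, Sequence


def sha256_hex(password: str) -> str:
    """Return the hex SHA-256 digest of a password (unsalted; an assumption)."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


def hit_at_k(
    guesses_per_profile: Sequence[List[str]],
    targets: Sequence[str],
    k: int = 10,
    hashed: bool = False,
) -> float:
    """Fraction of profiles whose target appears among the first k guesses.

    If hashed is True, targets are assumed to be SHA-256 hex digests and
    each guess is hashed before comparison.
    """
    if not targets:
        return 0.0
    hits = 0
    for guesses, target in zip(guesses_per_profile, targets):
        top_k = guesses[:k]
        if hashed:
            top_k = [sha256_hex(g) for g in top_k]
        if target in top_k:
            hits += 1
    return hits / len(targets)
```

Under this definition, a guess only counts if it matches the target exactly, which is why near-miss guesses that look plausible contribute nothing to the score, whether the comparison is made on plaintext or on hashes.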

This significant disparity highlights a fundamental limitation of general-purpose LLMs in this specialized domain. The researchers explain that while LLMs are excellent at generating grammatically correct and semantically plausible text, they struggle to infer the precise, often idiosyncratic, patterns humans use to create passwords. They lack the specific domain adaptation and memorization capabilities needed to understand and replicate common password transformations, such as appending birth years, inserting special characters, or capitalizing specific words.
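The kinds of transformations mentioned above are simple to express as explicit rules, which helps explain why rule-based tools do well on them. The following is a toy sketch only: the `rule_based_candidates` function, its suffix list, and its ordering are assumptions chosen for illustration, and do not reproduce the rule sets used in the study or in any real cracking tool.

```python
from typing import List


def rule_based_candidates(name: str, birth_year: int, limit: int = 10) -> List[str]:
    """Generate candidates by combining capitalization and suffix rules.

    Rules (illustrative assumptions): lowercase and capitalized base word,
    followed by the full birth year, its last two digits, "123", "!",
    or nothing.
    """
    bases = [name.lower(), name.capitalize()]
    year = str(birth_year)
    suffixes = [year, year[-2:], "123", "!", ""]

    seen = set()
    candidates = []
    for base in bases:
        for suffix in suffixes:
            candidate = base + suffix
            if candidate not in seen:  # keep first occurrence, preserve order
                seen.add(candidate)
                candidates.append(candidate)
    return candidates[:limit]
```

For a profile with the name "alice" and birth year 1990, this produces guesses such as `alice1990` and `Alice90`: exact strings derived mechanically from profile attributes, which is the kind of precision the study reports general-purpose LLMs failing to achieve.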

The study concludes that, despite their impressive linguistic abilities, current LLMs used out of the box are not suitable replacements for traditional password cracking tools. Their tendency to generate plausible but imprecise guesses, and their failure to model real-world password transformation patterns, limit their practical utility in cybersecurity applications such as digital forensics and penetration testing.

However, the research also points to future possibilities. The authors suggest that fine-tuning LLMs on large-scale password datasets (while carefully considering ethical and legal implications) could significantly improve their performance. Additionally, integrating user-specific transformation patterns through advanced prompting or combining LLMs with rule-based filtering mechanisms could lead to more effective hybrid approaches. This study provides crucial insights into the boundaries of LLM generalization and sets the stage for future research into more robust and secure password modeling.

Dev Sundaram (https://blogs.edgentiq.com)
Dev Sundaram is an investigative tech journalist with a nose for exclusives and leaks. With stints in cybersecurity and enterprise AI reporting, Dev thrives on breaking big stories (product launches, funding rounds, regulatory shifts) and giving them context. He believes journalism should push the AI industry toward transparency and accountability, especially as Generative AI becomes mainstream.
