Unmasking Harm: Researchers Use Puzzles to Bypass LLM Safety Filters

TLDR: PUZZLED is a new jailbreaking method that uses word-based puzzles (word search, anagram, crossword) to bypass Large Language Model safety filters. It masks harmful keywords in instructions, embeds them into puzzles, and provides clues, forcing LLMs to use their reasoning capabilities to reconstruct the original harmful prompt. The method achieves high attack success rates (average 88.8%) across state-of-the-art LLMs while being highly efficient.

As large language models (LLMs) become more integrated into our daily lives, ensuring their safety and preventing misuse is a paramount concern. Developers like OpenAI, Anthropic, and Google have implemented robust safety mechanisms to filter out harmful or sensitive requests. However, these systems are not foolproof, and researchers continue to explore ways to circumvent them. Such methods are known as “jailbreak attacks.”

Traditional jailbreak methods often involve manipulating the surface form of prompts, such as encoding instructions, reordering tokens, or wrapping them in code. While some of these techniques have shown success, they tend to be less effective against newer LLMs equipped with stronger safety filters. These methods often don’t actively engage the model’s higher-level linguistic reasoning capabilities.

Introducing PUZZLED: A Reasoning-Driven Jailbreak

A novel approach called PUZZLED introduces a new way to bypass LLM safety mechanisms by leveraging the models’ inherent reasoning abilities. Instead of merely obscuring harmful content, PUZZLED transforms familiar word puzzles into an effective jailbreak strategy. The core idea is to mask keywords in a harmful instruction and present them as word puzzles for the LLM to solve. The model must first solve the puzzle to uncover the masked words and then proceed to generate responses to the reconstructed harmful instruction.

PUZZLED operates in three main stages:

1. Masking Keywords: Instead of just hiding explicitly harmful words, PUZZLED masks a broader range of keywords, including socially sensitive terms and other important content words. The number of masked words (between three and six) depends on the instruction’s length. It uses predefined lists of “essential” and “recommended” words, prioritizing those directly related to unsafe actions. Crucially, it uses neutral, indexed placeholders like [WORD1] instead of generic [MASK] to frame the task as a puzzle, not censorship.

2. Puzzle Construction: PUZZLED employs three well-known linguistic puzzle formats, adapted for LLM input:

Word Search: Masked words are hidden within a grid of letters, requiring the LLM to perform character-level recognition and directional reasoning.
Anagram: All masked words are concatenated into a single string, and their characters are scrambled. This increases difficulty and obfuscates the original content, forcing the LLM to perform complex cognitive inference.
Crossword: This variant simulates the intersecting nature of crosswords by replacing shared letters among masked words with unique symbols (e.g., #, *, @). Solving one word reveals symbol-to-letter mappings, aiding in the inference of other words.

3. Providing Clues: To assist the LLM, PUZZLED provides clues for each masked word. Each clue includes the word length, part-of-speech information, and an indirect semantic description. These clues are carefully crafted to be euphemistic and indirect, guiding the model without directly exposing harmful terms. Clues are generated using a powerful LLM (GPT-4o) and are cached for reuse to minimize computational overhead.

Effectiveness and Efficiency

PUZZLED was evaluated on five state-of-the-art LLMs: GPT-4.1, GPT-4o, Claude 3.7 Sonnet, Gemini 2.0 Flash, and LLaMA 3.1 8B Instruct, using datasets like AdvBench and JBB-Behaviors. The results were striking, with PUZZLED achieving a high average attack success rate (ASR) of 88.8%. Specifically, it reached 96.5% on GPT-4.1 and 92.3% on Claude 3.7 Sonnet, consistently outperforming existing jailbreak methods.
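As a rough illustration of the metric, the sketch below computes an attack success rate as the share of attempts judged successful, and a simple mean across models. The function names, the example judgments, and the choice of an unweighted mean are assumptions made for illustration; the article does not say how the paper aggregates its 88.8% average, and none of the numbers below come from the paper.

```python
from typing import Mapping, Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Return the percentage of attempts that were judged successful."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return 100.0 * sum(outcomes) / len(outcomes)


def mean_asr(per_model_asr: Mapping[str, float]) -> float:
    """Unweighted mean of per-model ASR values (an assumed aggregation)."""
    if not per_model_asr:
        raise ValueError("per_model_asr must not be empty")
    return sum(per_model_asr.values()) / len(per_model_asr)


# Made-up judgments for one hypothetical model: 3 of 4 attempts succeed.
hypothetical_outcomes = [True, True, False, True]
print(attack_success_rate(hypothetical_outcomes))  # 75.0

# Made-up per-model values, only to show the aggregation step.
print(mean_asr({"model_a": 80.0, "model_b": 90.0}))  # 85.0
```

Under this reading, a per-model figure such as 96.5% would mean that roughly 96 or 97 of every 100 evaluated prompts were judged to have produced a harmful response.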

The method’s success is attributed to its ability to force LLMs into multi-step reasoning. For instance, in a crossword puzzle, the model infers symbol-to-alphabet mappings from one clue and uses them to uncover other words. This demonstrates that PUZZLED actively creates an internal reasoning pipeline within the model, turning the LLM’s cognitive capabilities against its own safety mechanisms.

Beyond its effectiveness, PUZZLED also boasts strong efficiency. Unlike many existing approaches that require multiple LLM calls for prompt generation, PUZZLED relies on a rule-based pipeline for masking and puzzle creation. LLM calls are only made when generating a clue for a new masked token, keeping the total number of calls extremely low. This balance between cost and performance makes PUZZLED a highly practical attack strategy.
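The cost argument comes down to ordinary memoization: an expensive generator runs only on a cache miss. The sketch below is a generic illustration of that idea and not the paper's implementation. `ClueCache`, `stub_generator`, and the lowercase cache key are all hypothetical, and the stub stands in for what the article describes as a GPT-4o call.

```python
from typing import Callable, Dict


class ClueCache:
    """Memoize an expensive per-word generator and count how often it runs."""

    def __init__(self, generate: Callable[[str], str]) -> None:
        self._generate = generate
        self._store: Dict[str, str] = {}
        self.calls = 0  # number of times the expensive generator was invoked

    def get(self, word: str) -> str:
        key = word.lower()  # assumption: clues are case-insensitive
        if key not in self._store:
            self.calls += 1
            self._store[key] = self._generate(key)
        return self._store[key]


def stub_generator(word: str) -> str:
    """Hypothetical stand-in for a model call; returns only the word length."""
    return f"{len(word)}-letter word"


cache = ClueCache(stub_generator)
for w in ["river", "garden", "River", "garden"]:
    cache.get(w)

print(cache.calls)  # 2: only the two distinct words triggered the generator
```

With a cache like this, the number of generator calls grows with the number of distinct words seen rather than with the number of prompts processed, which is the efficiency property the article attributes to the method.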

The research suggests that LLMs with stronger safety filters might paradoxically be more vulnerable to indirect attacks that exploit their reasoning capabilities. The performance of PUZZLED also varied with model scale, with larger models like GPT-4.1 showing higher ASR, indicating their ability to perform more refined reconstruction through the puzzle structure.

This approach highlights a critical area for future LLM safety research, demonstrating that even familiar human puzzles can be turned into effective tools for bypassing AI safeguards. For more details, see the full research paper.
