TL;DR: A new research paper introduces Balanced Structural Decomposition (BSD), a method for jailbreaking Multimodal Large Language Models (MLLMs). BSD bypasses AI safety filters by crafting prompts that balance relevance to the malicious goal with subtle out-of-distribution signals and visual cues. The approach significantly increases attack success rates and the harmfulness of generated outputs across a range of MLLMs, exposing critical weaknesses in current AI safety mechanisms. The paper also proposes a four-axis evaluation framework for assessing jailbreak effectiveness more accurately.
Multimodal Large Language Models (MLLMs) are powerful AI systems that combine visual and textual information, enabling them to perform tasks like image captioning and visual question answering. However, a significant concern is their vulnerability to adversarial prompts, often referred to as ‘jailbreaks,’ which can bypass safety mechanisms and lead to the generation of harmful content.
While previous jailbreak strategies have reported high success rates, a closer look reveals that many of these ‘successful’ responses are actually benign, vague, or unrelated to the intended malicious goal. This suggests that current evaluation methods might be overestimating the true effectiveness of such attacks. To address this, researchers have introduced a new four-axis evaluation framework. This framework assesses input ‘on-topicness’ (how relevant the prompt is to the malicious goal), input ‘out-of-distribution (OOD) intensity’ (how novel or unusual the prompt is), output ‘harmfulness,’ and output ‘refusal rate.’ This comprehensive approach helps identify truly effective jailbreaks.
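As a toy illustration (not code from the paper), the four axes can be captured in a small record, with an attack counted as a genuine jailbreak only when the output is harmful, the input stays on topic, and the model does not refuse. All field names and thresholds below are assumptions for the sketch:

```python
from dataclasses import dataclass


@dataclass
class AxisScores:
    """Hypothetical container for the four evaluation axes (names assumed)."""
    on_topicness: float   # input relevance to the malicious goal, in [0, 1]
    ood_intensity: float  # how unusual / out-of-distribution the input is, in [0, 1]
    harmfulness: float    # harm level of the model's output, in [0, 1]
    refused: bool         # whether the model refused to answer


def is_effective_jailbreak(s: AxisScores,
                           topic_min: float = 0.5,
                           harm_min: float = 0.5) -> bool:
    """A response only counts as a real jailbreak if it is not a refusal,
    stays relevant to the goal, and is actually harmful -- filtering out
    the benign or off-topic 'successes' the paper warns about."""
    return (not s.refused
            and s.on_topicness >= topic_min
            and s.harmfulness >= harm_min)
```

Under this kind of filter, a vague or unrelated response scores low on `on_topicness` or `harmfulness` and is no longer counted as a success, which is the over-counting problem the framework is meant to correct.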
A key finding from their empirical study is a structural trade-off: prompts that are highly relevant to the malicious goal are often blocked by safety filters. Conversely, prompts that are too unusual or ‘out-of-distribution’ might evade detection but fail to produce genuinely harmful content. The sweet spot for effective jailbreaking lies in prompts that strike a balance between relevance and novelty, making them more likely to bypass filters and trigger dangerous outputs.
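The trade-off can be sketched with a toy scoring rule (my own illustration, not the paper's formula): reward prompts that remain relevant while sitting at a moderate, rather than extreme, distance from the training distribution:

```python
def balance_score(on_topic: float, ood: float,
                  ood_sweet_spot: float = 0.5) -> float:
    """Toy scoring rule for the relevance/novelty trade-off.
    Fully on-distribution prompts (ood near 0) tend to be caught by
    safety filters; fully OOD prompts (ood near 1) evade filters but
    stop eliciting genuinely harmful content. The sweet-spot value
    0.5 is an assumption for illustration."""
    return on_topic * (1.0 - abs(ood - ood_sweet_spot))
```

With this rule, a highly relevant prompt at moderate OOD intensity outscores both an ordinary-looking prompt and a maximally unusual one, mirroring the structural trade-off described above.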
Building on this insight, the researchers developed a novel recursive rewriting strategy called Balanced Structural Decomposition (BSD). The approach restructures malicious prompts into semantically aligned sub-tasks while introducing subtle out-of-distribution signals and visual cues, making the inputs harder for safety systems to detect. BSD works by recursively breaking down an initial malicious prompt into smaller, related sub-tasks. It uses ‘Explore Scores’ to find the best way to split a prompt and ‘Exploit Scores’ to decide which sub-tasks to expand further, ensuring they remain aligned with the original malicious objective. It also incorporates ‘deception images’ generated by text-to-image models, pairing one with each sub-task to reinforce its purpose while subtly shifting the input distribution, which further aids evasion.
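The control flow described above can be sketched as a short recursive loop. This is a structural illustration only: the real Explore and Exploit scores are model-based, whereas the stand-ins below use trivial heuristics (split balance and word overlap), and image generation is reduced to a string placeholder. All function names and thresholds are assumptions:

```python
def explore_score(goal: str, split: tuple[str, str]) -> float:
    """Stand-in for the Explore Score: here it simply prefers balanced
    splits. The actual method rates candidate decompositions with a model."""
    a, b = (len(part.split()) for part in split)
    return 1.0 - abs(a - b) / (a + b)


def exploit_score(goal: str, subtask: str) -> float:
    """Stand-in for the Exploit Score: crude word overlap with the goal,
    approximating 'is this sub-task still aligned with the objective?'."""
    goal_words = set(goal.lower().split())
    sub_words = set(subtask.lower().split())
    return len(goal_words & sub_words) / max(len(goal_words), 1)


def decompose(goal: str, prompt: str, depth: int = 0,
              max_depth: int = 2, exploit_min: float = 0.2) -> list[str]:
    """Recursively split a prompt into sub-tasks: Explore picks the best
    split point; Exploit decides which sub-tasks to expand further."""
    words = prompt.split()
    if depth >= max_depth or len(words) < 4:
        return [prompt]
    candidates = [(" ".join(words[:i]), " ".join(words[i:]))
                  for i in range(1, len(words))]
    best = max(candidates, key=lambda s: explore_score(goal, s))
    leaves: list[str] = []
    for sub in best:
        if exploit_score(goal, sub) >= exploit_min:
            leaves.extend(decompose(goal, sub, depth + 1, max_depth, exploit_min))
        else:
            leaves.append(sub)  # misaligned sub-task: keep but stop expanding
    return leaves


def build_multimodal_prompt(goal: str, prompt: str) -> list[tuple[str, str]]:
    """Pair each sub-task with a placeholder for its text-to-image
    'deception image' (image generation itself is not sketched here)."""
    return [(sub, f"deception_image_for({sub!r})")
            for sub in decompose(goal, prompt)]
```

Note that the recursion only ever partitions the prompt, so the sub-tasks jointly cover the original text; the deception-image pairing then turns each sub-task into a text-plus-image input.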
The BSD strategy was tested across 13 commercial and open-source MLLMs, including GPT-4o, Gemini 2.5 Pro, and various Qwen and InternVL models. BSD consistently achieved higher attack success rates, more harmful outputs, and fewer refusals than previous methods, improving success rates by 67% and harmfulness by 21% and exposing a previously underestimated weakness in current multimodal safety systems.
This research provides crucial insights into the vulnerabilities of MLLMs and offers a new framework for evaluating jailbreak effectiveness. The findings underscore the need for more robust defenses that go beyond simple surface-level input filtering to truly secure these advanced AI models. For more technical details, you can refer to the original research paper.


