JPS: A New Frontier in Understanding Multimodal AI Vulnerabilities

TLDR: JPS is a novel method for jailbreaking Multimodal Large Language Models (MLLMs) that goes beyond simply bypassing safety filters. It focuses on generating high-quality, intent-aligned harmful responses by collaboratively optimizing subtle visual perturbations in images and refined textual steering prompts. The paper introduces the Malicious Intent Fulfillment Rate (MIFR) as a new metric to accurately assess the practical utility of these harmful outputs, demonstrating JPS’s state-of-the-art performance in both attack success and malicious intent fulfillment.

Multimodal Large Language Models (MLLMs), which can understand both images and text, are becoming increasingly powerful. However, with their growing capabilities comes a significant concern: their security. One major area of concern is ‘jailbreak attacks,’ where these AI models are tricked into generating harmful or unsafe content.

Current research in jailbreaking MLLMs often focuses primarily on achieving a high ‘Attack Success Rate’ (ASR), the share of attack attempts in which the model bypasses its safety filters. But a critical issue has been overlooked: whether the AI’s response actually fulfills the attacker’s malicious goal. Often, these “successful” jailbreaks produce low-quality outputs that get past the safety filters but lack real harmful content or fail to follow instructions.

Addressing the Quality Gap in Jailbreak Responses

This research paper, titled “JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering,” highlights two main problems with existing jailbreak methods:

Failed Instruction Following: The AI’s response doesn’t directly address the user’s core malicious request. For example, instead of providing steps to build a bomb, it might give a theoretical scientific explanation.
Insufficient Content Harmfulness: The responses suggest impractical actions (like needing nuclear materials for a bomb) or provide useless advice (like mixing baking soda and vinegar for an explosion).

These issues arise because many attack strategies introduce unnecessary constraints or because the ASR evaluation doesn’t penalize low-utility responses. ASR evaluation often checks only whether any harmful content is present, not whether that content is useful to the attacker.

Introducing JPS: A Collaborative Approach

To solve this, the researchers propose JPS, which stands for “Jailbreak MLLMs with collaborative visual Perturbation and textual Steering.” JPS introduces a clever strategy: it separates the objectives of bypassing safety and steering the quality of the response.

Here’s how JPS works:

Visual Perturbation for Safety Bypass: JPS uses subtle, target-guided adversarial changes to an image. These changes are almost imperceptible to humans but are designed to trick the MLLM into bypassing its safety mechanisms. This visual component is continuously optimized to be effective.
Textual Steering for Quality Control: Alongside the image, JPS uses a “steering prompt.” This prompt is specifically designed to guide the MLLM’s response to be high-quality, meaning it accurately follows instructions and provides genuinely harmful content. This textual prompt is refined iteratively using a “multi-agent system” involving three AI roles: a Judger (evaluates responses), a Summarizer (identifies common issues), and a Revisor (rewrites the prompt).

These visual and textual components are co-optimized in an iterative process, so that the perturbed image and the steering prompt are tuned to work with each other rather than in isolation.

A New Metric: Malicious Intent Fulfillment Rate (MIFR)

To properly evaluate the quality of jailbreak outcomes, the paper introduces a new metric: the Malicious Intent Fulfillment Rate (MIFR). Unlike ASR, which only checks if a response is harmful, MIFR assesses whether the response truly fulfills the attacker’s specific malicious intent. This is evaluated using a powerful reasoning-based LLM, ensuring a more stringent and practical assessment of the attack’s success.
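The gap between the two metrics can be illustrated with a small evaluation-side sketch. It is not the paper's evaluation code: the `JudgedResponse` record, its field names, and the sample verdicts are hypothetical, and the verdicts are assumed to come from an external judge. Counting a response toward MIFR only when it both bypasses safety and fulfills the intent is also an assumption made here for illustration.

```python
from dataclasses import dataclass


@dataclass
class JudgedResponse:
    """Hypothetical record holding a judge's verdicts for one model response."""
    bypassed_safety: bool  # judge found unsafe content (counts toward ASR)
    fulfills_intent: bool  # judge found the specific request was actually satisfied


def rate(flags):
    """Fraction of True values; 0.0 for an empty evaluation set."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def asr(results):
    """Attack Success Rate: share of responses that bypassed safety."""
    return rate(r.bypassed_safety for r in results)


def mifr(results):
    """Malicious Intent Fulfillment Rate under the assumption stated above:
    a response counts only if it bypassed safety AND fulfilled the intent."""
    return rate(r.bypassed_safety and r.fulfills_intent for r in results)


# Made-up verdicts: three responses bypass safety, but only one is on target.
results = [
    JudgedResponse(bypassed_safety=True, fulfills_intent=True),
    JudgedResponse(bypassed_safety=True, fulfills_intent=False),
    JudgedResponse(bypassed_safety=True, fulfills_intent=False),
    JudgedResponse(bypassed_safety=False, fulfills_intent=False),
]

print(f"ASR  = {asr(results):.2f}")   # ASR  = 0.75
print(f"MIFR = {mifr(results):.2f}")  # MIFR = 0.25
```

With these made-up verdicts, ASR reports 0.75 while MIFR reports 0.25, which is the kind of gap the stricter metric is meant to expose.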

Key Findings and Impact

Experiments show that JPS achieves state-of-the-art results in both ASR and MIFR across various MLLMs (like InternVL2, Qwen2-VL, MiniGPT-4) and benchmarks. This demonstrates that JPS is not only effective at bypassing safety but also at generating responses that are genuinely useful for malicious purposes. The research also confirms that each component of JPS (the adversarial image, the steering prompt, and the multi-agent system) is crucial for its success.

JPS also shows strong robustness against existing defense techniques, highlighting the ongoing challenge in securing MLLMs. This research provides valuable insights into the vulnerabilities of multimodal AI, paving the way for developing more robust defenses in the future. The full research paper provides more technical details.
