TLDR: A new research paper introduces Cross-Modal Preference Steering (CPS), a method that can subtly manipulate AI web agents by making imperceptible changes to both images and text descriptions of online content. This black-box attack, which doesn’t require access to the agent’s internal workings, significantly biases agent selections (e.g., product choices, movie recommendations) while remaining largely undetected, highlighting a critical security vulnerability in AI-powered decision-making systems.
AI-powered web agents are becoming increasingly common, taking on tasks that range from recommending movies to selecting products online. These agents are designed to make decisions on behalf of users, often by combining what they see (images) with what they read (textual descriptions). However, new research reveals a significant vulnerability: these agents can be subtly manipulated through a novel attack called Cross-Modal Preference Steering (CPS).
Traditionally, attacks on AI systems have often relied on unrealistic assumptions, such as having full access to the AI model’s internal workings (white-box access) or complete control over the webpages themselves. These requirements severely limit their practical application in the real world. The new research, however, focuses on a much more realistic scenario: an attacker who acts like a regular content publisher, able to edit only their own listing’s images and text, without any insight into the agent’s underlying AI model.
Understanding Cross-Modal Preference Steering (CPS)
CPS is a new attack framework that jointly exploits two fundamental vulnerabilities in Vision-Language Model (VLM)-based web agents: visual perception and textual interpretation. By making imperceptible modifications to both an item's images and its natural-language description, CPS can effectively steer an agent's decisions toward a targeted item.
One key vulnerability CPS exploits is visual. Many VLMs rely on similar image encoders, creating a shared weakness. The researchers used Projected Gradient Descent (PGD) to craft tiny image perturbations that are invisible to humans yet drastically alter how an AI agent perceives the image. For example, a photo of an "apple" can be made to register as an "orange" to the AI, with no visible change to a human observer. The attack transfers even to black-box commercial models like GPT-4.1 and works across various image resolutions.
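To make the visual step concrete, here is a minimal PGD sketch in PyTorch. It assumes an L-infinity pixel budget and a surrogate image encoder; the stand-in encoder, epsilon, and step counts below are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def pgd_steer(image, encoder, target_emb, eps=8/255, alpha=1/255, steps=50):
    """L-infinity PGD: nudge `image` so the encoder's embedding drifts
    toward `target_emb` (e.g. the embedding of "orange" for an apple
    photo), keeping every pixel within `eps` of the original."""
    orig = image.clone().detach()
    adv = orig.clone()
    for _ in range(steps):
        adv.requires_grad_(True)
        emb = encoder(adv)
        # Maximize cosine similarity to the target concept's embedding.
        loss = F.cosine_similarity(emb, target_emb, dim=-1).mean()
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv + alpha * grad.sign()             # ascend similarity
            adv = orig + (adv - orig).clamp(-eps, eps)  # project into eps-ball
            adv = adv.clamp(0.0, 1.0)                   # stay a valid image
    return adv.detach()

# Illustrative usage with a stand-in encoder; a real attack would use a
# shared surrogate such as a CLIP image tower to transfer to black-box VLMs.
encoder = torch.nn.Sequential(torch.nn.Flatten(),
                              torch.nn.Linear(3 * 224 * 224, 512))
image = torch.rand(1, 3, 224, 224)
target_emb = torch.randn(1, 512)  # embedding of the target concept
adv_image = pgd_steer(image, encoder, target_emb)
print((adv_image - image).abs().max())  # perturbation stays within eps
```

Because many VLMs share similar encoder architectures, a perturbation crafted against one surrogate often transfers to others, which is what makes the black-box setting feasible.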
The second vulnerability CPS targets is textual. AI models, especially those trained with Reinforcement Learning from Human Feedback (RLHF), inadvertently develop systematic preferences for specific linguistic patterns and stylistic elements. Attackers can exploit these biases by crafting descriptions that subtly appeal to the agent’s learned preferences, without triggering any detection mechanisms. This means carefully chosen words and phrases can make an item seem more appealing to the AI.
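As a toy illustration of how such stylistic biases might be probed, the sketch below ranks candidate descriptions by how often a simulated agent picks the listing. The `query_agent` stub, the competitor text, and the simulated length/superlative bias are all hypothetical stand-ins for black-box queries to a real agent, not the paper's method.

```python
import random

# Stand-in for the black-box agent. A real attack would present the listing
# alongside competitors through the agent's interface and parse which item
# it selects; here we simulate a noisy bias toward longer, superlative-heavy
# text purely for illustration.
def query_agent(descriptions):
    scores = [len(d) + 20 * d.lower().count("best") + random.gauss(0, 5)
              for d in descriptions]
    return max(range(len(descriptions)), key=lambda i: scores[i])

COMPETITOR = "Wireless earbuds. 24-hour battery."

CANDIDATES = [
    "Wireless earbuds with 24-hour battery life.",
    "Expertly engineered wireless earbuds, a best-in-class pick with "
    "reliable all-day (24-hour) battery life.",
]

def preference_rate(description, trials=20):
    """Fraction of trials in which the agent selects our listing (index 0)."""
    wins = sum(query_agent([description, COMPETITOR]) == 0
               for _ in range(trials))
    return wins / trials

print(max(CANDIDATES, key=preference_rate))
```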
The power of CPS lies in the synergy of these two approaches. Coordinating visual perturbations with textual modifications amplifies the attack well beyond what either single-modal approach could achieve alone. The researchers used GPT-4.1 as an "attacker model" in a feedback loop, iteratively refining both the injected visual concept (e.g., making an image connote "best choice") and the textual description, preserving semantic consistency while maximizing the manipulation's effect.
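A hedged sketch of what such a coordinated loop could look like, reusing `pgd_steer` from the snippet above; `attacker_rewrite` and `agent_prefers_target` are hypothetical placeholders for the attacker-LLM call (GPT-4.1 in the paper) and the black-box agent queries.

```python
import random

def attacker_rewrite(text):
    # Placeholder for an attacker-LLM call that rewrites the description
    # while preserving its meaning (e.g. via a chat-completions API).
    return text + " A standout, best-in-class option."

def agent_prefers_target(image, text, trials=10):
    # Placeholder for querying the victim agent and measuring how often
    # it selects the modified listing over its competitors.
    return random.random()

def cps_attack(image, text, encoder, target_emb, rounds=5):
    """Coordinate both modalities: perturb the image toward the target
    concept, let the attacker model rewrite the description, and keep
    whichever (image, text) pair the agent prefers most."""
    best_image, best_text, best_rate = image, text, 0.0
    for _ in range(rounds):
        image = pgd_steer(image, encoder, target_emb)  # visual step
        text = attacker_rewrite(text)                  # textual step
        rate = agent_prefers_target(image, text)       # black-box feedback
        if rate > best_rate:
            best_image, best_text, best_rate = image, text, rate
    return best_image, best_text
```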
Real-World Impact and Stealth
The researchers evaluated CPS on agents powered by state-of-the-art VLMs, including GPT-4.1, Qwen-2.5VL, and Pixtral-Large, across tasks like movie selection and e-commerce. The results were striking: CPS consistently outperformed leading baseline methods, achieving preference manipulation rates as high as 71%. In other words, the attack biased the agent's selection toward the targeted item in the majority of trials.
Crucially, CPS also proved remarkably stealthy. While other manipulation methods were flagged at rates as high as 98%, CPS kept its detection rate between 18.5% and 26.0%. Even AI detectors explicitly prompted to look for manipulation struggled to identify CPS's subtle changes, exposing a critical gap in current defense mechanisms.
The findings underscore an urgent need for robust defenses against such sophisticated attacks. As AI agents become more integrated into our daily lives, mediating high-stakes decisions that affect market fairness, user autonomy, and information integrity, the ability of any content publisher to invisibly steer an agent's preferences poses immediate and serious risks. This work, detailed further in the research paper Cross-Modal Content Optimization for Steering Web Agent Preferences, is a vital step toward understanding and securing the next generation of autonomous web agents.