TLDR: A new method called Prompt-in-Image embeds text instructions directly into images for Vision-Language Models (VLMs). It significantly improves Qwen2.5-VL’s accuracy and reduces hallucination by tightening cross-modal alignment. However, it severely degrades the performance of LLaVA-1.5 and InstructBLIP: their CLIP-based vision encoders attend excessively to the embedded text, disrupting visual understanding.
Vision-Language Models, or VLMs, are advanced artificial intelligence systems that can understand and process both images and text. They are behind many impressive applications, from describing photos to answering questions about visual content. However, these models often struggle with a significant problem known as “hallucination.” This is when a VLM generates information that isn’t actually present in the image, like describing objects that don’t exist or misinterpreting visual details.
A new research paper titled “Cure or Poison? Embedding Instructions Visually Alters Hallucination in Vision-Language Models” explores a novel approach to tackle this hallucination issue. The researchers propose a simple yet intriguing method called “Prompt-in-Image.” Instead of providing text instructions separately from an image, Prompt-in-Image embeds the textual instructions directly into the image itself. This forces the VLM to process all information—both visual and textual—through its visual processing channels, potentially simplifying how the model integrates different types of information.
The core idea behind Prompt-in-Image is to eliminate the need for separate text inputs, making the model rely solely on its visual understanding. This could help overcome challenges related to aligning information from different modalities (vision and language), which is a common source of hallucination in VLMs.
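To make the setup concrete, here is a minimal sketch of what such an embedding step might look like, using Pillow. The white-strip placement, default font, and file names are illustrative assumptions; the paper’s exact rendering recipe may differ.

```python
from PIL import Image, ImageDraw, ImageFont

def embed_prompt(image_path: str, prompt: str, bar_height: int = 60) -> Image.Image:
    """Render the text prompt onto the image so the VLM receives a single visual input."""
    img = Image.open(image_path).convert("RGB")
    # Add a white strip below the original image to hold the prompt text
    # (placement and styling are illustrative choices, not the paper's exact recipe).
    canvas = Image.new("RGB", (img.width, img.height + bar_height), "white")
    canvas.paste(img, (0, 0))
    draw = ImageDraw.Draw(canvas)
    font = ImageFont.load_default()
    # No line wrapping here; long prompts would need it.
    draw.text((10, img.height + 10), prompt, fill="black", font=font)
    return canvas

combined = embed_prompt("coco_example.jpg", "Is there a dog in the image? Answer yes or no.")
combined.save("prompt_in_image.jpg")  # fed to the VLM with no separate text prompt
```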
Testing the Waters: Diverse Outcomes Across Models
To evaluate Prompt-in-Image, the researchers tested it on three popular open-source VLMs: Qwen2.5-VL, LLaVA-1.5, and InstructBLIP. The results were surprisingly divergent, revealing a “cure or poison” effect depending on the model.
For Qwen2.5-VL, Prompt-in-Image proved to be a significant improvement. Its accuracy on the POPE hallucination benchmark increased by 4.1%, and it also showed a reduction in hallucination rates on the MS-COCO dataset. This suggests that for Qwen, embedding instructions visually enhanced its ability to understand images and generate accurate descriptions, even helping it detect small or hidden objects it previously missed.
In stark contrast, LLaVA-1.5 and InstructBLIP suffered a severe performance drop. Their accuracy plummeted from around 84% to near-random levels (roughly 55% and 54%, respectively). Both models also defaulted to answering “yes” to almost every question, indicating that they had essentially lost the ability to discriminate.
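For context, POPE frames hallucination as binary object-presence questions (“Is there a &lt;object&gt; in the image?”), so accuracy and the yes-rate can be scored directly from model answers. Here is a minimal sketch of that scoring, with a hypothetical model_answer function standing in for any of the three VLMs:

```python
# POPE-style scoring: each item is an (image, question, "yes"/"no") triple.
# `model_answer` is a hypothetical stand-in for querying one of the VLMs.
def pope_score(items, model_answer):
    correct = yes_count = 0
    for image, question, label in items:
        pred = "yes" if "yes" in model_answer(image, question).lower() else "no"
        correct += (pred == label)
        yes_count += (pred == "yes")
    n = len(items)
    # A yes-rate near 100% with ~50% accuracy is the collapse reported
    # for LLaVA-1.5 and InstructBLIP under Prompt-in-Image.
    return {"accuracy": correct / n, "yes_rate": yes_count / n}
```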
Unpacking the Differences: Why Some Models Thrive and Others Fail
The researchers conducted a detailed analysis to understand these contrasting outcomes. They found that the vision encoders in LLaVA and InstructBLIP, which are based on CLIP, exhibited an excessive attention bias towards the embedded text regions. This means these models focused too much on the text within the image, disrupting their overall visual understanding and leading to increased hallucination.
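One way to make this kind of analysis concrete is to probe where CLIP’s attention lands. The sketch below is a rough, hypothetical probe, not the paper’s exact methodology: it assumes the prompt was rendered in a strip along the bottom of the image, loads a stock CLIP vision encoder via Hugging Face transformers, and measures how much of the [CLS] token’s last-layer attention falls on the bottom rows of patches.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

# Hypothetical probe: how much of CLIP's [CLS] attention lands on the patches
# covering the rendered prompt (assumed to sit in a strip along the bottom).
model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch16")

img = Image.open("prompt_in_image.jpg").convert("RGB")  # output of embed_prompt above
inputs = processor(images=img, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# ViT-B/16 at 224 px -> a 14x14 grid of patches, plus the [CLS] token at index 0.
grid = 14
last_attn = out.attentions[-1]                          # (batch, heads, 197, 197)
cls_to_patches = last_attn[0, :, 0, 1:].mean(0)         # head-averaged [CLS] -> patch weights
cls_to_patches = cls_to_patches / cls_to_patches.sum()  # renormalize over patches only

# Treat the bottom two patch rows as the "text region" (an illustrative choice
# matching the bottom-strip rendering; patches are stored in row-major order).
rows = torch.arange(grid).repeat_interleave(grid)
text_mask = rows >= grid - 2

text_share = cls_to_patches[text_mask].sum().item()
uniform_share = text_mask.float().mean().item()  # what uniform attention would give
print(f"attention on text region: {text_share:.1%} (uniform baseline: {uniform_share:.1%})")
```

A text share far above the uniform baseline would be the kind of attention bias the authors describe for CLIP-based encoders.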
On the other hand, Qwen’s vision encoder demonstrated remarkable robustness in handling images with embedded text. This resilience is likely due to Qwen’s diverse pre-training, which includes processing images with naturally embedded text and OCR data. This training helps Qwen treat text as a normal visual element rather than a disruptive signal.
Furthermore, Prompt-in-Image was found to reduce the “modality gap” in Qwen. The modality gap refers to the separation between image and text representations in a VLM’s internal space. By unifying the input through the visual channel, Prompt-in-Image helped Qwen align its visual and textual understanding more closely, leading to improved performance and reduced hallucination.
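The modality gap is commonly quantified as the distance between the centroids of normalized image and text embeddings (as popularized by Liang et al., 2022). The sketch below computes that statistic in a stock CLIP embedding space as a stand-in; the paper measures the gap inside the VLM’s own representation space, and the checkpoint, file names, and captions here are placeholder assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Centroid-distance view of the modality gap, computed in a stock CLIP space.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

images = [Image.open(p).convert("RGB") for p in ["img1.jpg", "img2.jpg"]]  # placeholders
texts = ["a photo of a dog", "a photo of a cat"]

with torch.no_grad():
    img_emb = model.get_image_features(**processor(images=images, return_tensors="pt"))
    txt_emb = model.get_text_features(**processor(text=texts, return_tensors="pt", padding=True))

# Normalize to the unit sphere, then measure the distance between the two
# modality centroids: a larger value means a wider modality gap.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
gap = (img_emb.mean(0) - txt_emb.mean(0)).norm().item()
print(f"modality gap (centroid distance): {gap:.3f}")
```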
Implications for Future VLM Development
This research highlights that the way Vision-Language Models are trained on multimodal data significantly shapes their ability to handle novel input strategies. While embedding instructions directly into images can be highly beneficial for a model like Qwen2.5-VL, it is detrimental to others like LLaVA-1.5 and InstructBLIP because of attention biases in their CLIP-based vision encoders.
The findings suggest that simpler, unified input strategies, in which all information flows through a single modality, could be a promising direction for future research and may lead to more robust, less hallucination-prone models. For more details, you can read the full research paper here.


