TLDR: VIRTUE is a new embedding model that lets users interact with images using visual prompts like points or bounding boxes, in addition to text. It combines a segmentation model with a vision-language model to understand specific regions within an image while still grasping the overall scene. To evaluate these capabilities, the researchers created a new large-scale benchmark called SCaR. VIRTUE shows significant improvements on both general and visual-interactive tasks, paving the way for more intuitive human-AI interactions.
In the rapidly evolving field of artificial intelligence, models that can understand and process multiple types of data, known as multimodal AI, have shown remarkable progress. In particular, vision-language models (VLMs) have become adept at following instructions and generating embeddings that capture the meaning of both images and text. These embeddings are crucial for tasks like searching for images using text descriptions, or vice versa.
However, a significant limitation of current embedding models is their inability to interact visually with users. Imagine wanting to search for a specific object within an image by simply pointing at it, or drawing a box around it. While generative AI models have started exploring these “visual-interactive” capabilities, embedding models have largely overlooked them. This gap means that models often struggle to understand a user’s intent when it’s focused on a particular part of an image, rather than the whole picture.
A new research paper addresses this gap with a novel solution: the Visual-InteRactive Text-Image Universal Embedder, or VIRTUE. This model extends the power of segmentation models (which identify and outline objects in images) and vision-language models into the realm of representation learning. By doing so, VIRTUE allows users to specify regions of interest using visual prompts like points, bounding boxes, or masks, enabling the embedder to handle complex and ambiguous scenarios with much greater precision.
Bridging the Gap: How VIRTUE Works
At its core, VIRTUE combines an off-the-shelf segmentation model, specifically SAM-2, with a pre-trained vision-language model. The segmentation model acts as a “visual prompt processor.” When a user provides a visual prompt (like drawing a box around an object), SAM-2 processes this input along with the image to create “segmentation embeddings.” These embeddings capture detailed, entity-level information about the specified region.
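To make the visual-prompt step concrete, here is a minimal sketch of how a user-drawn box can be run through the publicly available SAM-2 predictor. It only shows the prompt-to-mask step; how VIRTUE taps SAM-2's internal features as segmentation embeddings is specific to the paper, and the checkpoint name and calls below follow the public sam2 repository's examples rather than the authors' code.

```python
import numpy as np
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Load a public SAM-2 checkpoint (name follows the sam2 repo's examples).
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")

# Stand-in for a real RGB image (H, W, 3).
image = np.zeros((480, 640, 3), dtype=np.uint8)
predictor.set_image(image)

# A user-drawn bounding box (x0, y0, x1, y1) around the object of interest.
box = np.array([[100, 120, 300, 360]])
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
print(masks.shape, scores)  # one mask for the prompted region
```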
For scenarios where no visual prompt is given, VIRTUE samples points across the image to automatically extract fine-grained entity information. This ensures that even in non-interactive tasks, the model benefits from a richer understanding of individual objects. The segmentation embeddings are then combined with global image embeddings (from the VLM’s vision encoder) and text embeddings, and the full sequence is fed into a large language model, which produces a single, unified embedding. This unified embedding allows VIRTUE to learn from both visual-interactive and non-visual-interactive data, enabling it to perform entity-aware retrieval while still understanding the broader context of the scene.
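For readers who think in code, here is a minimal PyTorch sketch of this fusion step. The module names, dimensions, and pooling choice are assumptions for illustration, not VIRTUE's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualInteractiveEmbedder(nn.Module):
    """Illustrative fusion module: project segmentation and image features into
    the language model's token space, concatenate them with text embeddings,
    and pool the output into one retrieval embedding. All names and sizes here
    are hypothetical stand-ins, not VIRTUE's released code."""

    def __init__(self, seg_dim=256, vis_dim=768, llm_dim=1024):
        super().__init__()
        # Project SAM-style segmentation features and ViT patch features
        # into the LLM's embedding space.
        self.seg_proj = nn.Linear(seg_dim, llm_dim)
        self.img_proj = nn.Linear(vis_dim, llm_dim)
        # Stand-in for the decoder-only LLM backbone.
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, seg_feats, img_feats, text_embeds):
        # seg_feats:   (B, N_seg, seg_dim)  entity-level features for the prompted region
        # img_feats:   (B, N_img, vis_dim)  global image patch features
        # text_embeds: (B, N_txt, llm_dim)  embedded instruction/caption tokens
        tokens = torch.cat(
            [self.seg_proj(seg_feats), self.img_proj(img_feats), text_embeds], dim=1
        )
        hidden = self.llm(tokens)
        # Mean-pool (the paper may use a last-token/EOS scheme instead) and
        # L2-normalize to obtain a single unified embedding.
        return F.normalize(hidden.mean(dim=1), dim=-1)

# Example with random tensors standing in for real features.
model = VisualInteractiveEmbedder()
emb = model(torch.randn(2, 16, 256), torch.randn(2, 196, 768), torch.randn(2, 32, 1024))
print(emb.shape)  # torch.Size([2, 1024])
```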
A New Benchmark for Interactive AI: SCaR
To properly evaluate VIRTUE’s unique visual-interactive abilities, the researchers also introduced a new, large-scale benchmark called SCaR (Segmentation-and-Scene Caption Retrieval). Existing benchmarks primarily focus on text-based instructions and often simplify visual grounding tasks by cropping out the region of interest, thereby losing the crucial global scene context. SCaR addresses this by challenging models to retrieve the correct text caption for a specified object within its full scene context.
SCaR comprises an impressive 1 million samples, built from five publicly available datasets. A key innovation in SCaR is how it generates challenging “negative” captions. Instead of random sampling, GPT-4V (a powerful AI model) is used to create distractors by intelligently swapping elements (object, relation, or scene) in the ground-truth captions. This process, combined with a meticulous LLM-then-human inspection pipeline, ensures high-quality and diverse negative candidates, pushing models to perform fine-grained, context-aware reasoning.
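As a toy illustration of what these element swaps look like: each negative keeps the ground-truth caption intact except for one swapped element. The real pipeline prompts GPT-4V for the swaps and then filters them with an LLM-then-human pass; the function and distractor lists below are invented purely for demonstration.

```python
import random

def make_hard_negatives(obj, relation, scene, distractors):
    """Toy sketch of SCaR-style hard negatives: each negative caption swaps
    exactly one element (object, relation, or scene) of the ground truth."""
    positive = f"A {obj} {relation} in the {scene}."
    negatives = [
        f"A {random.choice(distractors['object'])} {relation} in the {scene}.",  # object swap
        f"A {obj} {random.choice(distractors['relation'])} in the {scene}.",     # relation swap
        f"A {obj} {relation} in the {random.choice(distractors['scene'])}.",     # scene swap
    ]
    return positive, negatives

positive, negatives = make_hard_negatives(
    "dog", "chasing a ball", "park",
    distractors={
        "object": ["cat", "child"],
        "relation": ["sleeping under a bench", "digging a hole"],
        "scene": ["beach", "living room"],
    },
)
print(positive, negatives)
```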
VIRTUE’s Performance and Impact
Experiments demonstrate that VIRTUE consistently achieves state-of-the-art performance. On 36 universal MMEB tasks, it shows significant improvements ranging from 3.1% to 8.5%. More impressively, on the five visual-interactive SCaR tasks, VIRTUE achieves gains of 15.2% to 20.3%. This indicates that equipping embedding models with visual-interactive capabilities not only benefits interactive scenarios but also enhances their performance in conventional tasks by enriching global context with detailed object-level information.
The practical implications of VIRTUE are far-reaching. It enables new applications such as segment-level retrieval, where users can select a region to find semantically matching images. It also supports “on-the-fly correction” with visual hinting, allowing users to guide the model interactively at inference time without needing to retrain it. For example, if the model misclassifies an object, a user can simply draw a bounding box around the correct object, and VIRTUE can instantly adjust its prediction.
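Here is a small sketch of how segment-level retrieval and visual hinting could look at inference time. The embed_image and embed_text callables are hypothetical placeholders for a visual-interactive embedder that returns L2-normalized vectors; they are not VIRTUE's actual API.

```python
import torch

def rank_captions(embed_image, embed_text, image, captions, box=None):
    """Rank candidate captions for an image. Passing a bounding box acts as a
    visual hint that steers the query embedding toward the selected region,
    with no retraining required."""
    query = embed_image(image, box=box)                      # (D,), unit-normalized
    cands = torch.stack([embed_text(c) for c in captions])   # (N, D), unit-normalized
    scores = cands @ query                                    # cosine similarity
    order = torch.argsort(scores, descending=True)
    return [(captions[i], scores[i].item()) for i in order]

# On-the-fly correction: if the unprompted ranking picks the wrong caption,
# re-run with a box drawn around the intended object, e.g.
# ranked = rank_captions(embed_image, embed_text, image, captions, box=(x0, y0, x1, y1))
```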
In conclusion, VIRTUE represents a significant advancement in multimodal AI, offering a generic framework for both instruction-following and visual-interactive embedding tasks. The introduction of the SCaR benchmark further opens up new possibilities for human-AI interaction, moving beyond text-only commands to a more intuitive and visually grounded understanding. To learn more about this groundbreaking work, you can read the full research paper here.