TLDR: VIRTUE is a new embedding model that lets users interact with images using visual prompts like points or bounding boxes, in addition to text. It combines a segmentation model with a vision-language model to understand specific regions within an image while still grasping the overall scene. To evaluate these capabilities, the researchers created a new large-scale benchmark called SCaR. VIRTUE shows significant improvements on both general and visual-interactive tasks, paving the way for more intuitive human-AI interactions.
In the rapidly evolving field of artificial intelligence, models that can understand and process multiple types of data, known as multimodal AI, have shown remarkable progress. In particular, vision-language models (VLMs) have become adept at following instructions and generating embeddings that capture the meaning of both images and text. These embeddings are crucial for tasks like searching for images using text descriptions, or vice versa.
However, a significant limitation of current embedding models is their inability to interact visually with users. Imagine wanting to search for a specific object within an image by simply pointing at it, or drawing a box around it. While generative AI models have started exploring these “visual-interactive” capabilities, embedding models have largely overlooked them. This gap means that models often struggle to understand a user’s intent when it’s focused on a particular part of an image, rather than the whole picture.
A new research paper addresses this gap with a novel solution: the Visual-InteRactive Text-Image Universal Embedder, or VIRTUE. This model extends the power of segmentation models (which identify and outline objects in images) and vision-language models into the realm of representation learning. By doing so, VIRTUE allows users to specify regions of interest using visual prompts like points, bounding boxes, or masks, enabling the embedder to handle complex and ambiguous scenarios with much greater precision.
Bridging the Gap: How VIRTUE Works
At its core, VIRTUE combines an off-the-shelf segmentation model, specifically SAM-2, with a pre-trained vision-language model. The segmentation model acts as a “visual prompt processor.” When a user provides a visual prompt (like drawing a box around an object), SAM-2 processes this input along with the image to create “segmentation embeddings.” These embeddings capture detailed, entity-level information about the specified region.
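To make the visual-prompt step concrete, here is a minimal sketch of how a user-drawn box can be run through the publicly available SAM-2 predictor. It only shows the prompt-to-mask step; how VIRTUE taps SAM-2's internal features as segmentation embeddings is specific to the paper, and the checkpoint name and calls below follow the public sam2 repository's examples rather than the authors' code.

```python
import numpy as np
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Load a public SAM-2 checkpoint (name follows the sam2 repo's examples).
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")

# Stand-in for a real RGB image (H, W, 3).
image = np.zeros((480, 640, 3), dtype=np.uint8)
predictor.set_image(image)

# A user-drawn bounding box (x0, y0, x1, y1) around the object of interest.
box = np.array([[100, 120, 300, 360]])
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
print(masks.shape, scores)  # one mask for the prompted region
```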
For scenarios where no visual prompt is given, VIRTUE samples points across the image to automatically extract fine-grained entity information. This ensures that even in non-interactive tasks, the model benefits from a richer understanding of individual objects. The segmentation embeddings are then combined with global image embeddings (from the VLM’s vision encoder) and text embeddings, and the full sequence is fed into a large language model, which produces a single, unified embedding. This unified embedding allows VIRTUE to learn from both visual-interactive and non-visual-interactive data, enabling it to perform entity-aware retrieval while still understanding the broader context of the scene.
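For readers who think in code, here is a minimal PyTorch sketch of this fusion step. The module names, dimensions, and pooling choice are assumptions for illustration, not VIRTUE's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualInteractiveEmbedder(nn.Module):
    """Illustrative fusion module: project segmentation and image features into
    the language model's token space, concatenate them with text embeddings,
    and pool the output into one retrieval embedding. All names and sizes here
    are hypothetical stand-ins, not VIRTUE's released code."""

    def __init__(self, seg_dim=256, vis_dim=768, llm_dim=1024):
        super().__init__()
        # Project SAM-style segmentation features and ViT patch features
        # into the LLM's embedding space.
        self.seg_proj = nn.Linear(seg_dim, llm_dim)
        self.img_proj = nn.Linear(vis_dim, llm_dim)
        # Stand-in for the decoder-only LLM backbone.
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, seg_feats, img_feats, text_embeds):
        # seg_feats:   (B, N_seg, seg_dim)  entity-level features for the prompted region
        # img_feats:   (B, N_img, vis_dim)  global image patch features
        # text_embeds: (B, N_txt, llm_dim)  embedded instruction/caption tokens
        tokens = torch.cat(
            [self.seg_proj(seg_feats), self.img_proj(img_feats), text_embeds], dim=1
        )
        hidden = self.llm(tokens)
        # Mean-pool (the paper may use a last-token/EOS scheme instead) and
        # L2-normalize to obtain a single unified embedding.
        return F.normalize(hidden.mean(dim=1), dim=-1)

# Example with random tensors standing in for real features.
model = VisualInteractiveEmbedder()
emb = model(torch.randn(2, 16, 256), torch.randn(2, 196, 768), torch.randn(2, 32, 1024))
print(emb.shape)  # torch.Size([2, 1024])
```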
A New Benchmark for Interactive AI: SCaR
To properly evaluate VIRTUE’s unique visual-interactive abilities, the researchers also introduced a new, large-scale benchmark called SCaR (Segmentation-and-Scene Caption Retrieval). Existing benchmarks primarily focus on text-based instructions and often simplify visual grounding tasks by cropping out the region of interest, thereby losing the crucial global scene context. SCaR addresses this by challenging models to retrieve the correct text caption for a specified object within its full scene context.
SCaR comprises an impressive 1 million samples, built from five publicly available datasets. A key innovation in SCaR is how it generates challenging “negative” captions. Instead of random sampling, GPT-4V (a powerful AI model) is used to create distractors by intelligently swapping elements (object, relation, or scene) in the ground-truth captions. This process, combined with a meticulous LLM-then-human inspection pipeline, ensures high-quality and diverse negative candidates, pushing models to perform fine-grained, context-aware reasoning.
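As a toy illustration of what these element swaps look like: each negative keeps the ground-truth caption intact except for one swapped element. The real pipeline prompts GPT-4V for the swaps and then filters them with an LLM-then-human pass; the function and distractor lists below are invented purely for demonstration.

```python
import random

def make_hard_negatives(obj, relation, scene, distractors):
    """Toy sketch of SCaR-style hard negatives: each negative caption swaps
    exactly one element (object, relation, or scene) of the ground truth."""
    positive = f"A {obj} {relation} in the {scene}."
    negatives = [
        f"A {random.choice(distractors['object'])} {relation} in the {scene}.",  # object swap
        f"A {obj} {random.choice(distractors['relation'])} in the {scene}.",     # relation swap
        f"A {obj} {relation} in the {random.choice(distractors['scene'])}.",     # scene swap
    ]
    return positive, negatives

positive, negatives = make_hard_negatives(
    "dog", "chasing a ball", "park",
    distractors={
        "object": ["cat", "child"],
        "relation": ["sleeping under a bench", "digging a hole"],
        "scene": ["beach", "living room"],
    },
)
print(positive, negatives)
```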
VIRTUE’s Performance and Impact
Experiments demonstrate that VIRTUE consistently achieves state-of-the-art performance. On 36 universal MMEB tasks, it shows significant improvements ranging from 3.1% to 8.5%. More impressively, on the five visual-interactive SCaR tasks, VIRTUE achieves gains of 15.2% to 20.3%. This indicates that equipping embedding models with visual-interactive capabilities not only benefits interactive scenarios but also enhances their performance in conventional tasks by enriching global context with detailed object-level information.
The practical implications of VIRTUE are far-reaching. It enables new applications such as segment-level retrieval, where users can select a region to find semantically matching images. It also supports “on-the-fly correction” with visual hinting, allowing users to guide the model interactively at inference time without needing to retrain it. For example, if the model misclassifies an object, a user can simply draw a bounding box around the correct object, and VIRTUE can instantly adjust its prediction.
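Here is a small sketch of how segment-level retrieval and visual hinting could look at inference time. The embed_image and embed_text callables are hypothetical placeholders for a visual-interactive embedder that returns L2-normalized vectors; they are not VIRTUE's actual API.

```python
import torch

def rank_captions(embed_image, embed_text, image, captions, box=None):
    """Rank candidate captions for an image. Passing a bounding box acts as a
    visual hint that steers the query embedding toward the selected region,
    with no retraining required."""
    query = embed_image(image, box=box)                      # (D,), unit-normalized
    cands = torch.stack([embed_text(c) for c in captions])   # (N, D), unit-normalized
    scores = cands @ query                                    # cosine similarity
    order = torch.argsort(scores, descending=True)
    return [(captions[i], scores[i].item()) for i in order]

# On-the-fly correction: if the unprompted ranking picks the wrong caption,
# re-run with a box drawn around the intended object, e.g.
# ranked = rank_captions(embed_image, embed_text, image, captions, box=(x0, y0, x1, y1))
```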
In conclusion, VIRTUE represents a significant advancement in multimodal AI, offering a generic framework for both instruction-following and visual-interactive embedding tasks. The introduction of the SCaR benchmark further opens up new possibilities for human-AI interaction, moving beyond text-only commands to a more intuitive and visually grounded understanding. To learn more about this groundbreaking work, you can read the full research paper here.