
Nyx: Advancing Retrieval-Augmented Generation for Mixed-Modal Information

TL;DR: A new research paper introduces Nyx, a unified mixed-modal retriever designed for Universal Retrieval-Augmented Generation (URAG). URAG aims to improve vision-language generation by retrieving and reasoning over mixed-modal information (text and images). To train Nyx, the researchers created NyxQA, a large-scale dataset of mixed-modal question-answer pairs derived from web documents. Nyx is trained in two stages: pre-training on diverse datasets and fine-tuning with feedback from Vision-Language Models (VLMs) to align retrieval with generative preferences. Experiments show Nyx significantly outperforms existing methods in both text-only and mixed-modal RAG scenarios, demonstrating its effectiveness on complex real-world data.

Large Language Models (LLMs) have transformed how we interact with information, especially through Retrieval-Augmented Generation (RAG). RAG systems enhance LLMs by fetching relevant documents from an external knowledge base, allowing them to provide more accurate and up-to-date responses. However, a significant limitation of most existing RAG systems is their focus on text-only documents. In the real world, information often comes in a ‘mixed-modal’ format, combining text and images, which current systems struggle to handle effectively.
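The core retrieval step of a RAG system can be sketched in a few lines. The snippet below is a minimal, illustrative toy, not the paper's implementation: document embeddings are random vectors standing in for what a trained retriever like Nyx would produce, and the corpus, `retrieve` function, and prompt template are all made up for this example.

```python
import numpy as np

# Toy corpus: in a real RAG system, these embeddings would come from a
# trained retriever such as Nyx; here we use random unit vectors purely
# to illustrate the retrieval mechanics.
rng = np.random.default_rng(0)
corpus = ["doc about cats", "doc about RAG", "doc about weather"]
doc_embs = rng.normal(size=(3, 8))
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)

def retrieve(query_emb: np.ndarray, k: int = 1) -> list[str]:
    """Return the top-k documents ranked by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    scores = doc_embs @ q          # cosine similarity (all vectors unit-norm)
    top = np.argsort(-scores)[:k]  # indices of the k highest scores
    return [corpus[i] for i in top]

# The retrieved context is prepended to the user's question before it
# reaches the LLM, grounding the generation in external knowledge.
context = retrieve(doc_embs[1], k=1)[0]
prompt = f"Context: {context}\nQuestion: What is RAG?"
```

URAG generalizes exactly this step: instead of text-only documents, both the query and the indexed documents may interleave text and images, so the embeddings must capture both modalities in one shared space.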

This challenge is at the heart of Universal Retrieval-Augmented Generation (URAG). URAG aims to build systems that can retrieve and reason over information presented in any combination of text and images, ultimately improving the quality of vision-language generation. Imagine asking a question that involves both a picture and some descriptive text, and expecting the system to understand and respond using both modalities. This is the goal URAG seeks to achieve.

Introducing Nyx: A Unified Mixed-Modal Retriever

To address the complexities of URAG, researchers have developed a new unified mixed-modal retriever called Nyx. This system is specifically designed to handle scenarios where both the questions (queries) and the documents in the knowledge base can contain a mix of text and images. Nyx aims to bridge the gap between the diverse ways information is presented in the real world and the capabilities of RAG systems.

Building a Realistic Dataset: NyxQA

One of the biggest hurdles in developing mixed-modal systems is the scarcity of realistic training data. To overcome this, the creators of Nyx introduced NyxQA, a novel dataset built through a four-stage automated pipeline. This pipeline leverages web documents to create a rich collection of mixed-modal question-answer pairs that truly reflect real-world information needs. Unlike older datasets that might focus on specific combinations of modalities, NyxQA supports retrieval and generation involving arbitrarily structured text, images, and their interleaved formats.

The NyxQA construction process involves:

  • Web Document Sampling: Gathering naturally interleaved image-text documents from large web datasets.
  • QA Pair Generation: Using powerful Vision-Language Models (VLMs) to generate questions and answers based on these documents, specifically prompting for questions that reference visual content when images are present.
  • Post-Processing: A multi-step procedure to filter out errors, refine answers for clarity and completeness, and generate plausible incorrect options for multiple-choice questions.
  • Hard Negative Mining: Identifying challenging ‘negative’ documents that are similar but incorrect, which helps the retriever learn to distinguish truly relevant information.
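The last stage, hard negative mining, can be sketched as follows. This is a hypothetical illustration of the general technique, not the paper's pipeline: the `mine_hard_negatives` function and the random stand-in embeddings are assumptions made for the example. The idea is that the documents scoring closest to the gold document, without being it, make the most informative negatives for contrastive training.

```python
import numpy as np

# Stand-in document embeddings; a real pipeline would embed the actual
# mixed-modal corpus with the current retriever checkpoint.
rng = np.random.default_rng(1)
num_docs, dim = 10, 16
doc_embs = rng.normal(size=(num_docs, dim))
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)

def mine_hard_negatives(query_emb, gold_idx, num_negatives=3):
    """Pick the top-scoring non-gold documents as hard negatives."""
    q = query_emb / np.linalg.norm(query_emb)
    scores = doc_embs @ q
    ranked = np.argsort(-scores)            # best-scoring first
    # Filter out the gold document, keep the hardest distractors.
    return [int(i) for i in ranked if i != gold_idx][:num_negatives]

# A query near document 4: every other high-scoring document becomes a
# hard negative the retriever must learn to rank below the gold one.
query = doc_embs[4] + 0.05 * rng.normal(size=dim)
negatives = mine_hard_negatives(query, gold_idx=4)
```

These (query, gold, hard-negative) triples then feed the contrastive objective used during retriever training.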

Nyx’s Two-Stage Training Approach

Nyx is trained using a sophisticated two-stage framework:

  1. Pre-training: In the first stage, Nyx undergoes extensive pre-training on the NyxQA dataset, along with various other open-source retrieval datasets. This stage establishes a broad foundation for general-purpose multimodal retrieval capabilities. It also incorporates Matryoshka Representation Learning (MRL), which allows the model to create compact yet expressive embeddings, enabling flexible trade-offs between retrieval performance and memory efficiency.
  2. VLM-Guided Fine-tuning: The second stage involves supervised fine-tuning. Here, Nyx is refined using feedback from downstream Vision-Language Models (VLMs). This crucial step aligns Nyx’s retrieval outputs with the specific preferences of generative models, ensuring that the retrieved information is not just relevant, but also useful for generating high-quality responses. This feedback-driven approach helps bridge the gap between general retrieval effectiveness and the actual needs of a VLM during generation.
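The Matryoshka idea mentioned in stage 1 can be shown mechanically. The sketch below only demonstrates the inference-time trade-off, truncating an embedding to a prefix of its dimensions and re-normalizing; the random vectors and the `truncate` helper are assumptions for illustration. What MRL actually contributes is a training objective that makes those prefixes informative on their own.

```python
import numpy as np

# Stand-in for full-dimensional embeddings produced by an MRL-trained
# retriever; with MRL, the first d dimensions form a usable embedding
# for any chosen d.
rng = np.random.default_rng(2)
full_dim = 64
doc_embs = rng.normal(size=(100, full_dim))

def truncate(embs: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` dimensions and L2-normalize the result."""
    small = embs[:, :dim]
    return small / np.linalg.norm(small, axis=1, keepdims=True)

docs_full = truncate(doc_embs, 64)   # full fidelity
docs_small = truncate(doc_embs, 16)  # 4x smaller index, cheaper search
```

The practical payoff is that one trained model serves multiple deployment budgets: a memory-constrained index can store 16-dimensional prefixes while a quality-critical one keeps all 64, with no retraining.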

Impressive Performance Across Diverse Tasks

Extensive experiments have shown that Nyx delivers exceptional performance. It not only competes effectively with established models on standard text-only RAG benchmarks but truly shines in the more general and realistic URAG setting. Nyx significantly improves generation quality in vision-language tasks, demonstrating its strong suitability for handling complex mixed-modal information.

For instance, in multimodal tasks like MultimodalQA and NyxQA, Nyx consistently outperforms previous state-of-the-art models. The feedback-driven fine-tuning stage proved particularly impactful, leading to substantial accuracy gains. The research also explored the impact of data scale, showing that more training data leads to better URAG performance, and the effect of retrieved document count, where Nyx maintained robust performance even with fewer documents.

Furthermore, Nyx’s ability to generalize across different generative VLMs, even those it wasn’t specifically fine-tuned with, highlights the robustness of its VLM-guided feedback mechanism. The Matryoshka Representation Learning also allows Nyx to maintain strong performance even when its embedding dimensions are significantly reduced, offering efficiency benefits for real-world applications.

The Future of Retrieval-Augmented Generation

The development of Nyx and the NyxQA dataset marks a significant step forward in the field of Retrieval-Augmented Generation. By pioneering the exploration of URAG and providing a unified retriever optimized for mixed-modal content, this research paves the way for next-generation RAG systems that can truly understand and leverage the rich, diverse information found in the real world. The code for Nyx is publicly available for further research and development. You can find more details in the full research paper.

Ananya Rao
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach out to her at: [email protected]
