VeritasFi: An Advanced RAG System for Multi-modal Financial Data

TLDR: VeritasFi is a novel hybrid Retrieval-Augmented Generation (RAG) framework designed to enhance financial question answering. It addresses key challenges by incorporating a multi-modal preprocessing pipeline to handle diverse data formats (text, tables, figures), a tripartite hybrid retrieval engine for comprehensive information access (deep search, memory bank, tool use), and a two-stage re-ranking strategy that balances general financial knowledge with rapid company-specific adaptation. Experiments show VeritasFi significantly outperforms existing RAG systems in accuracy and relevance across various financial datasets.

In the rapidly evolving financial sector, accurate and contextually rich insights from vast public disclosures are paramount for informed decision-making and regulatory compliance. However, existing Question Answering (QA) systems powered by Retrieval-Augmented Generation (RAG) often struggle with two significant hurdles: processing the diverse formats of financial data, such as text, tables, and figures, and balancing broad applicability with the need for company-specific adaptation.

VeritasFi is a new framework aimed at both of these challenges. Developed by researchers from institutions including SimpleWay.AI, University of Toronto, McMaster University, and McGill University, it is an adaptable, multi-tiered RAG framework designed specifically for multi-modal financial question answering. The research paper, titled “VeritasFi: An Adaptable, Multi-tiered RAG Framework for Multi-modal Financial Question Answering,” describes how this hybrid system improves financial QA through several key innovations.

Addressing Multi-modal Data Complexity

One of VeritasFi’s core strengths lies in its Context-Aware Knowledge Curation (CAKC) module. This multi-modal preprocessing pipeline is engineered to seamlessly transform heterogeneous data from financial filings—including complex textual descriptions, numerical data from tables, and trends depicted in figures—into a coherent, machine-readable format. This unification is crucial because many traditional RAG pipelines simply linearize documents as flat text, leading to incomplete understanding. The CAKC module also incorporates semantic enhancements like de-duplication of similar chunks, co-reference resolution to improve contextual sufficiency, and metadata generation to attach section-level summaries, ensuring that each piece of information retains its broader context.
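To make the de-duplication and metadata steps concrete, here is a minimal sketch. It is not the paper's implementation: the `Chunk` type, the word-set Jaccard similarity, and the 0.8 threshold are all assumptions chosen for illustration, and a production pipeline would more likely compare embeddings.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    section_summary: str  # section-level context attached as metadata (hypothetical field)


def _tokens(text: str) -> set[str]:
    """Lowercased word set used as a crude similarity signal."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two texts' word sets."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def dedupe(chunks: list[Chunk], threshold: float = 0.8) -> list[Chunk]:
    """Keep a chunk only if it is not too similar to one already kept."""
    kept: list[Chunk] = []
    for chunk in chunks:
        if all(jaccard(chunk.text, k.text) < threshold for k in kept):
            kept.append(chunk)
    return kept


# Toy chunks; the text is made up for illustration.
chunks = [
    Chunk("Revenue rose 10% in 2023.", "Management discussion"),
    Chunk("Revenue rose 10% in 2023", "Management discussion"),
    Chunk("The board approved a dividend.", "Governance"),
]
unique = dedupe(chunks)
```

Each surviving chunk still carries its `section_summary`, which is the point of the metadata step: a retriever can match against the summary even when the chunk text alone is ambiguous.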

Beyond unstructured filings, the CAKC pipeline also generates a High-Frequency Memory Bank. This structured, factual cache stores time-stamped answers to common quantitative questions, curated and verified by financial experts. This allows for rapid lookups, bypassing computationally intensive deep retrieval and generation steps for frequently asked queries.
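One way to picture such a cache is a dictionary keyed by a normalized question, where each entry holds time-stamped facts. The sketch below is an assumption-laden illustration: the `MemoryBank` and `Fact` names, the exact-match-after-normalization lookup, and the placeholder answers are all hypothetical, since the article does not say how queries are matched against the bank.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Fact:
    answer: str
    as_of: date  # time stamp of the verified answer


def normalize(question: str) -> str:
    """Lowercase and strip punctuation so trivial variants share a key."""
    return " ".join(re.findall(r"[a-z0-9]+", question.lower()))


class MemoryBank:
    def __init__(self) -> None:
        self._facts: dict[str, list[Fact]] = {}

    def add(self, question: str, answer: str, as_of: date) -> None:
        self._facts.setdefault(normalize(question), []).append(Fact(answer, as_of))

    def lookup(self, question: str) -> Optional[Fact]:
        """Return the most recent fact, or None so the caller can fall back to deep retrieval."""
        facts = self._facts.get(normalize(question))
        if not facts:
            return None
        return max(facts, key=lambda f: f.as_of)


bank = MemoryBank()
bank.add("What was FY2023 revenue?", "placeholder answer A", date(2024, 1, 31))
bank.add("What was FY2023 revenue?", "placeholder answer B (restated)", date(2024, 6, 30))

hit = bank.lookup("what was fy2023 REVENUE")
miss = bank.lookup("What is the current share price?")
```

A hit returns immediately without touching the retrievers or the generator; a miss returns `None`, which signals that the query should go through the slower retrieval path.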

A Tripartite Approach to Information Retrieval

VeritasFi introduces a Tripartite Hybrid Retrieval (THR) engine whose three modules run in parallel to provide comprehensive answers. A query preprocessing stage first normalizes, disambiguates, and decomposes each user query into self-contained sub-queries. The THR engine then routes these sub-queries to three specialized modules:

Multi-Path Retrieval: For sub-queries requiring in-depth analysis of unstructured financial filings, this module employs diverse strategies simultaneously. It combines a BM25 Sparse Retriever (lexical search), a Dense Retriever (semantic search using embedded vectors), and a Metadata Retriever (matching queries against LLM-generated summaries). This multi-pronged approach maximizes recall by gathering a broad set of candidate chunks.
High-Frequency Memory Look-up: This module provides instantaneous answers to common and recurring financial queries by matching the incoming user query with the pre-compiled, human-verified knowledge base. This significantly reduces latency for high-frequency questions.
Tool Use: To address queries requiring real-time information, such as current stock prices or recent corporate actions, this module integrates external APIs. It operates asynchronously, fetching current market data and event-specific information, which is then integrated into the conversational context for final answer synthesis.
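The recall-oriented idea behind Multi-Path Retrieval can be sketched as running several retrievers on the same sub-query and taking the union of their results. In the sketch below the three retrievers are stand-in lambdas returning made-up chunk ids; the `multi_path_retrieve` function and its return shape are assumptions for illustration, not the paper's API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Retriever = Callable[[str], list[str]]  # sub-query -> chunk ids


def multi_path_retrieve(query: str, retrievers: dict[str, Retriever]) -> dict[str, list[str]]:
    """Run every retriever on the same sub-query and union the results.

    Returns a mapping from chunk id to the names of the paths that returned it.
    """
    with ThreadPoolExecutor(max_workers=len(retrievers)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in retrievers.items()}
        results = {name: fut.result() for name, fut in futures.items()}

    candidates: dict[str, list[str]] = {}
    for name, chunk_ids in results.items():
        for cid in chunk_ids:
            candidates.setdefault(cid, []).append(name)
    return candidates


# Stand-ins for the sparse, dense, and metadata retrievers (hypothetical outputs).
toy_retrievers: dict[str, Retriever] = {
    "bm25": lambda q: ["c1", "c2"],
    "dense": lambda q: ["c2", "c3"],
    "metadata": lambda q: ["c4"],
}
pool_of_candidates = multi_path_retrieve("2023 gross margin", toy_retrievers)
```

Keeping track of which paths returned each chunk is a design choice of this sketch; the merged pool is deliberately broad, and precision is left to the re-ranking stage described next.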

Adaptable Re-ranking for Precision

To refine the retrieved candidates and improve precision, VeritasFi employs a two-stage Domain-to-Entity Adaptation Re-ranking (DAR) strategy. This innovative approach ensures the re-ranker is both financially knowledgeable and rapidly adaptable to specific entities:

Stage 1: Finance Reranker Training: A general financial re-ranker is initially trained on an abstracted, entity-agnostic dataset. This involves systematically masking entity-specific information (like product models, individual names, and company names) with placeholders. This stage builds a robust, general-purpose model capable of financial reasoning independent of memorized facts.
Stage 2: Target Company Adaptation: The general re-ranker is then specialized for a target company. To overcome the manual data creation bottleneck, an automated annotation module uses a Large Language Model (LLM) to label retrieved chunks as relevant or irrelevant. This process ensures the training data distribution mirrors inference-time conditions, enabling rapid, consistent, and scalable generation of high-quality training data for any new company.
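Both stages can be illustrated with a small sketch. The placeholder tokens (`[COMPANY]`, `[PERSON]`, `[PRODUCT]`), the entity dictionary, and the `judge` callable standing in for an LLM annotator are all assumptions made for this example; the article does not specify how entities are detected or how the LLM is prompted.

```python
import re
from typing import Callable


def mask_entities(text: str, entities: dict[str, str]) -> str:
    """Stage 1 idea: replace each known entity surface form with its placeholder."""
    if not entities:
        return text
    # Longest names first so "Acme Motors" wins over "Acme".
    names = sorted(entities, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda m: entities[m.group(1)], text)


def build_adaptation_set(
    query: str,
    retrieved: list[str],
    judge: Callable[[str, str], bool],
) -> list[tuple[str, str, int]]:
    """Stage 2 idea: label each retrieved chunk as relevant (1) or irrelevant (0)."""
    return [(query, chunk, 1 if judge(query, chunk) else 0) for chunk in retrieved]


# Made-up entities and text for illustration.
entities = {
    "Acme Motors": "[COMPANY]",
    "Acme": "[COMPANY]",
    "Jane Doe": "[PERSON]",
    "Model Z": "[PRODUCT]",
}
masked = mask_entities("Jane Doe said Acme Motors will ship the Model Z in Q3.", entities)

# A keyword check stands in for the LLM annotator here.
toy_judge = lambda q, c: "margin" in c.lower()
pairs = build_adaptation_set(
    "What was the gross margin?",
    ["Gross margin improved year over year.", "The company opened a new office."],
    toy_judge,
)
```

Because the labeled chunks come from the retriever's own output rather than hand-picked passages, the negatives in this set are the kind of near-misses the re-ranker would also see at inference time.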

Demonstrated Superiority

Experiments on both public benchmarks (FinanceBench, FinQA) and in-house company datasets (Zeekr, Lotus) show VeritasFi significantly outperforming existing RAG architectures, including GraphRAG and LightRAG. The framework consistently achieves higher factual correctness, response relevancy, context recall, and overall answer quality. Adding the CAKC module provides a substantial performance boost across all retrieval methods, particularly in context recall and factual correctness. The two-stage re-ranking strategy also delivers consistent and substantial improvements, demonstrating the value of both general financial domain competence and target company specialization.

VeritasFi represents a significant step forward in financial question answering, offering a scalable and robust solution for both general-domain and company-specific QA tasks. Its integrated architecture provides a blueprint for domain-specific RAG systems that must balance broad applicability with rapid specialization, a critical requirement across many sectors beyond finance. For more details, refer to the full research paper.
