VeritasFi: An Advanced RAG System for Multi-modal Financial Data

TLDR: VeritasFi is a novel hybrid Retrieval-Augmented Generation (RAG) framework designed to enhance financial question answering. It addresses key challenges by incorporating a multi-modal preprocessing pipeline to handle diverse data formats (text, tables, figures), a tripartite hybrid retrieval engine for comprehensive information access (deep search, memory bank, tool use), and a two-stage re-ranking strategy that balances general financial knowledge with rapid company-specific adaptation. Experiments show VeritasFi significantly outperforms existing RAG systems in accuracy and relevance across various financial datasets.

In the rapidly evolving financial sector, accurate and contextually rich insights from vast public disclosures are paramount for informed decision-making and regulatory compliance. However, existing Question Answering (QA) systems powered by Retrieval-Augmented Generation (RAG) often struggle with two significant hurdles: processing the diverse formats of financial data, such as text, tables, and figures, and balancing broad applicability with the need for company-specific adaptation.

VeritasFi is a framework designed to address both challenges. Developed by researchers from institutions including SimpleWay.AI, University of Toronto, McMaster University, and McGill University, it is an adaptable, multi-tiered RAG framework built specifically for multi-modal financial question answering. The research paper, titled “VeritasFi: An Adaptable, Multi-tiered RAG Framework for Multi-modal Financial Question Answering,” describes three main components of this hybrid system: a preprocessing pipeline, a retrieval engine, and a re-ranking strategy.

Addressing Multi-modal Data Complexity

One of VeritasFi’s core components is its Context-Aware Knowledge Curation (CAKC) module. This multi-modal preprocessing pipeline transforms heterogeneous data from financial filings (complex textual descriptions, numerical data from tables, and trends depicted in figures) into a coherent, machine-readable format. This unification matters because many traditional RAG pipelines simply linearize documents as flat text, which loses the structure of tables and figures and leads to incomplete understanding. The CAKC module also applies three semantic enhancements: de-duplication of similar chunks, co-reference resolution to improve contextual sufficiency, and metadata generation that attaches section-level summaries so that each chunk retains its broader context.
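To make the de-duplication and metadata steps more concrete, here is a minimal sketch. It is not the paper's implementation: the `Chunk` type, the token-overlap (Jaccard) similarity, the 0.8 threshold, and the function names are all assumptions chosen for illustration, since the article does not say how similarity is measured.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical container for one preprocessed piece of a filing."""
    text: str
    section: str
    metadata: dict = field(default_factory=dict)


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two strings, in the range [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def deduplicate(chunks: list, threshold: float = 0.8) -> list:
    """Keep a chunk only if it is not too similar to one already kept."""
    kept = []
    for chunk in chunks:
        if all(jaccard(chunk.text, k.text) < threshold for k in kept):
            kept.append(chunk)
    return kept


def attach_section_summaries(chunks: list, summaries: dict) -> list:
    """Copy a section-level summary into each chunk's metadata."""
    for chunk in chunks:
        chunk.metadata["section_summary"] = summaries.get(chunk.section, "")
    return chunks


chunks = [
    Chunk("Revenue rose 12% in fiscal 2023", "MD&A"),
    Chunk("Revenue rose 12% in fiscal 2023 .", "MD&A"),  # near-duplicate
    Chunk("The board approved a dividend", "Governance"),
]
unique = attach_section_summaries(
    deduplicate(chunks),
    {"MD&A": "Management discussion of results"},  # illustrative summary text
)
```

A production pipeline would more likely compare embeddings than raw tokens, but the control flow (compare against kept chunks, then enrich the survivors) stays the same.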

Beyond unstructured filings, the CAKC pipeline also generates a High-Frequency Memory Bank. This structured, factual cache stores time-stamped answers to common quantitative questions, curated and verified by financial experts. This allows for rapid lookups, bypassing computationally intensive deep retrieval and generation steps for frequently asked queries.
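One way to picture such a cache is as a dictionary keyed by a normalized question, where each key holds several time-stamped answers. The sketch below assumes exact matching on a normalized string and uses placeholder answers; the article does not describe the actual matching method, and the class and method names here are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class MemoryEntry:
    answer: str
    as_of: date  # the date for which the answer was verified


class MemoryBank:
    """Hypothetical cache of verified, time-stamped answers."""

    def __init__(self) -> None:
        self._entries = {}

    @staticmethod
    def _key(question: str) -> str:
        # Lowercase, trim spaces and question marks, collapse whitespace.
        return " ".join(question.lower().strip(" ?").split())

    def add(self, question: str, answer: str, as_of: date) -> None:
        self._entries.setdefault(self._key(question), []).append(
            MemoryEntry(answer, as_of)
        )

    def lookup(self, question: str, on: Optional[date] = None) -> Optional[MemoryEntry]:
        """Return the newest entry (optionally no later than `on`), or None on a miss."""
        candidates = self._entries.get(self._key(question), [])
        if on is not None:
            candidates = [e for e in candidates if e.as_of <= on]
        return max(candidates, key=lambda e: e.as_of, default=None)


bank = MemoryBank()
bank.add("What was FY2023 revenue?", "ANSWER_OLD", date(2024, 1, 1))
bank.add("what was fy2023 revenue", "ANSWER_NEW", date(2024, 6, 1))
hit = bank.lookup("What was  FY2023 revenue?")
```

A `None` result signals a cache miss, at which point a system like the one described would fall back to the slower retrieval and generation path.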

A Tripartite Approach to Information Retrieval

VeritasFi introduces a Tripartite Hybrid Retrieval (THR) engine that operates in parallel to provide comprehensive answers. After a sophisticated query preprocessing stage that normalizes, disambiguates, and decomposes user queries into self-contained sub-queries, the THR engine routes these requests to three specialized modules:

  • Multi-Path Retrieval: For sub-queries requiring in-depth analysis of unstructured financial filings, this module employs diverse strategies simultaneously. It combines a BM25 Sparse Retriever (lexical search), a Dense Retriever (semantic search using embedded vectors), and a Metadata Retriever (matching queries against LLM-generated summaries). This multi-pronged approach maximizes recall by gathering a broad set of candidate chunks.

  • High-Frequency Memory Look-up: This module provides low-latency answers to common and recurring financial queries by matching the incoming user query against the pre-compiled, human-verified knowledge base. Because a match skips deep retrieval and generation, it significantly reduces latency for high-frequency questions.

  • Tool Use: To address queries requiring real-time information, such as current stock prices or recent corporate actions, this module integrates external APIs. It operates asynchronously, fetching current market data and event-specific information, which is then integrated into the conversational context for final answer synthesis.
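The parallel dispatch to the three modules can be sketched with `asyncio`. Everything below is a stand-in: the three coroutines return canned values instead of calling real retrievers or APIs, and the function names, the `hypothetical-market-api` label, and the order-preserving union of candidates are assumptions made for illustration only.

```python
import asyncio
from typing import Optional


async def multi_path_retrieval(sub_query: str) -> list:
    """Stand-in for sparse, dense, and metadata retrieval over filings."""
    sparse = ["sparse-hit:" + sub_query, "shared-chunk"]
    dense = ["dense-hit:" + sub_query, "shared-chunk"]
    metadata = ["metadata-hit:" + sub_query]
    # Union of candidates: keep first-seen order, drop repeated chunks.
    return list(dict.fromkeys(sparse + dense + metadata))


async def memory_lookup(sub_query: str) -> Optional[str]:
    """Stand-in for the memory bank: returns a cached answer or None."""
    cache = {"cached question": "CACHED_ANSWER"}
    return cache.get(sub_query)


async def tool_use(sub_query: str) -> dict:
    """Stand-in for an external API call (no network access here)."""
    await asyncio.sleep(0)  # yield control, as a real request would
    return {"source": "hypothetical-market-api", "query": sub_query}


async def tripartite_retrieve(sub_query: str) -> dict:
    """Run the three modules concurrently and collect their outputs."""
    chunks, cached, live = await asyncio.gather(
        multi_path_retrieval(sub_query),
        memory_lookup(sub_query),
        tool_use(sub_query),
    )
    return {"chunks": chunks, "memory": cached, "tool": live}


result = asyncio.run(tripartite_retrieve("fy2023 revenue"))
```

Taking the union of candidates favors recall over precision, which matches the article's description of the multi-path stage; narrowing the set is left to the re-ranker.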

Adaptable Re-ranking for Precision

To refine the retrieved candidates and improve precision, VeritasFi employs a two-stage Domain-to-Entity Adaptation Re-ranking (DAR) strategy. This innovative approach ensures the re-ranker is both financially knowledgeable and rapidly adaptable to specific entities:

  • Stage 1: Finance Reranker Training: A general financial re-ranker is initially trained on an abstracted, entity-agnostic dataset. This involves systematically masking entity-specific information (like product models, individual names, and company names) with placeholders. This stage builds a robust, general-purpose model capable of financial reasoning independent of memorized facts.

  • Stage 2: Target Company Adaptation: The general re-ranker is then specialized for a target company. To overcome the manual data creation bottleneck, an automated annotation module uses a Large Language Model (LLM) to label retrieved chunks as relevant or irrelevant. This process ensures the training data distribution mirrors inference-time conditions, enabling rapid, consistent, and scalable generation of high-quality training data for any new company.
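The entity masking used in Stage 1 can be illustrated with a small sketch. It assumes the entity names are already known and supplied as a dictionary; the article does not say how entities are detected, and the company, person, and product names below are invented for the example.

```python
import re


def mask_entities(text: str, entities: dict) -> str:
    """Replace each known entity name with a bracketed placeholder.

    `entities` maps a placeholder label (e.g. "COMPANY") to a list of names.
    """
    masked = text
    for label, names in entities.items():
        token = "[" + label + "]"
        # Longest names first, so a full name is masked before any shorter
        # name contained in it.
        for name in sorted(names, key=len, reverse=True):
            pattern = r"\b" + re.escape(name) + r"\b"
            masked = re.sub(pattern, lambda _m, t=token: t, masked,
                            flags=re.IGNORECASE)
    return masked


# Hypothetical entity list for one company's filings.
entities = {
    "COMPANY": ["Acme Motors"],
    "PERSON": ["Jane Doe"],
    "PRODUCT": ["Roadster Z"],
}
masked = mask_entities(
    "Acme Motors said Jane Doe will oversee the Roadster Z launch.",
    entities,
)
```

Training on text like this pushes a re-ranker to judge relevance from the financial content of a passage instead of from names it has memorized, which is the stated goal of the first stage.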

Demonstrated Superiority

Experiments on both public benchmarks (FinanceBench, FinQA) and in-house company datasets (Zeekr, Lotus) show VeritasFi outperforming existing RAG architectures, including GraphRAG and LightRAG. The framework consistently achieves higher factual correctness, response relevancy, context recall, and overall answer quality. Adding the CAKC module improves performance across all retrieval methods, particularly in context recall and factual correctness. The two-stage re-ranking strategy also delivers consistent improvements, which indicates that both general financial domain competence and target company specialization contribute to the result.

VeritasFi offers a scalable and robust solution for both general-domain and company-specific financial QA tasks. Its integrated architecture provides a blueprint for domain-specific RAG systems that must balance broad applicability with rapid specialization, a requirement in many sectors beyond finance. For more details, refer to the full research paper.

Karthik Mehta
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society.
