FinCPRG: Advancing Financial Passage Retrieval with a Bidirectional Query Generation Pipeline

TLDR: FinCPRG introduces a novel bidirectional generation pipeline for creating high-quality financial Chinese passage retrieval datasets. It uses LLMs to generate hierarchical queries (intra-document and cross-document) and employs an indirect positives mining method to enrich relevance labels. Constructed from 1.3k Chinese financial reports, the FinCPRG dataset has been validated as an effective benchmark and training resource, significantly improving retrieval model performance in the financial domain.

In the world of information retrieval, finding the right piece of information from a vast collection of documents is crucial. This is especially true in specialized fields like finance, where precise and relevant data can make a significant difference. However, creating high-quality datasets for training and evaluating these retrieval systems has traditionally been expensive and challenging, often lacking the specific nuances of financial language and cross-document relationships.

A new research paper introduces FinCPRG, a novel approach designed to overcome these hurdles. This bidirectional generation pipeline aims to create a rich and comprehensive dataset for financial Chinese passage retrieval, featuring hierarchical queries and detailed relevance labels. The core idea is to leverage the power of large language models (LLMs) to automate and enhance the dataset construction process, moving beyond the limitations of previous methods that often struggled with complex, multi-document queries and consistent quality control.

A Two-Way Approach to Query Generation

The FinCPRG pipeline employs two distinct strategies for generating queries, ensuring a wide range of query types and granularities:

Bottom-Up for Intra-Document Queries: This method generates queries from within a single financial document. It starts by cleaning and segmenting research reports into smaller chunks. LLMs are then prompted to create sentence-level queries (very specific) and passage-level queries (broader themes) at the same time. A reference-completion step then resolves ambiguous mentions, for example replacing a generic ‘company’ with the specific company name extracted from the report’s metadata, so that each query can be understood without its source document.
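To make the reference-completion step concrete, here is a minimal sketch in Python. It assumes a simple string substitution over English text; the list of generic mentions, the `company_name` metadata field, and the function name are hypothetical illustrations, not the rules or code used in the paper.

```python
import re

# Hypothetical generic mentions to resolve (an assumption for illustration).
GENERIC_MENTIONS = ("the company", "this company")

# Match any generic mention as a whole phrase, ignoring case.
_MENTION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(m) for m in GENERIC_MENTIONS) + r")\b",
    re.IGNORECASE,
)

def complete_references(query: str, metadata: dict) -> str:
    """Replace generic company mentions with the name found in report metadata."""
    name = metadata.get("company_name")  # hypothetical metadata field
    if not name:
        return query  # nothing to substitute, leave the query unchanged
    return _MENTION_RE.sub(name, query)

example = complete_references(
    "What drove the company's revenue growth in Q3?",
    {"company_name": "Acme Bank"},
)
print(example)
```

A real pipeline would likely need more than pattern matching (the source reports are in Chinese, and a report may mention several companies), so this should be read as a sketch of the idea only.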

Top-Down for Cross-Document Queries: This approach mimics how a human expert might approach a collection of financial reports. It groups report titles based on key financial elements like industry, topic, and time. LLMs are then used to generate ‘topic-level’ queries, representing broader intentions that might span multiple documents. These high-level intentions are further broken down into fine-grained subqueries, guiding the retrieval system to find relevant information across different reports.
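The grouping step of the top-down strategy can be sketched as follows. This is an illustrative outline under stated assumptions: the record fields (`industry`, `topic`, `period`), the function names, and the prompt wording are all hypothetical, and the LLM call itself is left out.

```python
from collections import defaultdict

def group_titles(reports):
    """Group report titles by (industry, topic, period)."""
    groups = defaultdict(list)
    for report in reports:
        key = (report["industry"], report["topic"], report["period"])
        groups[key].append(report["title"])
    return dict(groups)

def build_topic_prompt(key, titles):
    """Build a prompt asking for one topic-level query plus subqueries."""
    industry, topic, period = key
    listing = "\n".join(f"- {title}" for title in titles)
    return (
        f"Industry: {industry}\nTopic: {topic}\nPeriod: {period}\n"
        f"Report titles:\n{listing}\n"
        "Write one topic-level query that spans these reports, "
        "then break it into fine-grained subqueries."
    )

# Toy records with made-up titles, for illustration only.
reports = [
    {"title": "Bank A annual review", "industry": "banking", "topic": "earnings", "period": "2023"},
    {"title": "Bank B annual review", "industry": "banking", "topic": "earnings", "period": "2023"},
    {"title": "Insurer C outlook", "industry": "insurance", "topic": "strategy", "period": "2024"},
]
groups = group_titles(reports)
prompt = build_topic_prompt(
    ("banking", "earnings", "2023"),
    groups[("banking", "earnings", "2023")],
)
print(prompt)
```

Each prompt would then be sent to an LLM, and the returned subqueries matched against passages from the reports in that group.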

Enriching Relevance with Indirect Mining

Beyond generating queries, FinCPRG also introduces a method for annotating relevance between queries and passages. Direct mapping, where a generated query is linked to its source passage, provides a baseline but often misses other relevant passages. To address this, the pipeline incorporates an ‘indirect positives mining’ method: a ‘reranker’ model scores the similarity between query pairs within a specific context (e.g., queries within the same document or topic cluster). Pairs that exceed a high similarity threshold yield additional relevant query-passage pairs, enriching the dataset and reducing ‘false negatives’ (relevant passages that would otherwise go unlabeled).
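A rough sketch of indirect positives mining is shown below. It makes two assumptions that should be stated plainly: a toy word-overlap (Jaccard) score stands in for the reranker model, and when two queries in the same context are similar enough, each query is labeled as relevant to the other's source passage. The field names, the threshold of 0.8, and the function names are hypothetical.

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Toy similarity: word overlap. A stand-in for a real reranker score."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

def mine_indirect_positives(queries, score_fn=jaccard, threshold=0.8):
    """Return extra (query_id, passage_id) positives mined from similar queries.

    Each query is a dict with 'id', 'text', 'context', and 'passage_id',
    where 'context' is the document or topic cluster the query belongs to.
    """
    by_context = {}
    for query in queries:
        by_context.setdefault(query["context"], []).append(query)

    mined = set()
    for group in by_context.values():
        # Only compare queries that share a context.
        for q1, q2 in combinations(group, 2):
            if q1["passage_id"] == q2["passage_id"]:
                continue  # already linked through direct mapping
            if score_fn(q1["text"], q2["text"]) >= threshold:
                mined.add((q1["id"], q2["passage_id"]))
                mined.add((q2["id"], q1["passage_id"]))
    return mined

# Toy queries, for illustration only.
queries = [
    {"id": "q1", "text": "net profit growth of Acme Bank in 2023", "context": "doc1", "passage_id": "p1"},
    {"id": "q2", "text": "Acme Bank net profit growth in 2023", "context": "doc1", "passage_id": "p2"},
    {"id": "q3", "text": "dividend policy outlook", "context": "doc1", "passage_id": "p3"},
    {"id": "q4", "text": "Acme Bank net profit growth in 2023", "context": "doc2", "passage_id": "p9"},
]
mined = mine_indirect_positives(queries)
print(sorted(mined))
```

Restricting comparisons to a shared context keeps the number of scored pairs manageable, and the high threshold trades recall for precision in the added labels.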

The FinCPRG Dataset and Its Impact

Using this sophisticated pipeline, the researchers constructed the Financial Chinese Passage Retrieval Generated dataset (FinCPRG) from nearly 1,300 Chinese financial research reports. This dataset includes queries at three granularity levels (sentence, passage, and topic) and boasts rich relevance labels, making it a valuable resource for the financial domain.

The researchers evaluated the quality and effectiveness of FinCPRG in several ways. Assessments of the mined relevance labels showed high consistency with human judgments, supporting the pipeline’s ability to identify accurate relationships. FinCPRG was also tested as both a benchmark and a training dataset. As a benchmark, its results correlated strongly with those of existing financial retrieval benchmarks, confirming its utility for evaluating models. As training data for fine-tuning open-source retrieval models, it led to significant performance improvements, especially for models with weaker baseline performance. This highlights the dataset’s potential to enhance retrieval capabilities in low-resource domains.

While the pipeline demonstrates favorable scalability, the authors acknowledge certain limitations, such as the coverage of raw data, the inherent variability in a multi-stage LLM-based system, and potential quality constraints. Future work will focus on addressing these areas to further refine the dataset and pipeline.

This work, detailed in the paper FinCPRG: A Bidirectional Generation Pipeline for Hierarchical Queries and Rich Relevance in Financial Chinese Passage Retrieval, represents a significant step forward in automating the creation of high-quality, domain-specific datasets for information retrieval, particularly in the complex financial sector.
