TLDR: FinCPRG introduces a novel bidirectional generation pipeline for creating high-quality financial Chinese passage retrieval datasets. It uses LLMs to generate hierarchical queries (intra-document and cross-document) and employs an indirect positives mining method to enrich relevance labels. Constructed from 1.3k Chinese financial reports, the FinCPRG dataset has been validated as an effective benchmark and training resource, significantly improving retrieval model performance in the financial domain.
In the world of information retrieval, finding the right piece of information from a vast collection of documents is crucial. This is especially true in specialized fields like finance, where precise and relevant data can make a significant difference. However, creating high-quality datasets for training and evaluating these retrieval systems has traditionally been expensive and challenging, and existing datasets often lack the specific nuances of financial language and cross-document relationships.
A new research paper introduces FinCPRG, a novel approach designed to overcome these hurdles. This bidirectional generation pipeline aims to create a rich and comprehensive dataset for financial Chinese passage retrieval, featuring hierarchical queries and detailed relevance labels. The core idea is to leverage the power of large language models (LLMs) to automate and enhance the dataset construction process, moving beyond the limitations of previous methods that often struggled with complex, multi-document queries and consistent quality control.
A Two-Way Approach to Query Generation
The FinCPRG pipeline employs two distinct strategies for generating queries, ensuring a wide range of query types and granularities:
Bottom-Up for Intra-Document Queries: This method focuses on generating queries from within a single financial document. It starts by cleaning and segmenting research reports into smaller chunks. Then, LLMs are prompted to create both sentence-level queries (very specific) and passage-level queries (broader themes) simultaneously. The pipeline also completes ambiguous references, for example replacing a generic ‘company’ with the specific company name extracted from the report’s metadata, so that each query is precise and can be understood on its own.
Top-Down for Cross-Document Queries: This approach mimics how a human expert might approach a collection of financial reports. It groups report titles based on key financial elements like industry, topic, and time. LLMs are then used to generate ‘topic-level’ queries, representing broader intentions that might span multiple documents. These high-level intentions are further broken down into fine-grained subqueries, guiding the retrieval system to find relevant information across different reports.
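The two non-LLM steps described above, reference completion and title grouping, can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the Report fields, the AMBIGUOUS_REFERENCES list, and both function names are assumptions made for this example, and the LLM prompting steps are left out entirely.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    """Hypothetical metadata record for one research report."""
    title: str
    company: str
    industry: str
    topic: str
    period: str  # e.g. "2024Q1"


# Assumed list of generic mentions to resolve (English stand-ins for illustration).
AMBIGUOUS_REFERENCES = ("the company", "this company")


def complete_references(query: str, report: Report) -> str:
    """Bottom-up step: swap generic mentions for the company name in the metadata."""
    for phrase in AMBIGUOUS_REFERENCES:
        query = query.replace(phrase, report.company)
    return query


def group_titles(reports):
    """Top-down step: bucket report titles by (industry, topic, period)."""
    groups = defaultdict(list)
    for report in reports:
        groups[(report.industry, report.topic, report.period)].append(report.title)
    return dict(groups)


if __name__ == "__main__":
    reports = [
        Report("Q1 earnings review", "ExampleCo", "banking", "earnings", "2024Q1"),
        Report("Q1 profit drivers", "OtherCo", "banking", "earnings", "2024Q1"),
    ]
    print(complete_references("What drove the company's revenue growth?", reports[0]))
    print(group_titles(reports))
```

Each group of titles would then be handed to an LLM to propose a topic-level query and its subqueries; that prompting step depends on the model and prompts used, so it is not shown here.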
Enriching Relevance with Indirect Mining
Beyond generating queries, FinCPRG also introduces a method for annotating relevance between queries and passages. Direct mapping, where a generated query is linked to its source passage, provides a baseline but often misses other relevant passages. To address this, the pipeline incorporates an ‘indirect positives mining’ method. A ‘reranker’ model scores the similarity between pairs of queries within a specific context (e.g., queries within the same document or topic cluster). When a pair scores above a high similarity threshold, the passages already linked to one query can be treated as relevant to the other, which yields additional query-passage pairs. This enriches the dataset and reduces ‘false negatives’ (relevant passages that are left unlabeled).
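The mining step can be illustrated with a small Python sketch. Several things here are assumptions rather than details from the paper: the function names, the (query, passage id, context id) tuple layout, the 0.8 threshold, and the symmetric transfer of labels between matched queries. The toy_score function is a simple token-overlap stand-in so the example runs on its own; in practice a real reranker model would be passed in as the score argument.

```python
from itertools import combinations


def toy_score(a: str, b: str) -> float:
    """Stand-in for a reranker: Jaccard overlap of lowercase whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def mine_indirect_positives(queries, score=toy_score, threshold=0.8):
    """Return {query index: set of passage ids judged relevant}.

    queries: list of (query_text, source_passage_id, context_id) tuples, where
    context_id marks the document or topic cluster the query belongs to.
    """
    # Direct mapping: every query starts with its own source passage.
    positives = {i: {pid} for i, (_, pid, _) in enumerate(queries)}

    for (i, (qi, pi, ci)), (j, (qj, pj, cj)) in combinations(enumerate(queries), 2):
        if ci != cj or pi == pj:
            continue  # only compare queries in the same context, from different passages
        if score(qi, qj) >= threshold:
            # Assumed rule: highly similar queries share each other's source passage.
            positives[i].add(pj)
            positives[j].add(pi)
    return positives


if __name__ == "__main__":
    queries = [
        ("ExampleCo 2024 net profit growth", "p1", "docA"),
        ("ExampleCo net profit growth 2024", "p2", "docA"),
        ("dividend policy outlook", "p3", "docA"),
    ]
    print(mine_indirect_positives(queries))
```

Restricting comparisons to a shared context keeps the number of scored pairs small, and a high threshold trades recall for precision in the added labels.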
The FinCPRG Dataset and Its Impact
Using this sophisticated pipeline, the researchers constructed the Financial Chinese Passage Retrieval Generated dataset (FinCPRG) from nearly 1,300 Chinese financial research reports. This dataset includes queries at three granularity levels (sentence, passage, and topic) and boasts rich relevance labels, making it a valuable resource for the financial domain.
The quality and effectiveness of FinCPRG were evaluated in several ways. Assessments of the mined relevance labels showed high consistency with human judgments, validating the pipeline’s ability to identify accurate relationships. Furthermore, FinCPRG was tested as both a benchmark and a training dataset. When used as a benchmark, it showed strong correlation with existing financial retrieval benchmarks, confirming its utility for evaluating models. When used to fine-tune open-source retrieval models, FinCPRG led to significant performance improvements, especially for models with weaker initial performance. This highlights the dataset’s potential to enhance retrieval capabilities in low-resource domains.
While the pipeline demonstrates favorable scalability, the authors acknowledge certain limitations, such as the coverage of raw data, the inherent variability in a multi-stage LLM-based system, and potential quality constraints. Future work will focus on addressing these areas to further refine the dataset and pipeline.
This work, detailed in the paper ‘FinCPRG: A Bidirectional Generation Pipeline for Hierarchical Queries and Rich Relevance in Financial Chinese Passage Retrieval’, represents a significant step forward in automating the creation of high-quality, domain-specific datasets for information retrieval, particularly in the complex financial sector.


