A New Framework for Evaluating Financial Information Retrieval in Banking

TL;DR: This research introduces a systematic methodology and an LLM-based query generation pipeline to create domain-specific information retrieval benchmarks for financial services. The pipeline generates single- and complex multi-document queries, incorporating a reasoning-augmented answerability assessment for high data quality. Using this, the KoBankIR dataset was built for Korean banking. Experiments show that existing retrieval models struggle with these complex queries, especially multi-document and comparative types, highlighting the need for advanced retrieval techniques in the financial sector.

In the rapidly evolving landscape of AI-driven financial services, the ability of large language models (LLMs) to accurately retrieve information is paramount. However, a significant challenge lies in the absence of suitable benchmarks that truly reflect the complex, domain-specific information needs of real-world banking scenarios. Traditional benchmarks often fall short, focusing on structured reports or lacking the multi-document and multi-hop queries common in customer inquiries. Furthermore, the cost and legal restrictions associated with using real customer data make building such benchmarks incredibly difficult.

To address these critical limitations, researchers from Kakaobank—Hyunkyu Kim, Yeeun Yoo, and Youngjun Kwak—have introduced a groundbreaking systematic methodology for constructing domain-specific information retrieval (IR) benchmarks. Their work, detailed in the paper “Query Generation Pipeline with Enhanced Answerability Assessment for Financial Information Retrieval”, proposes an innovative LLM-based query generation pipeline.

The Query Generation Pipeline

The core of this research is a sophisticated pipeline designed to create realistic and challenging queries. It comprises three main steps:

1. Single-document Query Generation: Initially, the pipeline generates queries for individual passages within banking documents using a powerful LLM like GPT-4o and domain-specific prompts. These queries are then filtered to ensure they are answerable based on their respective passages.

2. Multi-document Query Generation: This is where the pipeline truly shines, mimicking how users seek information across multiple sources. Based on an intensive review of actual customer inquiries, three types of multi-document queries are generated:

  • Topic-based Merging: Combines two or three single-document queries related to the same financial product into a single, cohesive question.
  • Context Deepening: Samples multiple query-passage pairs from the same document, allowing the LLM to generate questions that require deeper reasoning across related information.
  • Comparing and Contrasting: Identifies comparable passages across different products within a financial category to create queries that highlight similarities and differences, such as comparing prepayment penalties for different loan products.

3. Enhanced Answerability Assessment: A crucial component of the pipeline is its reasoning-augmented evaluator. Built on models like DeepSeek-R1-Distill-Qwen, this evaluator guides the model through explicit “Think” steps, significantly improving alignment with human judgments compared to previous automatic scoring methods. This ensures the high quality and reliability of the generated dataset, with a minimal false-positive rate.
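The three steps above can be sketched as a minimal pipeline. This is an illustrative assumption, not the paper's actual implementation: the prompts, the `call_llm` stub (standing in for GPT-4o on the generation side and DeepSeek-R1-Distill-Qwen with explicit "Think" steps on the evaluation side), and the toy answerability parsing are all hypothetical.

```python
# Hedged sketch of the query generation pipeline: single-document generation,
# topic-based merging, and an answerability filter. All prompts and the
# call_llm stub are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class QueryExample:
    query: str
    passages: list = field(default_factory=list)  # supporting passage ids

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (e.g., GPT-4o via an API client)."""
    return "Generated text for: " + prompt[:40]

def generate_single_doc_queries(passages: dict) -> list:
    """Step 1: one query per passage, using a domain-specific prompt."""
    queries = []
    for pid, text in passages.items():
        q = call_llm(f"Write a banking customer question answerable from: {text}")
        queries.append(QueryExample(query=q, passages=[pid]))
    return queries

def merge_by_topic(queries: list, product_of: dict) -> list:
    """Step 2 (topic-based merging): combine 2-3 queries about one product."""
    by_product = {}
    for q in queries:
        by_product.setdefault(product_of[q.passages[0]], []).append(q)
    merged = []
    for product, qs in by_product.items():
        if len(qs) >= 2:
            combined = call_llm(
                "Merge into one cohesive question: "
                + " | ".join(q.query for q in qs[:3])
            )
            pids = [p for q in qs[:3] for p in q.passages]
            merged.append(QueryExample(query=combined, passages=pids))
    return merged

def is_answerable(example: QueryExample, passages: dict) -> bool:
    """Step 3: answerability check (heavily simplified). The real evaluator
    walks through explicit 'Think' reasoning steps before judging."""
    verdict = call_llm(
        "Think step by step, then answer yes/no: can the question '"
        + example.query + "' be answered from: "
        + " ".join(passages[p] for p in example.passages)
    )
    return "no" not in verdict.lower().split()  # toy parsing of the verdict

passages = {
    "p1": "Prepayment penalty is 1.5% of the repaid principal ...",
    "p2": "Early repayment fee is waived after 3 years ...",
}
product_of = {"p1": "home-loan", "p2": "home-loan"}
single = generate_single_doc_queries(passages)
multi = merge_by_topic(single, product_of)
dataset = [q for q in single + multi if is_answerable(q, passages)]
```

Context-deepening and comparing-and-contrasting queries would follow the same pattern, differing only in how the query-passage pairs are sampled before the merge prompt.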

Introducing KoBankIR

As a concrete implementation of this methodology, the team constructed KoBankIR, the first Korean-language benchmark specifically designed for banking-domain information retrieval. KoBankIR consists of 815 high-quality queries derived from 204 official banking product disclosures. Unlike existing financial IR datasets, KoBankIR explicitly incorporates complex multi-document queries, reflecting real-world banking interactions where customers often need to synthesize information from various sources.


Experimental Insights and Future Directions

Experiments conducted on KoBankIR using various multilingual retrieval models revealed significant findings:

  • Existing retrieval models, including sparse, dense, and multi-vector approaches, struggle considerably with the domain-specific and complex multi-document queries in KoBankIR.
  • Hybrid retrieval strategies, which combine sparse and dense representations (e.g., BGE-M3 Sparse + Dense), generally yield the best performance, balancing lexical matching and semantic understanding. However, even these top-performing models show modest overall results, indicating substantial room for improvement.
  • Performance degrades as the number of supporting documents for a query increases, underscoring the inherent difficulty of multi-document retrieval.
  • Queries requiring comparative reasoning (e.g., “Comparing and Contrasting” types) pose a particular challenge for current retrieval models, showing a notable performance drop.
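To make the hybrid-retrieval finding above concrete, here is a hedged sketch of sparse-plus-dense score fusion: a weighted sum of a lexical overlap score and a dense-embedding cosine similarity, in the spirit of BGE-M3 Sparse + Dense. The 0.4/0.6 weights, the toy overlap function, and the example vectors are illustrative assumptions, not values from the paper.

```python
# Hybrid retrieval sketch: combine a sparse (lexical) score with a dense
# (embedding) similarity. Weights and toy data are assumptions.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def sparse_score(query_terms, doc_terms):
    """Toy lexical overlap, standing in for a BM25 / learned-sparse score."""
    return len(set(query_terms) & set(doc_terms)) / len(set(query_terms))

def hybrid_score(q_terms, d_terms, q_vec, d_vec, w_sparse=0.4, w_dense=0.6):
    """Weighted fusion of the two signals."""
    return w_sparse * sparse_score(q_terms, d_terms) + w_dense * cosine(q_vec, d_vec)

# Rank two candidate passages for one query about prepayment penalties.
q_terms = ["prepayment", "penalty", "loan"]
q_vec = [0.2, 0.8, 0.1]
docs = {
    "d1": (["prepayment", "penalty", "fee"], [0.3, 0.7, 0.2]),
    "d2": (["interest", "rate", "loan"], [0.9, 0.1, 0.4]),
}
ranked = sorted(
    docs,
    key=lambda d: hybrid_score(q_terms, docs[d][0], q_vec, docs[d][1]),
    reverse=True,
)
```

The fusion lets lexical matching catch exact financial terms (product names, fee labels) while the dense score captures paraphrases, which is the balance the experiments credit for the hybrid approaches' edge.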

This research not only provides a systematic approach for building high-quality, domain-specific IR benchmarks but also highlights the limitations of current retrieval models in handling the complexities of real-world financial information. The KoBankIR dataset serves as a vital tool for future research, pushing the boundaries for more effective retrieval techniques in the financial domain.

Karthik Mehta
https://blogs.edgentiq.com
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach out to him at: [email protected]
