A New Framework for Evaluating Financial Information Retrieval in Banking

TL;DR: This research introduces a systematic methodology and an LLM-based query generation pipeline to create domain-specific information retrieval benchmarks for financial services. The pipeline generates single- and complex multi-document queries, incorporating a reasoning-augmented answerability assessment for high data quality. Using this, the KoBankIR dataset was built for Korean banking. Experiments show that existing retrieval models struggle with these complex queries, especially multi-document and comparative types, highlighting the need for advanced retrieval techniques in the financial sector.

In the rapidly evolving landscape of AI-driven financial services, the ability of large language models (LLMs) to accurately retrieve information is paramount. However, a significant challenge lies in the absence of suitable benchmarks that truly reflect the complex, domain-specific information needs of real-world banking scenarios. Traditional benchmarks often fall short, focusing on structured reports or lacking the multi-document and multi-hop queries common in customer inquiries. Furthermore, the cost and legal restrictions associated with using real customer data make building such benchmarks incredibly difficult.

To address these critical limitations, researchers from Kakaobank—Hyunkyu Kim, Yeeun Yoo, and Youngjun Kwak—have introduced a groundbreaking systematic methodology for constructing domain-specific information retrieval (IR) benchmarks. Their work, detailed in the paper “Query Generation Pipeline with Enhanced Answerability Assessment for Financial Information Retrieval”, proposes an innovative LLM-based query generation pipeline.

The Query Generation Pipeline

The core of this research is a sophisticated pipeline designed to create realistic and challenging queries. It comprises three main steps:

1. Single-document Query Generation: Initially, the pipeline generates queries for individual passages within banking documents using a powerful LLM like GPT-4o and domain-specific prompts. These queries are then filtered to ensure they are answerable based on their respective passages.

2. Multi-document Query Generation: This is where the pipeline truly shines, mimicking how users seek information across multiple sources. Based on an intensive review of actual customer inquiries, three types of multi-document queries are generated:

  • Topic-based Merging: Combines two or three single-document queries related to the same financial product into a single, cohesive question.
  • Context Deepening: Samples multiple query-passage pairs from the same document, allowing the LLM to generate questions that require deeper reasoning across related information.
  • Comparing and Contrasting: Identifies comparable passages across different products within a financial category to create queries that highlight similarities and differences, such as comparing prepayment penalties for different loan products.

3. Enhanced Answerability Assessment: A crucial component of the pipeline is its reasoning-augmented evaluator. Built on models like DeepSeek-R1-Distill-Qwen, this evaluator guides the model through explicit “Think” steps, significantly improving alignment with human judgments compared to previous automatic scoring methods. This ensures the high quality and reliability of the generated dataset, with a minimal false-positive rate.
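The three steps above can be sketched as a minimal pipeline. This is an illustrative assumption, not the paper's actual implementation: the prompts, the `call_llm` stub (standing in for GPT-4o on the generation side and DeepSeek-R1-Distill-Qwen with explicit "Think" steps on the evaluation side), and the toy answerability parsing are all hypothetical.

```python
# Hedged sketch of the query generation pipeline: single-document generation,
# topic-based merging, and an answerability filter. All prompts and the
# call_llm stub are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class QueryExample:
    query: str
    passages: list = field(default_factory=list)  # supporting passage ids

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (e.g., GPT-4o via an API client)."""
    return "Generated text for: " + prompt[:40]

def generate_single_doc_queries(passages: dict) -> list:
    """Step 1: one query per passage, using a domain-specific prompt."""
    queries = []
    for pid, text in passages.items():
        q = call_llm(f"Write a banking customer question answerable from: {text}")
        queries.append(QueryExample(query=q, passages=[pid]))
    return queries

def merge_by_topic(queries: list, product_of: dict) -> list:
    """Step 2 (topic-based merging): combine 2-3 queries about one product."""
    by_product = {}
    for q in queries:
        by_product.setdefault(product_of[q.passages[0]], []).append(q)
    merged = []
    for product, qs in by_product.items():
        if len(qs) >= 2:
            combined = call_llm(
                "Merge into one cohesive question: "
                + " | ".join(q.query for q in qs[:3])
            )
            pids = [p for q in qs[:3] for p in q.passages]
            merged.append(QueryExample(query=combined, passages=pids))
    return merged

def is_answerable(example: QueryExample, passages: dict) -> bool:
    """Step 3: answerability check (heavily simplified). The real evaluator
    walks through explicit 'Think' reasoning steps before judging."""
    verdict = call_llm(
        "Think step by step, then answer yes/no: can the question '"
        + example.query + "' be answered from: "
        + " ".join(passages[p] for p in example.passages)
    )
    return "no" not in verdict.lower().split()  # toy parsing of the verdict

passages = {
    "p1": "Prepayment penalty is 1.5% of the repaid principal ...",
    "p2": "Early repayment fee is waived after 3 years ...",
}
product_of = {"p1": "home-loan", "p2": "home-loan"}
single = generate_single_doc_queries(passages)
multi = merge_by_topic(single, product_of)
dataset = [q for q in single + multi if is_answerable(q, passages)]
```

Context-deepening and comparing-and-contrasting queries would follow the same pattern, differing only in how the query-passage pairs are sampled before the merge prompt.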

Introducing KoBankIR

As a concrete implementation of this methodology, the team constructed KoBankIR, the first Korean-language benchmark specifically designed for banking-domain information retrieval. KoBankIR consists of 815 high-quality queries derived from 204 official banking product disclosures. Unlike existing financial IR datasets, KoBankIR explicitly incorporates complex multi-document queries, reflecting real-world banking interactions where customers often need to synthesize information from various sources.


Experimental Insights and Future Directions

Experiments conducted on KoBankIR using various multilingual retrieval models revealed significant findings:

  • Existing retrieval models, including sparse, dense, and multi-vector approaches, struggle considerably with the domain-specific and complex multi-document queries in KoBankIR.
  • Hybrid retrieval strategies, which combine sparse and dense representations (e.g., BGE-M3 Sparse + Dense), generally yield the best performance, balancing lexical matching and semantic understanding. However, even these top-performing models show modest overall results, indicating substantial room for improvement.
  • Performance degrades as the number of supporting documents for a query increases, underscoring the inherent difficulty of multi-document retrieval.
  • Queries requiring comparative reasoning (e.g., “Comparing and Contrasting” types) pose a particular challenge for current retrieval models, showing a notable performance drop.
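To make the hybrid-retrieval finding above concrete, here is a hedged sketch of sparse-plus-dense score fusion: a weighted sum of a lexical overlap score and a dense-embedding cosine similarity, in the spirit of BGE-M3 Sparse + Dense. The 0.4/0.6 weights, the toy overlap function, and the example vectors are illustrative assumptions, not values from the paper.

```python
# Hybrid retrieval sketch: combine a sparse (lexical) score with a dense
# (embedding) similarity. Weights and toy data are assumptions.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def sparse_score(query_terms, doc_terms):
    """Toy lexical overlap, standing in for a BM25 / learned-sparse score."""
    return len(set(query_terms) & set(doc_terms)) / len(set(query_terms))

def hybrid_score(q_terms, d_terms, q_vec, d_vec, w_sparse=0.4, w_dense=0.6):
    """Weighted fusion of the two signals."""
    return w_sparse * sparse_score(q_terms, d_terms) + w_dense * cosine(q_vec, d_vec)

# Rank two candidate passages for one query about prepayment penalties.
q_terms = ["prepayment", "penalty", "loan"]
q_vec = [0.2, 0.8, 0.1]
docs = {
    "d1": (["prepayment", "penalty", "fee"], [0.3, 0.7, 0.2]),
    "d2": (["interest", "rate", "loan"], [0.9, 0.1, 0.4]),
}
ranked = sorted(
    docs,
    key=lambda d: hybrid_score(q_terms, docs[d][0], q_vec, docs[d][1]),
    reverse=True,
)
```

The fusion lets lexical matching catch exact financial terms (product names, fee labels) while the dense score captures paraphrases, which is the balance the experiments credit for the hybrid approaches' edge.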

This research not only provides a systematic approach for building high-quality, domain-specific IR benchmarks but also highlights the limitations of current retrieval models in handling the complexities of real-world financial information. The KoBankIR dataset serves as a vital tool for future research, pushing the boundaries for more effective retrieval techniques in the financial domain.

Karthik Mehta
https://blogs.edgentiq.com
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach out to him at: [email protected]
