TLDR: ConDiFi is a new benchmark evaluating Large Language Models (LLMs) in financial scenarios, focusing on both divergent (creative ideation) and convergent (optimal decision-making) thinking. It draws on source material dated after the models' training data to limit contamination, and it shows that fluent LLMs can still lack novelty or actionability in financial foresight. The benchmark highlights models like DeepSeek-R1 and Cohere Command R+ for strong performance in generating actionable insights, providing a more holistic view of LLM capabilities for finance.
Large Language Models (LLMs) are becoming increasingly powerful, but evaluating their true reasoning capabilities, especially in complex fields like finance, remains a significant challenge. Most existing benchmarks focus on factual accuracy or step-by-step logic. However, financial professionals need more than just accurate recall; they must also be able to generate creative, plausible future scenarios under uncertainty (divergent thinking) and then converge on optimal decisions (convergent thinking).
Introducing ConDiFi: A New Benchmark for Financial Reasoning
To address this gap, researchers have introduced ConDiFi, a novel benchmark specifically designed to evaluate both divergent and convergent thinking in LLMs for financial tasks. This benchmark offers a fresh perspective on assessing the reasoning abilities crucial for safely and strategically deploying LLMs in the financial sector.
ConDiFi comprises two main datasets: 607 macro-financial prompts for divergent reasoning and 990 multi-hop adversarial multiple-choice questions (MCQs) for convergent reasoning. A key aspect of ConDiFi is that its data sources are dated after May 2025, minimizing the risk of data contamination from LLMs’ pre-training data and ensuring a true test of domain-specific understanding and cognitive flexibility.
Divergent Thinking: Imagining Plausible Futures
Divergent thinking involves generating multiple, novel possibilities. In finance, this means devising innovative responses to market shifts or crafting risk mitigation strategies. ConDiFi evaluates LLMs on this by prompting them to speculate on how a given financial scenario might evolve, creating branching timelines. These timelines are assessed across five dimensions:
- Plausibility: Whether the timeline adheres to economic logic and historical precedent.
- Novelty: The originality of ideas and their interactions, especially second- and third-order effects.
- Elaboration: The level of detail in each step, including actors, timing, and specific figures.
- Actionability: Whether the timeline can inform real-world investment decisions, such as identifying sectors or specific tickers.
- Richness: An automated metric that measures the structural complexity of the generated timeline, like branching factor and path length, indicating breadth of imagination and depth of causal chains.
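To make the Richness dimension concrete, here is a minimal sketch of how branching factor and path length could be computed from a generated timeline. The `Node` class, the `richness_stats` function, and the example scenario are assumptions made for illustration; the article does not describe the benchmark's actual data format or formula.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One event in a generated timeline (hypothetical structure)."""
    event: str
    children: list["Node"] = field(default_factory=list)


def richness_stats(root: Node) -> dict:
    """Return simple structural statistics for a branching timeline."""
    internal = 0      # nodes with at least one child
    edges = 0         # total parent -> child links
    leaves = 0        # each leaf ends one root-to-leaf path
    max_depth = 0     # longest root-to-leaf path, counted in edges
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        if node.children:
            internal += 1
            edges += len(node.children)
            stack.extend((child, depth + 1) for child in node.children)
        else:
            leaves += 1
            max_depth = max(max_depth, depth)
    return {
        "avg_branching": edges / internal if internal else 0.0,
        "max_path_length": max_depth,
        "num_paths": leaves,
    }


# Made-up scenario, used only to exercise the function.
timeline = Node("Central bank raises rates by 50 bps", [
    Node("Bond yields rise", [
        Node("Bank margins widen"),
        Node("Mortgage demand falls"),
    ]),
    Node("Currency strengthens", [
        Node("Exporters' earnings fall"),
    ]),
])
stats = richness_stats(timeline)
print(stats)
```

A wider tree raises the average branching factor (breadth of imagination), while a deeper one raises the maximum path length (depth of the causal chain).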
The evaluation of 14 leading models revealed clear differences. Models like Cohere Command A and DeepSeek-R1 consistently performed well across all divergent thinking dimensions, showing strong capabilities in generating plausible, novel, elaborate, and actionable scenarios. In contrast, GPT-4o sometimes underperformed on Novelty and Actionability, suggesting limitations in speculative reasoning despite its fluency.
Convergent Thinking: Finding the Best Solution
Convergent thinking aims at identifying the best solution under constraints, which is central to logical deduction and precision in finance. For this, ConDiFi uses MCQs where models must select the correct sequence of events that satisfies criteria like factor alignment, temporal coherence, and logical entailment. The questions are made challenging through various adversarial pipelines, including swapping latent drivers, embedding mini-calculations, and creating highly confusable distractors.
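As a rough illustration of how such items might be scored, the sketch below represents one question and computes plain accuracy against an answer key. The `MCQ` fields, the `accuracy` function, and the sample items are hypothetical; the article does not specify the benchmark's schema or scoring code.

```python
from dataclasses import dataclass


@dataclass
class MCQ:
    """Hypothetical shape of one convergent-reasoning item."""
    scenario: str
    options: dict   # option label -> candidate sequence of events
    answer: str     # label of the correct option


def accuracy(items, predictions):
    """Fraction of items whose predicted label matches the answer key."""
    if not items:
        return 0.0
    correct = sum(1 for item, pred in zip(items, predictions)
                  if pred == item.answer)
    return correct / len(items)


# Made-up items for illustration only.
items = [
    MCQ("Oil supply shock", {"A": "...", "B": "...", "C": "..."}, "B"),
    MCQ("Surprise rate cut", {"A": "...", "B": "...", "C": "..."}, "C"),
]
predictions = ["B", "A"]   # labels a model might return
score = accuracy(items, predictions)
print(score)
```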
The benchmark’s difficulty increases significantly after refinement rounds, pushing modern LLMs to their limits. The top performers in convergent thinking were llama4_maverick, o1, and llama4_scout, with the two Llama 4 models standing out in particular. DeepSeek-R1 also ranked high, reinforcing its strength in reasoning-optimized tasks. Error analysis revealed common issues such as misinterpreting scenario nuances, incorrect prioritization of factors, overlooking critical details, and even a bias towards optimistic outcomes.
Understanding Model Behavior and Complementarity
Beyond just ranking models, the research delves into how different LLMs prioritize various aspects of reasoning. For instance, a strong correlation between Plausibility and Actionability suggests that realism is a prerequisite for actionable insights. However, a weak correlation between Richness and Elaboration indicates that models can elaborate fluently without necessarily producing structurally rich, branching futures.
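The kind of comparison described above can be sketched with a plain Pearson correlation between score dimensions. The scores below are invented for illustration and are not results from the paper; the choice of Pearson's coefficient is also an assumption, since the article does not name the correlation measure used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up per-response scores for illustration only; not results from the paper.
scores = {
    "plausibility":  [6, 7, 8, 5, 9],
    "actionability": [5, 7, 8, 4, 9],
    "elaboration":   [8, 8, 7, 9, 8],
    "richness":      [3, 9, 5, 6, 2],
}

r_plaus_action = pearson(scores["plausibility"], scores["actionability"])
r_rich_elab = pearson(scores["richness"], scores["elaboration"])
print(f"plausibility vs actionability: {r_plaus_action:.2f}")
print(f"richness vs elaboration: {r_rich_elab:.2f}")
```

In this toy data the first pair moves together closely while the second barely does, which mirrors the pattern the analysis reports.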
An inter-model distance analysis showed that DeepSeek-R1 stands out as a distinct outlier in its internal correlation structure, likely due to its unique training approach emphasizing direct reinforcement learning. Conversely, models within the Llama family showed consistent internal structures, reflecting shared architectural designs. Understanding these differences can help in selecting complementary models for ensemble methods, potentially leading to more robust financial analysis.
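One way to picture an inter-model distance analysis is to compare each model's matrix of correlations between scoring dimensions. The sketch below uses the Frobenius norm of the difference between two matrices, which is an assumption; the article does not state which distance the authors used. The model names and matrix values are made up.

```python
from itertools import combinations
from math import sqrt


def frobenius_distance(a, b):
    """Frobenius norm of the element-wise difference of two square matrices."""
    return sqrt(sum((x - y) ** 2
                    for row_a, row_b in zip(a, b)
                    for x, y in zip(row_a, row_b)))


# Illustrative 3x3 correlation matrices (made-up values, hypothetical models).
corr = {
    "model_a": [[1.0, 0.8, 0.2], [0.8, 1.0, 0.3], [0.2, 0.3, 1.0]],
    "model_b": [[1.0, 0.7, 0.3], [0.7, 1.0, 0.3], [0.3, 0.3, 1.0]],
    "model_c": [[1.0, 0.1, 0.9], [0.1, 1.0, -0.2], [0.9, -0.2, 1.0]],
}

# Pairwise distance between every two models.
distances = {
    (m, n): frobenius_distance(corr[m], corr[n])
    for m, n in combinations(corr, 2)
}

# Mean distance from each model to all the others; the largest flags an outlier.
mean_dist = {
    m: sum(d for pair, d in distances.items() if m in pair) / (len(corr) - 1)
    for m in corr
}
outlier = max(mean_dist, key=mean_dist.get)
print(outlier)
```

Models with small pairwise distances behave alike, so an ensemble would likely gain more from pairing one of them with the outlier than from pairing two near-duplicates.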
Looking Ahead
ConDiFi represents a significant step towards more cognitively grounded and domain-sensitive evaluation of LLMs. By focusing on reasoning styles essential for high-stakes decision-making in finance, it helps move LLMs beyond being mere fluent explainers to becoming strategic reasoners. Future work will involve releasing the full dataset, integrating human evaluations, and exploring how different prompting and decoding strategies impact model behavior. For more technical details, you can refer to the original research paper: Reasoning Beyond the Obvious: Evaluating Divergent and Convergent Thinking in LLMs for Financial Scenarios.


