Comparing Leading AI Models for Understanding Financial Reports

TLDR: A study evaluated five major Large Language Models (GPT, Claude, Perplexity, Gemini, DeepSeek) on their ability to analyze financial 10-K reports from top tech companies. Using human judgment, automated metrics, and behavioral analysis, the research found that GPT consistently provided the most coherent, accurate, and relevant answers. While other models had specific strengths (e.g., Gemini for lexical accuracy, DeepSeek for conciseness), GPT emerged as the most reliable for complex financial natural language processing tasks, highlighting the importance of multi-faceted evaluation for LLMs in high-stakes domains.

Large Language Models (LLMs) are rapidly changing how information is processed and understood across many industries, especially finance. Trained on vast amounts of text, these systems are becoming increasingly important for tasks like analyzing financial disclosures, gauging market sentiment, and summarizing complex data. However, systematic comparisons of how the leading LLMs perform on specific financial tasks have been scarce.

A recent study, titled ‘Evaluating Large Language Models (LLMs) in Financial NLP: A Comparative Study on Financial Report Analysis’, addresses this gap. Conducted by Md Talha Mohsin from the University of Tulsa, it evaluates five prominent LLMs: GPT, Claude, Perplexity, Gemini, and DeepSeek. The study focused on their ability to analyze 10-K filings, the annual reports that public companies submit to the U.S. Securities and Exchange Commission (SEC). These reports contain critical qualitative information about a company’s strategy, risks, and competitive position, making them well suited to advanced natural language processing.

How the Study Was Conducted

To evaluate the LLMs, the study used 10-K filings from the ‘Magnificent Seven’ technology companies (Apple, Microsoft, Amazon, Alphabet, Nvidia, Meta, and Tesla) over three recent fiscal years (2022, 2023, and 2024). A representative text sample was extracted from the ‘Item 1: Business’ section of each filing. A set of 10 open-ended, interpretive questions was designed to challenge the LLMs to extract, combine, deduce, and interpret financial information, simulating real-life analytical workflows. Each question was posed in a fresh, isolated chat session so that context from previous conversations could not influence the responses.

The evaluation employed a multi-faceted approach:

Human Annotation: Five human experts independently scored the LLM responses on five criteria: Relevance, Completeness, Clarity, Conciseness, and Factual Accuracy, using a 1-to-5 Likert scale.
Automated Metric-Based Evaluation: Quantitative measures like ROUGE (for word and phrase overlap), Jaccard Similarity (for word-level set overlap), and Cosine Similarity (for semantic closeness using Sentence-BERT) were used to compare model outputs against reference responses.
Model Behavior Diagnostics: This involved analyzing the consistency and generalizability of the models by looking at cosine similarity across different models and the variance in responses at the prompt level.
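
To make the automated metrics more concrete, here is a minimal sketch of how word-level overlap scores of this kind can be computed. It is an illustration under stated assumptions, not the study's code: the tokenizer, the function names, and the example sentences are all hypothetical, ROUGE is reduced to its unigram (ROUGE-1) F1 form, and the cosine similarity here runs over simple term-frequency vectors as a stand-in for the Sentence-BERT embeddings the study reports using.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(a: str, b: str) -> float:
    """Word-level set overlap: |A & B| / |A | B|."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 (a simplified ROUGE-1, without stemming)."""
    cand, ref = Counter(tokenize(candidate)), Counter(tokenize(reference))
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def cosine_tf(a: str, b: str) -> float:
    """Cosine similarity over term-frequency vectors.

    Assumption: a bag-of-words stand-in for sentence embeddings, so it
    measures lexical closeness, not semantic closeness.
    """
    vec_a, vec_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical reference answer and model output.
reference = "The company sells phones and services"
candidate = "The company sells phones"

print(jaccard(candidate, reference))
print(rouge1_f1(candidate, reference))
print(cosine_tf(candidate, reference))
```

Because all three functions in this sketch count shared words, a response that copies phrasing from the reference scores well on each of them. Swapping the term-frequency vectors for sentence embeddings is what would let the cosine score capture meaning instead of wording.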

Key Findings

The results offered clear insights into the strengths and weaknesses of each LLM in the financial domain:

Human Evaluation: GPT emerged as the top performer, consistently delivering the most relevant, complete, clear, and factually accurate answers. Claude followed closely, showing high factual reliability. Perplexity provided balanced results without major flaws. DeepSeek was noted for its conciseness but often sacrificed relevance and factual correctness. Gemini, despite its clarity, tended to be verbose and less consistent in factual accuracy.
Automated Metrics: Gemini surprisingly excelled in lexical fidelity metrics (ROUGE and Jaccard), indicating its strong ability to replicate exact words and phrases. However, this lexical precision didn’t always translate to semantic understanding or human-perceived quality. Claude and Perplexity showed better semantic coherence, aligning more closely with GPT’s balanced profile that combines semantic depth with sufficient lexical precision.
Behavioral Diagnostics: GPT and Claude demonstrated high semantic alignment with each other, suggesting similar interpretive frameworks. In contrast, Gemini and DeepSeek showed more variability in their outputs, indicating less consistency across different prompts and over time. The study also found that model consistency could vary depending on the company’s filings, with Microsoft’s reports leading to the most consistent LLM responses, while Amazon’s 2024 prompts showed the least agreement among models.
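
One possible reading of the behavioral diagnostics is sketched below: for a single prompt, compute the cosine similarity between every pair of model responses, then summarize agreement with the mean and variance of those scores. This is a hedged illustration only. The function names, the model labels, and the toy two-dimensional vectors are hypothetical placeholders for real response embeddings, and the study's exact procedure may differ.

```python
import math
from itertools import combinations
from statistics import mean, pvariance


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def pairwise_similarity(embeddings: dict[str, list[float]]) -> dict[tuple[str, str], float]:
    """Cosine similarity for every unordered pair of models on one prompt."""
    return {
        (a, b): cosine(embeddings[a], embeddings[b])
        for a, b in combinations(sorted(embeddings), 2)
    }


def prompt_agreement(embeddings: dict[str, list[float]]) -> tuple[float, float]:
    """Mean and population variance of the pairwise similarities.

    A high mean with low variance suggests the models broadly agree on
    this prompt; a low mean or high variance suggests they diverge.
    """
    sims = list(pairwise_similarity(embeddings).values())
    return mean(sims), pvariance(sims)


# Toy embeddings for one prompt (hypothetical values, not study data).
toy = {
    "model_a": [1.0, 0.0],
    "model_b": [1.0, 0.0],
    "model_c": [0.0, 1.0],
}

pairs = pairwise_similarity(toy)
avg, var = prompt_agreement(toy)
print(pairs)
print(avg, var)
```

Repeating this per prompt and grouping the results by company and fiscal year would give the kind of comparison described above, where some filings produce closer agreement among models than others.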

Implications for Financial Analysis

The study concludes that GPT is the most robust and reliable model for analyzing financial text, performing strongly across human judgment, automated metrics, and behavioral consistency. Gemini and Claude offer strengths for specific tasks (Gemini for exact phrase replication, Claude for factual validity), but they may lack GPT’s overall interpretive flexibility. DeepSeek and Perplexity, while each has its own strengths, were generally less suitable for high-stakes financial analysis because of trade-offs in depth or consistency.

This research underscores the critical importance of a multi-dimensional evaluation approach when selecting LLMs for sensitive domains like finance. It highlights that performance isn’t just about accuracy but also about the simplicity, adaptability, and consistency of a model’s reasoning. Understanding these nuances can help financial professionals and academics make informed decisions about integrating LLMs into their strategic analysis and information extraction workflows, moving towards a more transparent and responsible use of AI in finance.
