TLDR: A study evaluated five major Large Language Models (GPT, Claude, Perplexity, Gemini, DeepSeek) on their ability to analyze financial 10-K reports from top tech companies. Using human judgment, automated metrics, and behavioral analysis, the research found that GPT consistently provided the most coherent, accurate, and relevant answers. While other models had specific strengths (e.g., Gemini for lexical accuracy, DeepSeek for conciseness), GPT emerged as the most reliable for complex financial natural language processing tasks, highlighting the importance of multi-faceted evaluation for LLMs in high-stakes domains.
Large Language Models (LLMs) are rapidly changing how we process and understand information across many industries, especially in finance. These advanced AI systems, trained on vast amounts of text, are becoming increasingly vital for tasks like analyzing financial disclosures, understanding market sentiment, and summarizing complex data. However, a clear and systematic comparison of how different leading LLMs perform in specific financial tasks has been largely unexplored until now.
A recent study, titled "Evaluating Large Language Models (LLMs) in Financial NLP: A Comparative Study on Financial Report Analysis," addresses this gap. Conducted by Md Talha Mohsin from the University of Tulsa, this research provides a thorough evaluation of five prominent LLMs: GPT, Claude, Perplexity, Gemini, and DeepSeek. The study focused on their ability to analyze 10-K filings, which are annual reports public companies submit to the U.S. Securities and Exchange Commission (SEC). These reports contain critical qualitative information about a company’s strategy, risks, and competitive position, making them ideal for advanced natural language processing.
How the Study Was Conducted
To evaluate the LLMs, the study used 10-K filings from the ‘Magnificent Seven’ technology companies (Apple, Microsoft, Amazon, Alphabet, Nvidia, Meta, and Tesla) over three recent fiscal years (2022, 2023, and 2024). A representative text sample was extracted from the ‘Item 1: Business’ section of each filing. A set of 10 open-ended, interpretive questions was designed to challenge the LLMs to extract, combine, deduce, and interpret financial information, simulating real-life analytical workflows. Each question was posed in a fresh, isolated chat session so that context from previous conversations could not influence the responses.
The evaluation employed a multi-faceted approach:
- Human Annotation: Five human experts independently scored the LLM responses on five criteria: Relevance, Completeness, Clarity, Conciseness, and Factual Accuracy, using a 1-to-5 Likert scale.
- Automated Metric-Based Evaluation: Quantitative measures like ROUGE (for word and phrase overlap), Jaccard Similarity (for word-level set overlap), and Cosine Similarity (for semantic closeness using Sentence-BERT) were used to compare model outputs against reference responses.
- Model Behavior Diagnostics: This involved analyzing the consistency and generalizability of the models by looking at cosine similarity across different models and the variance in responses at the prompt level.
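To make the automated metrics more concrete, here is a minimal Python sketch of the two lexical measures, plus a simplified cosine similarity. It is an illustration under stated assumptions, not the study's actual code: the tokenizer, the function names, and the example sentences are all hypothetical, and the cosine function works on raw word counts as a stand-in for the Sentence-BERT embeddings the study reports using.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(candidate: str, reference: str) -> float:
    """Word-level set overlap: size of intersection over size of union."""
    a, b = set(tokenize(candidate)), set(tokenize(reference))
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)


def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: unigram overlap with clipped counts."""
    c, r = Counter(tokenize(candidate)), Counter(tokenize(reference))
    overlap = sum((c & r).values())  # Counter & keeps the minimum count per token
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def cosine_bow(candidate: str, reference: str) -> float:
    """Cosine similarity over word-count vectors.

    Assumption: a bag-of-words stand-in. The study describes using
    Sentence-BERT embeddings, which capture meaning rather than word counts.
    """
    c, r = Counter(tokenize(candidate)), Counter(tokenize(reference))
    dot = sum(c[t] * r[t] for t in c)
    norm = math.sqrt(sum(v * v for v in c.values())) * math.sqrt(
        sum(v * v for v in r.values())
    )
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Made-up sentences for illustration only; not taken from any filing.
    model_answer = "Apple designs phones and tablets"
    reference = "Apple designs phones and computers"
    print(round(jaccard(model_answer, reference), 3))
    print(round(rouge1_f1(model_answer, reference), 3))
    print(round(cosine_bow(model_answer, reference), 3))
```

Because Jaccard and ROUGE only reward shared words, a response can score well on them while paraphrasing poorly, or score badly while being semantically correct. That is why the study pairs them with an embedding-based similarity.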
Key Findings
The results offered clear insights into the strengths and weaknesses of each LLM in the financial domain:
- Human Evaluation: GPT emerged as the top performer, consistently delivering the most relevant, complete, clear, and factually accurate answers. Claude followed closely, showing high factual reliability. Perplexity provided balanced results without major flaws. DeepSeek was noted for its conciseness but often sacrificed relevance and factual correctness. Gemini, despite its clarity, tended to be verbose and less consistent in factual accuracy.
- Automated Metrics: Gemini surprisingly excelled in lexical fidelity metrics (ROUGE and Jaccard), indicating its strong ability to replicate exact words and phrases. However, this lexical precision didn’t always translate to semantic understanding or human-perceived quality. Claude and Perplexity showed better semantic coherence, aligning more closely with GPT’s balanced profile that combines semantic depth with sufficient lexical precision.
- Behavioral Diagnostics: GPT and Claude demonstrated high semantic alignment with each other, suggesting similar interpretive frameworks. In contrast, Gemini and DeepSeek showed more variability in their outputs, indicating less consistency across different prompts and over time. The study also found that model consistency could vary depending on the company’s filings, with Microsoft’s reports leading to the most consistent LLM responses, while Amazon’s 2024 prompts showed the least agreement among models.
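The behavioral diagnostics can be pictured as two steps: measure how similar the models' answers are to each other for a single prompt, then look at how much that agreement varies across prompts. The sketch below shows one way this could work. It is an assumption-laden illustration rather than the study's method: `similarity` uses Jaccard overlap as a simple stand-in for embedding-based cosine similarity, and the function names, model labels, and responses are hypothetical.

```python
import re
from itertools import combinations
from statistics import mean, pvariance


def similarity(a: str, b: str) -> float:
    """Jaccard word overlap, used here as a stand-in for embedding cosine."""
    sa = set(re.findall(r"[a-z0-9]+", a.lower()))
    sb = set(re.findall(r"[a-z0-9]+", b.lower()))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def prompt_agreement(responses: dict[str, str]) -> float:
    """Mean pairwise similarity between all model responses to one prompt."""
    if len(responses) < 2:
        raise ValueError("need responses from at least two models")
    pairs = combinations(responses.values(), 2)
    return mean(similarity(x, y) for x, y in pairs)


def agreement_profile(by_prompt: dict[str, dict[str, str]]) -> tuple[float, float]:
    """Return (mean agreement, variance of agreement) across prompts.

    A low mean suggests the models interpret the filings differently;
    a high variance suggests agreement depends heavily on the prompt.
    """
    scores = [prompt_agreement(r) for r in by_prompt.values()]
    return mean(scores), pvariance(scores)


if __name__ == "__main__":
    # Toy data for illustration only; not actual model outputs.
    toy = {
        "q1": {"model_a": "cloud revenue grew", "model_b": "cloud revenue grew"},
        "q2": {"model_a": "alpha beta", "model_b": "gamma delta"},
    }
    print(agreement_profile(toy))
```

Grouping the same scores by company and fiscal year instead of by prompt would give the kind of per-filing comparison described above, where some companies' reports produce more consistent answers than others.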
Implications for Financial Analysis
The study concludes that GPT is the most robust and reliable model for analyzing financial text, performing strongly across human judgment, automated metrics, and behavioral consistency. Gemini and Claude offer strengths for specific tasks (Gemini for exact phrase replication, Claude for factual validity), but they may lack the overall interpretive flexibility of GPT. DeepSeek and Perplexity, despite their respective strengths in conciseness and balanced output, were generally less suitable for high-stakes financial analysis due to trade-offs in depth or consistency.
This research underscores the critical importance of a multi-dimensional evaluation approach when selecting LLMs for sensitive domains like finance. It highlights that performance isn’t just about accuracy but also about the simplicity, adaptability, and consistency of a model’s reasoning. Understanding these nuances can help financial professionals and academics make informed decisions about integrating LLMs into their strategic analysis and information extraction workflows, moving towards a more transparent and responsible use of AI in finance.


