Evaluating LLMs in Live Financial Markets: Introducing LiveTradeBench

TLDR: LiveTradeBench is a new platform for evaluating large language models (LLMs) in real-time stock and prediction markets. It uses live data, multi-asset portfolio management, and multi-market evaluation. Findings show that general LLM scores don’t predict trading success, models have distinct trading styles, and some LLMs effectively use live signals. The research highlights a gap between static benchmarks and real-world financial competence, advocating for more dynamic evaluation methods.

Large language models (LLMs) have shown impressive capabilities on a range of benchmarks, from answering knowledge quizzes to solving complex math problems and acting as web agents. However, these evaluations usually take place in static, controlled environments that lack the dynamics and uncertainty of live situations. As a result, they mainly test isolated reasoning or problem-solving skills rather than the ability to make decisions under unpredictable, evolving conditions.

To bridge this gap, researchers Haofei Yu, Fenghai Li, and Jiaxuan You from the University of Illinois Urbana-Champaign have introduced LiveTradeBench, a live trading environment designed to evaluate LLM agents in realistic, constantly changing financial markets. The platform is built on three core design principles.

Key Design Principles of LiveTradeBench

First, it incorporates live data streaming of market prices and news. This eliminates the reliance on historical backtesting, which can suffer from information leakage and fail to capture real-time uncertainty. By using live data, LiveTradeBench ensures that LLM agents are making decisions based on the most current information, just like human traders.

Second, the platform uses a portfolio-management abstraction. Instead of focusing on simple buy/sell/hold actions for a single asset, LiveTradeBench extends control to multi-asset allocation. This means agents must consider risk management and how different assets interact with each other, reflecting the complexities of real-world investment portfolios.

Third, LiveTradeBench offers multi-market evaluation. It assesses LLM agents across two structurally distinct environments: U.S. stock markets and Polymarket prediction markets. These markets differ significantly in terms of volatility, liquidity, and how information flows, providing a comprehensive test of an agent’s adaptability and generalization across diverse financial landscapes.

At each step, an LLM agent observes current prices, relevant news, and its own portfolio status. It then outputs a percentage allocation across its assets, aiming to balance risk against potential return. The researchers ran 50-day live evaluations of 21 LLMs from various model families.
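The observe-then-allocate loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual code: the `Observation` fields, the `"CASH"` key, the asset names `AAA` and `BBB`, and the assumption of no fees or slippage are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """What the agent sees at one step (field names are illustrative)."""
    prices: dict[str, float]    # current price per asset
    news: list[str]             # recent headlines
    holdings: dict[str, float]  # units held per asset
    cash: float


def normalize_allocation(raw: dict[str, float]) -> dict[str, float]:
    """Clip negative weights to zero and rescale so the weights sum to 1."""
    clipped = {k: max(v, 0.0) for k, v in raw.items()}
    total = sum(clipped.values())
    if total == 0:
        return {"CASH": 1.0}  # nothing usable: fall back to all cash
    return {k: v / total for k, v in clipped.items()}


def portfolio_value(obs: Observation) -> float:
    """Cash plus the market value of every holding."""
    return obs.cash + sum(units * obs.prices[a] for a, units in obs.holdings.items())


def rebalance(obs: Observation, weights: dict[str, float]) -> Observation:
    """Turn target weights into holdings at current prices (no fees or slippage)."""
    # Ignore weights for assets that are not tradable in this observation.
    known = {k: v for k, v in weights.items() if k in obs.prices or k == "CASH"}
    w = normalize_allocation(known)
    value = portfolio_value(obs)
    holdings = {a: w.get(a, 0.0) * value / p for a, p in obs.prices.items()}
    return Observation(obs.prices, obs.news, holdings, w.get("CASH", 0.0) * value)


# Hypothetical step: the agent asks for 50% AAA, 30% BBB, 20% cash.
obs = Observation(
    prices={"AAA": 100.0, "BBB": 200.0},
    news=[],
    holdings={"AAA": 0.0, "BBB": 0.0},
    cash=1000.0,
)
new_obs = rebalance(obs, {"AAA": 50, "BBB": 30, "CASH": 20})
```

Because the weights are normalized before use, the agent can answer in percentages or fractions, and rebalancing leaves total portfolio value unchanged at the moment of the trade.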

Surprising Findings from Live Evaluations

The evaluations revealed several important insights. Firstly, high scores on general LLM benchmarks such as LMArena did not necessarily translate into superior trading outcomes. This suggests that general reasoning ability doesn’t automatically imply competence in dynamic, real-world financial decision-making.

Secondly, the models displayed distinct portfolio management styles. Some LLMs adopted conservative strategies with lower volatility and smaller drawdowns, prioritizing stability. Others exhibited more risk-seeking behaviors, accepting higher volatility in pursuit of greater returns. These styles were consistent across both stock and prediction markets, indicating inherent preferences in the models’ decision-making.
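Volatility and drawdown, the two measures used above to tell conservative styles from risk-seeking ones, are easy to compute from a series of portfolio values. The sketch below uses common textbook definitions (sample standard deviation of per-step returns, and largest peak-to-trough decline); the article does not say which exact formulas the researchers used, so treat these as assumptions.

```python
import math


def step_returns(values: list[float]) -> list[float]:
    """Simple return between consecutive portfolio values."""
    return [values[i] / values[i - 1] - 1.0 for i in range(1, len(values))]


def volatility(values: list[float]) -> float:
    """Sample standard deviation of per-step returns (not annualized)."""
    r = step_returns(values)
    if len(r) < 2:
        return 0.0
    mean = sum(r) / len(r)
    return math.sqrt(sum((x - mean) ** 2 for x in r) / (len(r) - 1))


def max_drawdown(values: list[float]) -> float:
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak = values[0]
    worst = 0.0
    for v in values:
        peak = max(peak, v)
        worst = max(worst, (peak - v) / peak)
    return worst


# Hypothetical value path: rises to 120, falls to 90, recovers to 110.
drawdown = max_drawdown([100.0, 120.0, 90.0, 110.0])  # (120 - 90) / 120 = 0.25
```

A conservative agent in the sense described above would show lower values on both measures than a risk-seeking one over the same period.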

Thirdly, some LLMs demonstrated an effective ability to leverage live market and news signals to adapt their trading decisions. This highlights their potential to process and react to real-time information, a critical skill for successful trading.

The study also found that trading performance in one market (e.g., U.S. stocks) did not necessarily generalize to another (e.g., Polymarket), emphasizing the need for market-specific strategies. Prediction markets, with their faster dynamics and higher volatility, demanded more agile and risk-tolerant approaches compared to the more stable stock market.

How LLM Agents Reason and Decide

To test whether LLM agents were simply guessing at random, the researchers developed a “rolling-k delta” analysis. It showed that delaying an agent’s actions systematically harmed performance, which confirms that the strategies depend on contemporaneous market signals. More frequent rebalancing generally improved performance, especially in the fast-moving Polymarket environment.
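The article does not give the formula behind the “rolling-k delta”, so the following is only a sketch of the underlying idea, under stated assumptions: apply the allocation decided at step t at step t + k instead, and compare the resulting cumulative return with the undelayed run over the same window. The function names, the data layout, and the zero-cost rebalancing are all hypothetical.

```python
def cumulative_return(weights, asset_returns, delay=0, start=0):
    """Compound growth when weights chosen at step t are applied at step t + delay.

    weights[t][i]:       fraction of the portfolio in asset i decided at step t
    asset_returns[t][i]: simple return of asset i over step t
    start:               first step included in the evaluation window
    """
    growth = 1.0
    for t in range(max(start, delay), len(asset_returns)):
        applied = weights[t - delay]
        growth *= 1.0 + sum(w * r for w, r in zip(applied, asset_returns[t]))
    return growth - 1.0


def delay_delta(weights, asset_returns, k):
    """Delayed minus undelayed return on the same window; negative means delay hurt."""
    delayed = cumulative_return(weights, asset_returns, delay=k, start=k)
    timely = cumulative_return(weights, asset_returns, delay=0, start=k)
    return delayed - timely


# Hypothetical two-asset market in which the winning asset alternates each step,
# and an agent whose weights always pick the winner of the current step.
asset_returns = [[0.10, 0.00], [0.00, 0.10], [0.10, 0.00]]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
delta = delay_delta(weights, asset_returns, k=1)
```

In this toy case the delayed agent always holds the asset that just stopped rising, so the delta is negative. An agent allocating at random would show no systematic sign, which is the contrast the analysis relies on.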

An analysis of the agents’ reasoning processes revealed that news was the most frequently cited factor in their explanations, followed by market price history. Position information was less dominant. Interestingly, Polymarket agents relied more heavily on news, while stock market agents emphasized price trends, validating the hypothesis that these markets have distinct dynamics. Many decisions integrated multiple information sources, indicating complex reasoning.

Real-World Examples

Case studies further illustrated these points. In the U.S. stock market, agents collectively reduced their cash holdings during a tech stock rally, consistent with a more aggressive stance. Conversely, during a market drawdown, they increased cash positions to limit risk. For instance, Gemini-2.5-Pro, which maintained a high cash position, experienced the smallest loss during a downturn.

On Polymarket, in a “Russia × Ukraine ceasefire in 2025?” market, agents sometimes reacted to optimistic news that produced no actual market movement, leading to unprofitable trades. This exposed a difficulty in distinguishing attention-grabbing but non-decisive news from genuinely influential events. On another occasion, however, when significant diplomatic news broke, agents held their “Yes” positions and profited as the market price steadily increased.

LiveTradeBench represents a significant step forward in evaluating LLM agents in dynamic, uncertain, real-world trading environments. It exposes a crucial gap between static evaluations and real-world competence, paving the way for more adaptive, financially grounded, and socially intelligent agent systems. For more details, see the full research paper.
