AI Models Battle in Mortal Kombat II: A New Benchmark for Multimodal Intelligence

TLDR: Researchers have developed “LM Fight Arena,” a novel benchmark that evaluates large multimodal models (LMMs) by having them compete in the classic fighting game Mortal Kombat II. This framework assesses LMMs’ real-time visual understanding and sequential decision-making in an adversarial environment. In a round-robin tournament, Claude 3.5 Sonnet emerged as the undefeated champion, demonstrating superior perception-action coupling and strategic reasoning. Gemini 2.5 Pro secured second place, while models like GPT-4o struggled significantly, highlighting the challenge of dynamic decision-making for current LMMs. The benchmark emphasizes the need for evaluation methods that go beyond static tasks to truly test AI capabilities in interactive scenarios.

Large multimodal models (LMMs) integrate visual perception with language understanding to tackle complex tasks. However, traditional benchmarks often fall short in evaluating these models in dynamic, real-time, adversarial environments, which matter for real-world applications.

Introducing LM Fight Arena

To address this gap, researchers from Shanghai Jiao Tong University and Shanghai AI Lab have introduced LM Fight Arena, a benchmark that evaluates LMMs by pitting them against each other in the classic fighting game Mortal Kombat II. Playing the game demands rapid visual understanding, tactical reasoning, and sequential decision-making, which makes it a rigorous testbed for these capabilities.

Unlike static evaluations, LM Fight Arena offers a fully automated, reproducible, and objective assessment of an LMM’s strategic reasoning in a dynamic setting. The choice of Mortal Kombat II is deliberate; its structured mechanics, clear health bars, distinct character animations, and an 8-button action space provide a rich yet interpretable environment for AI evaluation.

How the Tournament Works

The evaluation is tightly controlled. All competing agents play the same character, Liu Kang, which removes character-specific advantages and keeps the comparison fair. Each model receives visual frames from the game emulator in real time, sampled at every fourth frame so that the input covers approximately one second of context. These frames are augmented with structured game state information: health bars, character coordinates, facing direction, and a history of the last five actions. The structured information is rendered as a natural-language state description, so every LMM receives the same mix of visual and symbolic cues.
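The observation pipeline described above can be sketched roughly as follows. This is only an illustration: the class names, field names, health scale, and the wording of the description are assumptions, not the researchers' actual format.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class FighterState:
    health: int   # remaining health; the scale is an assumption
    x: int        # horizontal screen coordinate
    y: int        # vertical screen coordinate
    facing: str   # "left" or "right"


@dataclass
class GameState:
    player: FighterState
    opponent: FighterState
    recent_actions: List[str] = field(default_factory=list)


def sample_frames(frames: Sequence, stride: int = 4) -> list:
    """Keep every `stride`-th frame, starting with the first one."""
    return list(frames[::stride])


def describe_state(state: GameState, history_len: int = 5) -> str:
    """Render the symbolic game state as a short natural-language description."""
    last = state.recent_actions[-history_len:]
    history = ", ".join(last) if last else "none"
    p, o = state.player, state.opponent
    return (
        f"Your health: {p.health}. Opponent health: {o.health}. "
        f"You are at ({p.x}, {p.y}) facing {p.facing}. "
        f"Opponent is at ({o.x}, {o.y}) facing {o.facing}. "
        f"Your last {len(last)} actions: {history}."
    )
```

In a setup like this, the sampled frames and the text returned by `describe_state` would be sent to the model together as one prompt.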

Models then output their next actions as natural language commands (e.g., “Left + A” or “Down, Forward, A”), which are parsed and translated into Sega Genesis button presses by a dedicated module. This entire control loop operates for each frame, demanding immediate processing and decision-making.
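A command parser of this kind might look like the sketch below. The button names, the treatment of "+" as simultaneous presses and "," as successive steps, and the resolution of "Forward" from the facing direction are all assumptions made for illustration; the article does not describe the actual parsing module in that detail.

```python
from typing import FrozenSet, List

# Hypothetical button set: the article mentions an 8-button action space
# but does not list the buttons, so these names are an assumption.
BUTTONS = {"UP", "DOWN", "LEFT", "RIGHT", "A", "B", "C", "START"}


def resolve(token: str, facing: str) -> str:
    """Map one token to a button name, resolving Forward/Back by facing."""
    name = token.strip().upper()
    if name == "FORWARD":
        name = "RIGHT" if facing == "right" else "LEFT"
    elif name == "BACK":
        name = "LEFT" if facing == "right" else "RIGHT"
    if name not in BUTTONS:
        raise ValueError(f"unknown button: {token!r}")
    return name


def parse_command(command: str, facing: str = "right") -> List[FrozenSet[str]]:
    """Parse a command into a sequence of simultaneous button sets.

    Assumed grammar: "," separates successive steps, and "+" joins
    buttons that are pressed together within one step.
    """
    steps = []
    for step in command.split(","):
        if not step.strip():
            continue  # ignore empty steps such as a trailing comma
        steps.append(frozenset(resolve(t, facing) for t in step.split("+")))
    return steps
```

Rejecting unknown tokens with an error, rather than silently dropping them, is one possible design choice; a real harness might instead fall back to a no-op action so that a malformed reply does not stop the match.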

The Competitors and Results

Six leading LMMs were evaluated in a round-robin tournament: three open-source models (InternVL3-78B-Instruct, Qwen2.5-VL-32B-Instruct, and Qwen2.5-VL-72B-Instruct) and three closed-source models (Claude-3.5-Sonnet, Gemini-2.5-Pro, and GPT-4o). Each model was given an identical system prompt outlining the game’s objective, observation format, and available actions.

The tournament results revealed a clear hierarchy of performance. Claude 3.5 Sonnet won the tournament, completing the round-robin undefeated with a 100% win rate and consistently large health margins against its opponents. Gemini 2.5 Pro secured second place with an 80% win rate: decisive victories over all three open-source models, a win over GPT-4o, and a narrow loss to Claude.

The Qwen family models occupied the middle ground, with Qwen2.5-VL-72B achieving a 60% win rate, often exploiting GPT-4o’s defensive weaknesses. Qwen2.5-VL-32B finished fourth at 40%. InternVL3-78B struggled offensively, ending with a 20% win rate. Notably, GPT-4o failed to secure a single win, despite its strong performance on static tasks, highlighting a significant gap in its ability to perform in dynamic, real-time environments.
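The reported win rates can be checked against the round-robin structure with a few lines of arithmetic. The sketch below assumes a single match per pair, as the article notes in its discussion of limitations; the variable names are illustrative only.

```python
from itertools import combinations

# Win rates as reported in the article.
REPORTED_WIN_RATE = {
    "Claude-3.5-Sonnet": 1.0,
    "Gemini-2.5-Pro": 0.8,
    "Qwen2.5-VL-72B-Instruct": 0.6,
    "Qwen2.5-VL-32B-Instruct": 0.4,
    "InternVL3-78B-Instruct": 0.2,
    "GPT-4o": 0.0,
}

models = list(REPORTED_WIN_RATE)

# Every unordered pair of models meets once.
schedule = list(combinations(models, 2))
matches_per_model = len(models) - 1

# Convert each win rate back into a win count.
implied_wins = {
    name: round(rate * matches_per_model)
    for name, rate in REPORTED_WIN_RATE.items()
}

# Each match produces exactly one winner, so the win counts
# must add up to the number of matches played.
assert sum(implied_wins.values()) == len(schedule)
```

With six models there are 15 pairings and five matches per model, so the reported rates correspond to 5, 4, 3, 2, 1, and 0 wins, which sum to 15 as expected.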

Insights into Multimodal Reasoning

The study found that successful LMMs, like Claude and Gemini, excelled at precise visual parsing combined with rapid temporal reasoning. They consistently tracked opponent states, adjusted button sequences based on adversary actions, and coordinated movements like dashes and blocks, suggesting an ability to reason over short action histories to predict reversals. In contrast, GPT-4o’s zero-win record pointed to a policy that over-indexed on safe, passive responses, leading to exploitable delays.

This benchmark underscores that high linguistic competence, as seen in models like GPT-4o, does not automatically translate to effective closed-loop decision-making in dynamic, adversarial settings. Claude’s dominance, by contrast, is consistent with its reported fast tool-use abilities, which suggests that latency-aware training may help in game-playing scenarios.

Future Directions and Implications

While the LM Fight Arena provides valuable insights, the researchers acknowledge limitations, such as conducting only a single match per pair and evaluating models in a zero-shot setting without game-specific fine-tuning. Future work aims to enhance statistical robustness by expanding to multi-match series and exploring other fighting game franchises and character archetypes to test strategy transferability.
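To see why more matches would help, one can put a confidence interval around an observed win rate. The sketch below uses the Wilson score interval, which is a standard choice for small samples but is not something the article attributes to the researchers; the function name and the 95% level (z = 1.96) are assumptions.

```python
import math
from typing import Tuple


def wilson_interval(wins: int, matches: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a win rate observed over `matches` games."""
    if matches <= 0:
        raise ValueError("matches must be positive")
    p = wins / matches
    denom = 1 + z * z / matches
    center = (p + z * z / (2 * matches)) / denom
    half = z * math.sqrt(
        p * (1 - p) / matches + z * z / (4 * matches * matches)
    ) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, center - half), min(1.0, center + half)
```

Under these assumptions, a perfect 5-0 record gives a lower bound of roughly 0.57 on the underlying win rate, so even an undefeated run over five matches leaves considerable uncertainty. A longer series would narrow the interval.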

Despite these constraints, the benchmark has significant implications. Fighting games offer a controlled yet dynamic environment for stress-testing AI decision quality, making LM Fight Arena a potential candidate for standardizing evaluations of embodied AI assistants. The techniques developed for synchronizing multimodal observations with language-based action plans could also transfer to fields like robotics, teleoperation, and human-AI teaming.

LM Fight Arena represents a crucial step forward in evaluating LMMs, bridging the gap between AI assessment and interactive entertainment. It highlights the importance of tight perception-action coupling and dynamic reasoning for the advancement of general-purpose visual intelligence. You can read the full research paper here.
