AI Models Battle in Mortal Kombat II: A New Benchmark for Multimodal Intelligence

TLDR: Researchers have developed “LM Fight Arena,” a novel benchmark that evaluates large multimodal models (LMMs) by having them compete in the classic fighting game Mortal Kombat II. This framework assesses LMMs’ real-time visual understanding and sequential decision-making in an adversarial environment. In a round-robin tournament, Claude 3.5 Sonnet emerged as the undefeated champion, demonstrating superior perception-action coupling and strategic reasoning. Gemini 2.5 Pro secured second place, while models like GPT-4o struggled significantly, highlighting the challenge of dynamic decision-making for current LMMs. The benchmark emphasizes the need for evaluation methods that go beyond static tasks to truly test AI capabilities in interactive scenarios.

Large multimodal models (LMMs), which integrate visual perception with language understanding, are making significant strides on complex tasks. However, traditional benchmarks are largely static and often fall short in evaluating these models in dynamic, real-time, and adversarial environments, which are crucial for real-world applications.

Introducing LM Fight Arena

To address this critical gap, researchers from Shanghai Jiao Tong University and Shanghai AI Lab have introduced a novel framework called LM Fight Arena. This innovative benchmark evaluates LMMs by pitting them against each other in the classic fighting game Mortal Kombat II. This task demands rapid visual understanding, tactical reasoning, and sequential decision-making, providing a rigorous testbed for AI capabilities.

Unlike static evaluations, LM Fight Arena offers a fully automated, reproducible, and objective assessment of an LMM’s strategic reasoning in a dynamic setting. The choice of Mortal Kombat II is deliberate; its structured mechanics, clear health bars, distinct character animations, and an 8-button action space provide a rich yet interpretable environment for AI evaluation.

How the Tournament Works

The evaluation framework is meticulously controlled. All competing agents control the same character, Liu Kang, ensuring a fair comparison by eliminating character-specific advantages. The models receive real-time visual frames from the game emulator, sampled every fourth frame to provide approximately one second of context. These visual inputs are augmented with structured game state information, including health bars, character coordinates, facing direction, and a history of the last five actions. All this information is bundled into a natural-language state description, allowing each LMM to receive a consistent mix of visual and symbolic cues.
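The observation pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the researchers' code: the names `sample_frames`, `FighterState`, and `describe_state`, as well as the exact wording of the state description, are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_frames(frames: Sequence[T], stride: int = 4) -> List[T]:
    """Keep every `stride`-th frame, starting with the first one."""
    return list(frames[::stride])


@dataclass
class FighterState:
    # Hypothetical container for the symbolic cues named in the article.
    health: int
    x: int
    y: int
    facing: str  # "left" or "right"


def describe_state(
    me: FighterState,
    opponent: FighterState,
    action_history: Sequence[str],
    history_len: int = 5,
) -> str:
    """Bundle the symbolic game state into one natural-language string."""
    # Only the most recent `history_len` actions are included.
    recent = list(action_history)[-history_len:]
    history = ", ".join(recent) if recent else "none"
    return (
        f"You: health {me.health}, position ({me.x}, {me.y}), "
        f"facing {me.facing}. "
        f"Opponent: health {opponent.health}, "
        f"position ({opponent.x}, {opponent.y}), facing {opponent.facing}. "
        f"Your last actions: {history}."
    )
```

In a setup like this, the sampled frames and the description string would be sent to the model together, so every model sees the same mix of visual and symbolic input.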

Models then output their next actions as natural language commands (e.g., “Left + A” or “Down, Forward, A”), which are parsed and translated into Sega Genesis button presses by a dedicated module. This entire control loop operates for each frame, demanding immediate processing and decision-making.
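A parser for such commands might look like the sketch below. It rests on several assumptions that the article does not confirm: that "+" means buttons pressed together, that "," separates consecutive steps, that "Forward" and "Back" are resolved from the character's facing direction, and that the eight buttons are the D-pad directions plus A, B, C, and Start. The function name `parse_command` is hypothetical.

```python
from typing import FrozenSet, List

# Assumed 8-button action space; the article does not list the buttons.
BUTTONS = {"UP", "DOWN", "LEFT", "RIGHT", "A", "B", "C", "START"}


def parse_command(command: str, facing: str = "right") -> List[FrozenSet[str]]:
    """Turn a command string into a list of simultaneous button sets.

    Each list element is one step; the buttons inside a step are
    pressed together.
    """
    steps: List[FrozenSet[str]] = []
    for step in command.split(","):
        pressed = set()
        for token in step.split("+"):
            name = token.strip().upper()
            # Resolve relative directions using the facing direction.
            if name == "FORWARD":
                name = "RIGHT" if facing == "right" else "LEFT"
            elif name == "BACK":
                name = "LEFT" if facing == "right" else "RIGHT"
            if name not in BUTTONS:
                raise ValueError(f"unknown button: {token.strip()!r}")
            pressed.add(name)
        steps.append(frozenset(pressed))
    return steps
```

Rejecting unknown tokens with an error, rather than ignoring them, makes malformed model output visible to the harness instead of silently turning it into a no-op.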

The Competitors and Results

Six leading LMMs were evaluated in a round-robin tournament: three open-source models (InternVL3-78B-Instruct, Qwen2.5-VL-32B-Instruct, and Qwen2.5-VL-72B-Instruct) and three closed-source models (Claude-3.5-Sonnet, Gemini-2.5-Pro, and GPT-4o). Each model was given an identical system prompt outlining the game’s objective, observation format, and available actions.

The tournament results revealed a clear hierarchy of performance. Claude 3.5 Sonnet emerged as the undisputed champion, completing the round-robin undefeated with a 100% win rate and consistently large health margins against its opponents. Gemini 2.5 Pro secured second place with an 80% win rate, demonstrating strong performance with decisive victories over all open-source models and a narrow loss to Claude.

The Qwen family models occupied the middle ground, with Qwen2.5-VL-72B achieving a 60% win rate, often exploiting GPT-4o’s defensive weaknesses. Qwen2.5-VL-32B finished fourth at 40%. InternVL3-78B struggled offensively, ending with a 20% win rate. Notably, GPT-4o failed to secure a single win, despite its strong performance on static tasks, highlighting a significant gap in its ability to perform in dynamic, real-time environments.
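The reported win rates can be checked against the round-robin format with a little arithmetic: six models yield 15 pairings, each model plays five matches, and the implied win counts should add up to 15. The sketch below uses only the percentages quoted above; the variable names are illustrative.

```python
from itertools import combinations

# Win rates in percent, as reported in the article.
REPORTED_WIN_RATE_PCT = {
    "Claude-3.5-Sonnet": 100,
    "Gemini-2.5-Pro": 80,
    "Qwen2.5-VL-72B-Instruct": 60,
    "Qwen2.5-VL-32B-Instruct": 40,
    "InternVL3-78B-Instruct": 20,
    "GPT-4o": 0,
}

models = list(REPORTED_WIN_RATE_PCT)

# Round-robin: every unordered pair of models meets once.
pairings = list(combinations(models, 2))

# Each model faces every other model exactly once.
matches_per_model = len(models) - 1

# Convert each percentage into an implied number of wins.
implied_wins = {
    name: pct * matches_per_model // 100
    for name, pct in REPORTED_WIN_RATE_PCT.items()
}

# Every match has exactly one winner, so wins must equal matches.
consistent = sum(implied_wins.values()) == len(pairings)
```

The implied records run from 5 wins for Claude down to 0 for GPT-4o, and the totals match, so the published percentages are consistent with one decisive match per pair.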

Insights into Multimodal Reasoning

The study found that successful LMMs, like Claude and Gemini, excelled at precise visual parsing combined with rapid temporal reasoning. They consistently tracked opponent states, adjusted button sequences based on adversary actions, and coordinated movements like dashes and blocks, suggesting an ability to reason over short action histories to predict reversals. In contrast, GPT-4o’s zero-win record pointed to a policy that over-indexed on safe, passive responses, leading to exploitable delays.

This benchmark underscores that high linguistic competence, as seen in models like GPT-4o, does not automatically translate to effective closed-loop decision-making in dynamic, adversarial settings. Claude’s dominance, meanwhile, aligns with its reported fast tool-use abilities, suggesting that latency-aware training is beneficial for game-playing scenarios.

Future Directions and Implications

While the LM Fight Arena provides valuable insights, the researchers acknowledge limitations, such as conducting only a single match per pair and evaluating models in a zero-shot setting without game-specific fine-tuning. Future work aims to enhance statistical robustness by expanding to multi-match series and exploring other fighting game franchises and character archetypes to test strategy transferability.
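To illustrate why a single match per pair limits statistical robustness, the sketch below computes a Wilson score interval for a win rate. This is a standard textbook method chosen for this example, not an analysis the researchers are reported to have run; the function name `wilson_interval` and the 95% level (z = 1.96) are assumptions.

```python
import math
from typing import Tuple


def wilson_interval(wins: int, matches: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a win rate observed over `matches` games."""
    if matches <= 0:
        raise ValueError("matches must be positive")
    p = wins / matches
    denom = 1 + z * z / matches
    center = (p + z * z / (2 * matches)) / denom
    half_width = (
        z
        * math.sqrt(p * (1 - p) / matches + z * z / (4 * matches * matches))
        / denom
    )
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# An undefeated five-match record, as in a six-model round-robin.
low_5, high_5 = wilson_interval(5, 5)

# The same perfect record over a hypothetical 50-match series.
low_50, high_50 = wilson_interval(50, 50)
```

With only five matches, the lower bound for an undefeated model sits a little above 0.56, while a perfect record over 50 matches would push it above 0.9. That gap is the kind of uncertainty multi-match series are meant to reduce.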

Despite these constraints, the benchmark has significant implications. Fighting games offer a controlled yet dynamic environment for stress-testing AI decision quality, making LM Fight Arena a potential candidate for standardizing evaluations of embodied AI assistants. The techniques developed for synchronizing multimodal observations with language-based action plans could also transfer to fields like robotics, teleoperation, and human-AI teaming.

LM Fight Arena represents a crucial step forward in evaluating LMMs, bridging the gap between AI assessment and interactive entertainment. It highlights the importance of tight perception-action coupling and dynamic reasoning for the advancement of general-purpose visual intelligence. Further details are available in the full research paper.

Nikhil Patel
