TLDR: A new evaluation framework, BioMotion Arena, assesses large language models (LLMs) and multimodal large language models (MLLMs) by testing their ability to generate realistic biological motion as point-light displays. The study found that most leading models, including cutting-edge ones, struggle to produce smooth, biologically plausible human movement, exposing a substantial gap in their understanding of kinetic-geometric patterns. The framework ranks models using human preference data, shows high agreement between crowd and expert evaluations, and effectively distinguishes model strengths.
In the rapidly evolving landscape of artificial intelligence, evaluating the true capabilities of large language models (LLMs) and multimodal large language models (MLLMs) remains a significant challenge. Traditional benchmarks often fall short, either providing numerical scores on static datasets that lack intuitive feedback or relying on subjective textual preferences that can be indistinct and prone to bias.
Addressing this critical need, a new research paper introduces BioMotion Arena, a groundbreaking framework designed to assess these advanced AI models through the lens of visual animation. This novel approach draws inspiration from the remarkable human ability to perceive motion patterns characteristic of living organisms, even from minimal visual cues.
The BioMotion Arena: A Visual Turing Test for AI
At its core, BioMotion Arena utilizes point-light displays. This technique, famously used in psychological studies of biological motion, involves illuminating only the major joints of a moving person. Surprisingly, these simple moving dots are sufficient for humans to perceive a vivid and coherent animation of human movement, distinguishing actions, gender, and even mood. This inherent human visual perception of biological motion serves as the foundation for the new evaluation framework.
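To make the setup concrete, the sketch below shows what a point-light display amounts to in code: a handful of dots at joint positions, animated over time. It is purely illustrative; the 15-joint layout and the sinusoidal limb swing are crude stand-ins of our own, not the paper's actual stimuli.

```python
# Minimal point-light display sketch (illustrative only; not the paper's code).
# A few dots stand in for major joints; crude sine waves fake a walking gait.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

# Rest positions (x, y) for a simplified 15-joint "walker".
JOINTS = np.array([
    (0.0, 1.7),                     # head
    (0.0, 1.4),                     # neck
    (-0.2, 1.35), (0.2, 1.35),      # shoulders (L, R)
    (-0.3, 1.05), (0.3, 1.05),      # elbows
    (-0.35, 0.75), (0.35, 0.75),    # wrists
    (0.0, 0.95),                    # pelvis
    (-0.12, 0.9), (0.12, 0.9),      # hips
    (-0.15, 0.5), (0.15, 0.5),      # knees
    (-0.15, 0.05), (0.15, 0.05),    # ankles
])

def pose(t: float) -> np.ndarray:
    """Joint positions at time t: limbs swing sinusoidally, out of phase."""
    p = JOINTS.copy()
    swing = 0.15 * np.sin(2 * np.pi * t)
    p[4:8:2, 0] += swing    # left elbow/wrist swing forward...
    p[5:8:2, 0] -= swing    # ...while the right arm swings back
    p[11:15:2, 0] -= swing  # legs move counter to the arms
    p[12:15:2, 0] += swing
    return p

fig, ax = plt.subplots(figsize=(3, 5))
ax.set_xlim(-1, 1)
ax.set_ylim(-0.2, 2)
ax.set_aspect("equal")
ax.axis("off")
dots = ax.scatter(*JOINTS.T, s=40, color="black")

def update(frame):
    dots.set_offsets(pose(frame / 30))
    return (dots,)

anim = FuncAnimation(fig, update, frames=120, interval=33, blit=True)
plt.show()
```

A model being evaluated must effectively produce trajectories like `pose` above, but with kinematics convincing enough that a human eye accepts the dots as a living walker.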
The methodology employs a pairwise comparison evaluation. Users are presented with two anonymous animations generated by different large models in response to a prompt (e.g., “A man is walking”). They then vote for the animation they perceive as more realistic and biologically plausible. This lightweight setup allows for efficient collection of diverse prompts and simplifies the comparison process, offering immediate and perceptible feedback on performance differences.
Extensive Evaluation and Key Findings
The researchers conducted extensive experiments, collecting over 45,000 votes from more than 50 human annotators across 53 mainstream LLMs and MLLMs on 90 biological motion variants. The collected human preference data was then translated into a ranking using the Elo rating system, commonly used in chess to rank players.
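To see how pairwise preference votes become a ranking, here is a generic Elo update in Python. The K-factor, initial rating, and the treatment of a ‘Both-are-bad’ vote as a tie are our own assumptions for illustration, not the paper's settings.

```python
# Generic Elo update over pairwise preference votes (illustrative only;
# the paper's K-factor, initial rating, and tie handling are not specified here).
from collections import defaultdict

K = 32            # assumed update step size
INITIAL = 1000.0  # assumed starting rating

def expected(ra: float, rb: float) -> float:
    """Elo-model probability that the first model beats the second."""
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400))

def rank(votes):
    """votes: iterable of (model_a, model_b, score_a), where score_a is
    1.0 if A won, 0.0 if B won, and 0.5 for a tie / 'Both-are-bad' vote."""
    ratings = defaultdict(lambda: INITIAL)
    for a, b, score_a in votes:
        ea = expected(ratings[a], ratings[b])
        ratings[a] += K * (score_a - ea)
        ratings[b] -= K * (score_a - ea)  # zero-sum update
    return sorted(ratings.items(), key=lambda kv: -kv[1])

# Hypothetical vote log, purely for demonstration.
print(rank([("gemini-2.5-pro", "model-x", 1.0),
            ("model-x", "claude-4-opus", 0.0),
            ("gemini-2.5-pro", "claude-4-opus", 0.5)]))
```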
A significant finding from the data analysis is the high agreement between the crowdsourced human votes and those of expert raters, demonstrating the BioMotion Arena’s effectiveness in providing discriminative feedback. However, the results also revealed a substantial gap in current AI capabilities:
- Over 90% of the evaluated models, including cutting-edge open-source models like InternVL3 and proprietary models from the Claude-4 series, failed to produce even fundamental humanoid point-light groups, let alone smooth and biologically plausible motions.
- The average occurrence rate of ‘Both-are-bad’ outcomes reached 79.3% for basic motions and 94.8% for fine-grained variants in code-specific comparisons, indicating a severe deficit in understanding biological motion patterns (a sketch of how such rates are tallied follows this list).
- A significant performance gap separated proprietary models (such as Gemini 2.5 Pro, Claude 4 Opus, and OpenAI’s o3) from most open-source models, with proprietary models generally performing better.
- Models with multi-step reasoning capabilities outperformed traditional LLMs, suggesting that such strategies can strengthen a model’s understanding of biological movement.
- Even the top-performing models received only moderate absolute ratings (three or less on a five-point Likert scale), underscoring both the difficulty of BioMotion Arena and the gap that remains in generating highly realistic biological motion.
- The framework proved effective at distinguishing models’ strengths, particularly in actions involving vigorous movement, where models showed varying collapse rates and win rates.
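For concreteness, here is a minimal sketch of how ‘Both-are-bad’ rates and per-model win rates could be tallied from a vote log. The vote schema and field names are hypothetical, not the paper's actual data format.

```python
# Tallying 'Both-are-bad' rates and per-model win rates from a vote log.
# The vote schema here is a hypothetical stand-in, not the paper's format.
from collections import Counter

def tally(votes):
    """votes: list of dicts like {"a": ..., "b": ..., "winner": name or "both-bad"}."""
    both_bad = sum(1 for v in votes if v["winner"] == "both-bad")
    wins, appearances = Counter(), Counter()
    for v in votes:
        appearances[v["a"]] += 1
        appearances[v["b"]] += 1
        if v["winner"] != "both-bad":
            wins[v["winner"]] += 1
    both_bad_rate = both_bad / len(votes)
    win_rates = {m: wins[m] / appearances[m] for m in appearances}
    return both_bad_rate, win_rates

# Hypothetical example log.
votes = [
    {"a": "model-x", "b": "model-y", "winner": "both-bad"},
    {"a": "model-x", "b": "model-y", "winner": "model-x"},
    {"a": "model-y", "b": "model-z", "winner": "model-z"},
]
rate, win_rates = tally(votes)
print(f"both-are-bad rate: {rate:.1%}", win_rates)
```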
The study also explored how the number of point-lights affects motion generation, finding that tasks specifying either too many or too few points posed greater challenges to the models, which enables even more rigorous evaluation.
Implications and Future Outlook
BioMotion Arena serves as a challenging benchmark for performance visualization and a flexible evaluation framework that does not rely on ground-truth data. It offers an intuitive, efficient, and easily perceptible way to compare the performance of large models in generating visual biological motion.
The researchers plan to make BioMotion Arena an open-access online evaluation platform and continuously release the collected human preference data to foster future research and development in this critical area of AI. This work, detailed in the paper “Can Large Models Fool the Eye? A New Turing Test for Biological Animation”, paves the way for a deeper understanding of AI’s ability to grasp and replicate the complexities of human movement.