TLDR: A new evaluation framework, BioMotion Arena, assesses large language models (LLMs) and multimodal large language models (MLLMs) by testing their ability to generate realistic biological motion as point-light displays. The study found that most leading models, including cutting-edge ones, struggle to produce smooth, biologically plausible human movement, exposing a substantial gap in their understanding of kinetic-geometric patterns. The framework ranks models using human preference data, shows high agreement between crowd and expert evaluations, and effectively distinguishes model strengths.
In the rapidly evolving landscape of artificial intelligence, evaluating the true capabilities of large language models (LLMs) and multimodal large language models (MLLMs) remains a significant challenge. Traditional benchmarks often fall short, either providing numerical scores on static datasets that lack intuitive feedback or relying on subjective textual preferences that can be indistinct and prone to bias.
Addressing this critical need, a new research paper introduces BioMotion Arena, a groundbreaking framework designed to assess these advanced AI models through the lens of visual animation. This novel approach draws inspiration from the remarkable human ability to perceive motion patterns characteristic of living organisms, even from minimal visual cues.
The BioMotion Arena: A Visual Turing Test for AI
At its core, BioMotion Arena utilizes point-light displays. This technique, famously used in psychological studies of biological motion, involves illuminating only the major joints of a moving person. Surprisingly, these simple moving dots are sufficient for humans to perceive a vivid and coherent animation of human movement, distinguishing actions, gender, and even mood. This inherent human visual perception of biological motion serves as the foundation for the new evaluation framework.
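To make the setup concrete, the sketch below shows what a point-light display amounts to in code: a handful of dots at joint positions, animated over time. It is purely illustrative; the 15-joint layout and the sinusoidal limb swing are crude stand-ins of our own, not the paper's actual stimuli.

```python
# Minimal point-light display sketch (illustrative only; not the paper's code).
# A few dots stand in for major joints; crude sine waves fake a walking gait.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

# Rest positions (x, y) for a simplified 15-joint "walker".
JOINTS = np.array([
    (0.0, 1.7),                     # head
    (0.0, 1.4),                     # neck
    (-0.2, 1.35), (0.2, 1.35),      # shoulders (L, R)
    (-0.3, 1.05), (0.3, 1.05),      # elbows
    (-0.35, 0.75), (0.35, 0.75),    # wrists
    (0.0, 0.95),                    # pelvis
    (-0.12, 0.9), (0.12, 0.9),      # hips
    (-0.15, 0.5), (0.15, 0.5),      # knees
    (-0.15, 0.05), (0.15, 0.05),    # ankles
])

def pose(t: float) -> np.ndarray:
    """Joint positions at time t: limbs swing sinusoidally, out of phase."""
    p = JOINTS.copy()
    swing = 0.15 * np.sin(2 * np.pi * t)
    p[4:8:2, 0] += swing    # left elbow/wrist swing forward...
    p[5:8:2, 0] -= swing    # ...while the right arm swings back
    p[11:15:2, 0] -= swing  # legs move counter to the arms
    p[12:15:2, 0] += swing
    return p

fig, ax = plt.subplots(figsize=(3, 5))
ax.set_xlim(-1, 1)
ax.set_ylim(-0.2, 2)
ax.set_aspect("equal")
ax.axis("off")
dots = ax.scatter(*JOINTS.T, s=40, color="black")

def update(frame):
    dots.set_offsets(pose(frame / 30))
    return (dots,)

anim = FuncAnimation(fig, update, frames=120, interval=33, blit=True)
plt.show()
```

A model being evaluated must effectively produce trajectories like `pose` above, but with kinematics convincing enough that a human eye accepts the dots as a living walker.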
The methodology employs a pairwise comparison evaluation. Users are presented with two anonymous animations generated by different large models in response to a prompt (e.g., “A man is walking”). They then vote for the animation they perceive as more realistic and biologically plausible. This lightweight setup allows for efficient collection of diverse prompts and simplifies the comparison process, offering immediate and perceptible feedback on performance differences.
Extensive Evaluation and Key Findings
The researchers conducted extensive experiments, collecting over 45,000 votes from more than 50 human annotators across 53 mainstream LLMs and MLLMs on 90 biological motion variants. The collected human preference data was then translated into a ranking using the Elo rating system, commonly used in chess to rank players.
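To see how pairwise preference votes become a ranking, here is a generic Elo update in Python. The K-factor, initial rating, and the treatment of a ‘Both-are-bad’ vote as a tie are our own assumptions for illustration, not the paper's settings.

```python
# Generic Elo update over pairwise preference votes (illustrative only;
# the paper's K-factor, initial rating, and tie handling are not specified here).
from collections import defaultdict

K = 32            # assumed update step size
INITIAL = 1000.0  # assumed starting rating

def expected(ra: float, rb: float) -> float:
    """Elo-model probability that the first model beats the second."""
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400))

def rank(votes):
    """votes: iterable of (model_a, model_b, score_a), where score_a is
    1.0 if A won, 0.0 if B won, and 0.5 for a tie / 'Both-are-bad' vote."""
    ratings = defaultdict(lambda: INITIAL)
    for a, b, score_a in votes:
        ea = expected(ratings[a], ratings[b])
        ratings[a] += K * (score_a - ea)
        ratings[b] -= K * (score_a - ea)  # zero-sum update
    return sorted(ratings.items(), key=lambda kv: -kv[1])

# Hypothetical vote log, purely for demonstration.
print(rank([("gemini-2.5-pro", "model-x", 1.0),
            ("model-x", "claude-4-opus", 0.0),
            ("gemini-2.5-pro", "claude-4-opus", 0.5)]))
```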
A significant finding from the data analysis is the high agreement between the crowdsourced human votes and those of expert raters, demonstrating the BioMotion Arena’s effectiveness in providing discriminative feedback. However, the results also revealed a substantial gap in current AI capabilities:
- Over 90% of the evaluated models, including cutting-edge open-source models like InternVL3 and proprietary models from the Claude-4 series, failed to produce even fundamental humanoid point-light groups, let alone smooth and biologically plausible motions.
- The average occurrence rate of ‘Both-are-bad’ outcomes reached 79.3% for basic motions and 94.8% for fine-grained variants in code-specific comparisons, indicating a severe deficit in understanding biological motion patterns (a sketch of how such rates are tallied follows this list).
- A significant performance gap separated proprietary models (such as Gemini 2.5 Pro, Claude 4 Opus, and OpenAI’s o3) from most open-source models, with proprietary models generally performing better.
- Models with multi-step reasoning capabilities outperformed traditional LLMs, suggesting that such strategies can strengthen a model’s understanding of biological movement.
- Even the top-performing models received only moderate absolute ratings (three or less on a five-point Likert scale), underscoring both the difficulty of BioMotion Arena and the gap that remains in generating highly realistic biological motion.
- The framework proved effective at distinguishing models’ strengths, particularly in actions involving vigorous movement, where models showed varying collapse rates and win rates.
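For concreteness, here is a minimal sketch of how ‘Both-are-bad’ rates and per-model win rates could be tallied from a vote log. The vote schema and field names are hypothetical, not the paper's actual data format.

```python
# Tallying 'Both-are-bad' rates and per-model win rates from a vote log.
# The vote schema here is a hypothetical stand-in, not the paper's format.
from collections import Counter

def tally(votes):
    """votes: list of dicts like {"a": ..., "b": ..., "winner": name or "both-bad"}."""
    both_bad = sum(1 for v in votes if v["winner"] == "both-bad")
    wins, appearances = Counter(), Counter()
    for v in votes:
        appearances[v["a"]] += 1
        appearances[v["b"]] += 1
        if v["winner"] != "both-bad":
            wins[v["winner"]] += 1
    both_bad_rate = both_bad / len(votes)
    win_rates = {m: wins[m] / appearances[m] for m in appearances}
    return both_bad_rate, win_rates

# Hypothetical example log.
votes = [
    {"a": "model-x", "b": "model-y", "winner": "both-bad"},
    {"a": "model-x", "b": "model-y", "winner": "model-x"},
    {"a": "model-y", "b": "model-z", "winner": "model-z"},
]
rate, win_rates = tally(votes)
print(f"both-are-bad rate: {rate:.1%}", win_rates)
```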
The study also explored how the number of point-lights affects motion generation, finding that tasks specifying either too many or too few points posed greater challenges to the models, which enables even more rigorous evaluation.
Implications and Future Outlook
BioMotion Arena serves as a challenging benchmark for performance visualization and a flexible evaluation framework that does not rely on ground-truth data. It offers an intuitive, efficient, and easily perceptible way to compare the performance of large models in generating visual biological motion.
The researchers plan to make BioMotion Arena an open-access online evaluation platform and continuously release the collected human preference data to foster future research and development in this critical area of AI. This work, detailed in the paper “Can Large Models Fool the Eye? A New Turing Test for Biological Animation”, paves the way for a deeper understanding of AI’s ability to grasp and replicate the complexities of human movement.