TLDR: A new research paper argues that evaluating AI, especially LLMs, with human-designed tests (like IQ or personality tests) is flawed. These tests are calibrated for human cognition and context, and applying them to AI leads to misinterpretations of AI capabilities. The paper highlights issues like invalidity, lack of measurement invariance, and the risk of anthropomorphizing AI, which can obscure liability. It calls for developing new, principled, AI-specific evaluation frameworks that leverage AI’s unique properties and adopt rigorous measurement science principles.
A new research paper challenges the common practice of evaluating Artificial Intelligence (AI) models, particularly Large Language Models (LLMs), using tests originally designed for humans. The paper, titled “Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead”, argues that this approach leads to fundamental misunderstandings about AI capabilities and calls for the creation of new, tailored evaluation frameworks.
The Core Argument: An Ontological Error
The authors, Tom Sühr, Florian E. Dorner, Olawale Salaudeen, Augustin Kelava, and Samira Samadi, identify what they term an “ontological error.” This error occurs when the impressive performance of LLMs on human-centric tests (standardized exams like the GRE or SAT, IQ tests, or personality inventories) is interpreted as evidence that these AI models possess human-like intelligence or personality. The authors use a simple analogy: placing a heart rate monitor on a robot arm might return a value, but a robot has no pulse, so the reading does not mean what it would mean for a human. Similarly, AI models do not have human brains, bodies, or social contexts, making direct comparisons misleading.
Why Human Tests Fall Short for AI
The researchers explain that human psychological and educational tests are not just collections of questions. They are carefully developed measurement instruments, grounded in theories of human cognition, embodiment, and social context, and calibrated to specific human populations. When these tests are applied to non-human subjects like LLMs without proper empirical validation, their validity as measurement tools is compromised.
A key issue is “measurement invariance,” the property that a test measures the same construct in the same way across different groups. The paper demonstrates that the way test items relate to underlying traits (known as “factor loadings”) can differ significantly between humans and LLMs. For instance, an item designed to measure “open-mindedness” in humans might not capture the same variance in LLM responses, rendering the scores incomparable. This means that even if an LLM scores highly on a personality test, it does not necessarily possess a human-like personality; the test might be measuring something entirely different in the AI.
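To make the idea of differing factor loadings concrete, here is a minimal sketch on synthetic data. It is not the paper's analysis: the loading values, group labels, and sample sizes are hypothetical, loadings are approximated with the first principal component of the correlation matrix rather than a fitted factor model, and Tucker's congruence coefficient stands in for a formal invariance test such as multi-group confirmatory factor analysis.

```python
import numpy as np

def simulate_responses(loadings, n, rng):
    """One-factor model: each item = loading * latent trait + unique noise."""
    loadings = np.asarray(loadings, dtype=float)
    trait = rng.standard_normal(n)
    # Scale the noise so every simulated item has unit variance.
    noise = rng.standard_normal((n, loadings.size)) * np.sqrt(1.0 - loadings**2)
    return trait[:, None] * loadings + noise

def estimate_loadings(responses):
    """Approximate one-factor loadings via the first principal component."""
    corr = np.corrcoef(responses, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues in ascending order
    first = eigvecs[:, -1] * np.sqrt(eigvals[-1])
    return first if first.sum() >= 0 else -first  # fix the arbitrary sign

def tucker_congruence(a, b):
    """Cosine similarity between two loading vectors (1.0 = same pattern)."""
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

rng = np.random.default_rng(0)
# Hypothetical loadings: item 4 tracks the trait in "humans" but not in "LLMs".
human_true = [0.8, 0.7, 0.75, 0.7]
llm_true = [0.8, 0.7, 0.75, 0.0]

human_est = estimate_loadings(simulate_responses(human_true, 5000, rng))
human_est_2 = estimate_loadings(simulate_responses(human_true, 5000, rng))
llm_est = estimate_loadings(simulate_responses(llm_true, 5000, rng))

same_group = tucker_congruence(human_est, human_est_2)
cross_group = tucker_congruence(human_est, llm_est)
print("human vs. human congruence:", round(same_group, 3))
print("human vs. LLM congruence:  ", round(cross_group, 3))
```

In this toy setup, summing the four items yields a score in both groups, but the fourth item contributes trait information only in the first group, so equal total scores would not mean the same thing.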
Furthermore, the paper points out that many AI benchmarks, while useful for tracking engineering progress, have shifted their language to claim they “measure general intelligence.” The authors argue that correlations between these benchmarks might simply reflect factors like model size or the amount of training data, rather than a distinct form of “intelligence.” They also criticize the lack of established standards for AI evaluation, contrasting it with the rigorous frameworks found in fields like psychometrics.
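The confounding argument can be illustrated with a small simulation. This is a sketch under stated assumptions, not the authors' analysis: the benchmark scores, the `log_params` scale variable, and all coefficients are synthetic and hypothetical. Two benchmarks that each depend only on model scale correlate strongly with each other, and the correlation largely disappears once scale is regressed out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models = 2000  # synthetic population of models, chosen only for stable estimates

# Hypothetical log10 parameter counts; the sole shared driver of both scores.
log_params = rng.uniform(8.0, 11.0, n_models)
bench_a = 10.0 * log_params + rng.normal(0.0, 3.0, n_models)
bench_b = 8.0 * log_params + rng.normal(0.0, 3.0, n_models)

def residualize(y, x):
    """Remove the linear effect of x from y using ordinary least squares."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef

raw_corr = np.corrcoef(bench_a, bench_b)[0, 1]
partial_corr = np.corrcoef(
    residualize(bench_a, log_params),
    residualize(bench_b, log_params),
)[0, 1]

print("raw correlation between benchmarks:", round(raw_corr, 3))
print("correlation after removing scale:  ", round(partial_corr, 3))
```

A high raw correlation here says nothing about a shared latent “intelligence”; it is produced entirely by the common dependence on scale that was built into the simulation.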
Risks and Opportunities
Misinterpreting AI capabilities on the basis of human tests carries significant risks. It can lead to “false certification,” where the public or even professionals place undue trust in AI systems, as seen in incidents where lawyers cited fabricated, AI-generated legal cases. It also contributes to “anthropomorphization,” the tendency to attribute human traits to AI, which can obscure liability and allow AI creators to shift blame for harmful outputs onto the models themselves.
Despite these challenges, the paper identifies a significant opportunity. It calls for a new research frontier at the intersection of machine learning, psychometrics, and econometrics to develop principled, AI-specific measurement models. These new frameworks should be grounded in falsifiable theories, leverage the unique properties of AI systems (such as the ability to directly probe causal relationships through interventions), and adopt the methodological rigor of established measurement sciences.
In conclusion, while benchmarking remains a valuable tool for driving progress in machine learning, the authors strongly advocate for a shift in perspective: stop evaluating AI with human tests, and instead, develop principled, AI-specific tests that truly reflect the unique nature and capabilities of these advanced systems.


