TLDR: A new study evaluates Centaur, a large language model designed to simulate human cognition. While Centaur demonstrates strong predictive accuracy on tasks it was trained on, its ability to generate realistic, human-like behavior independently falls short. The research highlights that Centaur struggles to reproduce key behavioral patterns in cognitive tasks, including those it was fine-tuned for, and performs poorly on novel tasks. This suggests Centaur is not yet a reliable participant simulator, emphasizing the critical difference between predicting and generating behavior for AI models in cognitive science.
Simulators have transformed scientific research across many fields, allowing scientists to quickly test ideas and improve experimental designs. A prime example is AlphaFold in biology, which accurately predicts protein structures, speeding up drug development and the study of protein function. In the behavioral sciences, a similar breakthrough would be a reliable participant simulator: a system that can produce human-like behavior across various cognitive tasks.
Recently, a large language model (LLM) called Centaur was introduced. It was fine-tuned using human data from 160 experiments and proposed as both a model of cognition and a participant simulator for testing experimental studies virtually. This paper reviews the key criteria for a participant simulator and evaluates how well Centaur meets these standards.
A fundamental requirement for any behavioral simulator is its capacity to generate the behavioral patterns observed in real experiments. This means it should not only reproduce known effects across different tasks and conditions but also generalize to new, unexplored situations. Such generalization is vital for refining hypotheses, comparing models, or designing experiments in new areas. A simulator doesn’t necessarily need to explain the underlying mechanisms; its value lies in its ability to produce realistic human-like behavior.
A crucial distinction when evaluating simulators is between predictive and generative performance. A model might be highly accurate at predicting a participant’s next response based on their past actions (predictive performance), but fail to produce believable behavior when acting independently (generative performance). For instance, a simple model that just repeats the previous choice might predict well in a reversal learning task, but when run generatively, it would fail to adapt after a rule change, deviating significantly from human behavior.
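The gap between the two kinds of performance can be illustrated with a small simulation. The sketch below is not the paper's method: the "participant" is a hypothetical epsilon-greedy value learner standing in for human data, and all function names, parameter values, and the trial structure are assumptions chosen for illustration. It scores a repeat-the-previous-choice model twice: once predictively, where it sees the participant's actual previous choice on every trial, and once generatively, where it sees only its own previous choice.

```python
import random


def simulate_learner(n_trials=200, reversal=100, p_good=0.8,
                     alpha=0.3, eps=0.1, seed=0):
    """Hypothetical stand-in for a participant: an epsilon-greedy value learner.

    Option 0 is the better option before `reversal`, option 1 afterwards.
    """
    rng = random.Random(seed)
    q = [0.5, 0.5]  # running value estimates for the two options
    choices = []
    for t in range(n_trials):
        best = 0 if t < reversal else 1
        if rng.random() < eps:
            choice = rng.randrange(2)          # occasional random exploration
        else:
            choice = 0 if q[0] >= q[1] else 1  # otherwise pick the higher value
        p_reward = p_good if choice == best else 1 - p_good
        reward = 1.0 if rng.random() < p_reward else 0.0
        q[choice] += alpha * (reward - q[choice])
        choices.append(choice)
    return choices


def predictive_accuracy(choices):
    """Predictive mode: guess each choice by copying the participant's previous one."""
    hits = sum(prev == cur for prev, cur in zip(choices, choices[1:]))
    return hits / (len(choices) - 1)


def generate_repeat_model(first_choice, n_trials):
    """Generative mode: the model only ever sees its own previous choice."""
    return [first_choice] * n_trials


def frac_best_after_reversal(choices, reversal):
    """Share of post-reversal trials on which option 1 (the new best) was chosen."""
    post = choices[reversal:]
    return sum(c == 1 for c in post) / len(post)


if __name__ == "__main__":
    participant = simulate_learner()
    generated = generate_repeat_model(first_choice=0, n_trials=200)
    print("predictive accuracy of repeat model:", predictive_accuracy(participant))
    print("participant, best option after reversal:",
          frac_best_after_reversal(participant, 100))
    print("repeat model run generatively, best option after reversal:",
          frac_best_after_reversal(generated, 100))
```

Because the simulated participant tends to stick with an option for long stretches, copying its previous choice predicts most trials correctly. Run on its own, the same model never leaves its first option, so it never selects the better option after the reversal.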
While Centaur has shown strong predictive performance, its ability to generate behavior—a critical aspect for both cognitive models and behavioral simulators—had not been extensively tested. This research evaluated Centaur’s predictive and generative performance across three different tasks:
Reversal Learning Task
In this task, participants choose between two options, with reward probabilities switching unexpectedly. Humans typically adapt their choices after such a switch. Centaur showed better predictive performance than other large language models, but its generative behavior exhibited weaker reversal dynamics. In some simulations, Centaur completely failed to adapt its choices after the reward reversal, unlike human behavior.
Horizon-Dependent Bandit Task
This task involves balancing exploration (gathering information) and exploitation (choosing the best-known option) under varying time constraints. Centaur’s predictive performance was comparable to specialized models. However, its generative behavior significantly differed from human data, failing to capture the expected effects of the time horizon manipulation or the typical shift from exploration to exploitation over time.
Wisconsin Card Sorting Test (WCST)
This task, which Centaur was not specifically trained on, requires participants to infer and apply a hidden card-sorting rule and then flexibly switch when the rule changes. Humans often make specific types of errors, such as perseverative errors (continuing to apply the old rule after it has changed) and set-loss errors (failing to maintain the correct rule once it has been found). In this task, a domain-specific model outperformed Centaur on both predictive and generative measures. Centaur did not achieve human-like accuracy, making many more perseverative and set-loss errors than human participants.
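To make the two error types concrete, here is a minimal scoring sketch. It is a simplification, not the paper's scoring procedure: it assumes each response has already been labeled with the dimension the participant sorted by, ignores cards that match on more than one dimension, and uses a hypothetical threshold of five consecutive correct responses (`streak_needed`) to decide when a rule counts as established. Published WCST scoring conventions differ in these details.

```python
def score_wcst(trials, streak_needed=5):
    """Count simplified WCST error types.

    trials: ordered list of (active_rule, sorted_by) pairs, e.g. ("color", "shape").
    streak_needed: assumed number of consecutive correct sorts after which an
        error is treated as losing an established set.
    """
    counts = {"correct": 0, "perseverative": 0, "set_loss": 0, "other": 0}
    previous_rule = None  # rule that was active before the most recent change
    current_rule = None
    streak = 0            # consecutive correct sorts under the current rule
    for active_rule, sorted_by in trials:
        if active_rule != current_rule:
            # The hidden rule changed: remember the old one and reset the streak.
            previous_rule, current_rule, streak = current_rule, active_rule, 0
        if sorted_by == active_rule:
            counts["correct"] += 1
            streak += 1
            continue
        if streak >= streak_needed:
            counts["set_loss"] += 1        # abandoned a rule that was established
        elif sorted_by == previous_rule:
            counts["perseverative"] += 1   # still sorting by the old rule
        else:
            counts["other"] += 1
        streak = 0
    return counts


if __name__ == "__main__":
    example = (
        [("color", "color")] * 6     # rule found and maintained
        + [("color", "shape")]       # slips after six correct sorts
        + [("shape", "color")] * 2   # rule changed, keeps sorting by color
        + [("shape", "shape")] * 3   # adapts to the new rule
    )
    print(score_wcst(example))
```

Applying the same counts to human data and to model-generated sequences gives a direct way to compare error profiles, which is the kind of qualitative pattern the task is designed to measure.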
The findings suggest that while Centaur excels at predicting choices on tasks it was trained on, it struggles to reproduce the qualitative, human-like behavior that these tasks are designed to measure. Its limitations are even more apparent on tasks outside its training set. To truly serve as a synthetic participant or an accurate model of cognition, Centaur needs to reliably generate behavior from scratch, not just predict trial-by-trial choices based on human histories. The authors suggest that incorporating mechanistic constraints or developing standardized benchmarks for generative performance could be promising future directions. While Centaur is a significant step, its current inability to faithfully simulate behavior means it cannot yet be considered a “behavioral AlphaFold.”


