TLDR: A research project called “ROBOPSY PL[AI]” used a public art installation and a role-playing game about the 1936 murder of philosopher Moritz Schlick to study how different Large Language Models (LLMs) present historical events and collective memory. Visitors interacted with various LLMs, and the study found significant differences in how these AIs narrated the history, their factual accuracy, and the emotional tone of their responses. The project highlights the potential of playful methods to engage the public in critically examining AI’s influence on our perception of the past.
A fascinating artistic research project, dubbed “ROBOPSY PL[AI]”, has delved into how Large Language Models (LLMs) curate and present collective memory. This initiative, showcased as a public installation in Vienna during 2025, invited visitors to engage with five different LLMs through a unique role-playing game.
The core of the experiment was a text-based role-playing game centered on the historical murder of Austrian philosopher Moritz Schlick in 1936. Players, cast as time-travelers from 2036, were tasked with investigating the reasons behind Schlick’s death. The LLMs involved included popular models such as ChatGPT (GPT-4o and GPT-4o mini), Mistral Large, DeepSeek-Chat, and a locally run Llama 3.1 model.
Interaction with the LLMs was intentionally simplified, using a custom-made input device with only four buttons for choices and a reset button. This design choice was crucial for two main reasons: firstly, to prevent the LLMs from generating overly fantastical or historically divergent narratives, which often happened with free text input; and secondly, to make the game accessible and easily playable for a diverse audience in an exhibition setting.
Each LLM was given the same prompt sheet, instructing it to act as a game master, adhere to historical facts as closely as possible, and incorporate political events of 1936 Vienna. The game was structured with a ten-turn limit, after which players received a summary of their success in uncovering the murder’s motivation. This limit was introduced to maintain narrative focus and create a sense of urgency for the players.
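The session structure described above — four buttons, a hard ten-turn limit, and a closing summary — can be sketched as a minimal game loop. Everything below is an illustrative assumption rather than the project's actual code: the button labels, the `game_master_reply` stub standing in for the real LLM call, and the summary wording are all invented.

```python
MAX_TURNS = 10                      # the ten-turn limit described above
BUTTONS = ("A", "B", "C", "D")      # four-choice input device (labels assumed)

def game_master_reply(turn: int, choice: str) -> str:
    """Stand-in for the LLM call; a real build would send the prompt
    sheet plus the running transcript to a chat model each turn."""
    return f"Turn {turn}: you chose {choice}; the investigation continues."

def play(button_presses: list[str]) -> list[str]:
    """Run one session, enforcing the button set and the turn limit."""
    transcript = []
    for turn, choice in enumerate(button_presses, start=1):
        if choice not in BUTTONS:
            raise ValueError(f"unknown button: {choice}")
        transcript.append(game_master_reply(turn, choice))
        if turn == MAX_TURNS:       # hard stop, then the closing summary
            transcript.append("Summary: how close did you get to the motive?")
            break
    return transcript
```

Capping the loop at a fixed turn count, rather than letting the model decide when to end, is what gives every visitor a bounded session and forces the narrative toward a resolution.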
Qualitative analysis of the gameplay revealed several intriguing aspects. Players experienced what the researchers termed “fluctuating agency,” where the scope and logic of their actions were constantly modified by the LLM, making it difficult to predict outcomes. While most LLMs correctly identified Johann Nelböck as Schlick’s murderer, they sometimes introduced historically inaccurate or entirely invented characters. More significantly, the LLMs differed in how they presented the motives for the murder. For instance, ChatGPT often emphasized the influence of right-wing ideology on Nelböck, whereas Grok and Mistral tended to downplay this, focusing more on Nelböck’s mental health and personal grievances.
This divergence highlighted a critical point: when prompted to act as critics, the LLMs adopted a fact-checking, positivistic approach to history, a method long questioned by academic historians for its lack of interpretation. Ironically, ChatGPT, in its role-playing, offered an implicit interpretation by stressing the political climate, moving beyond mere factual presentation.
User feedback from the exhibition was diverse; players fell into three main groups: those interested in differences of content and style between the LLMs, those focused on the political relevance of the play, and art lovers curious about AI in art. Many visitors, including those new to LLMs, found the comparison between different models particularly enlightening. One young woman reported a profound experience of inadvertently being led into a “fascist role” by the game, leading her to reflect on “false memories.” An elderly, initially skeptical visitor also changed their perspective on LLMs and critical media art after playing.
Quantitative analysis of 115 introductory texts generated by the LLMs further underscored these differences. Semantic similarity analysis showed that Llama 3.1’s introductions were distinctly different from other models. Named Entity Recognition revealed varying frequencies of historical figures mentioned; for example, “Schlick” appeared in 71 of 115 intros, but never in Gemini 2.5’s. Llama 3.1 also exhibited a tendency to hallucinate historical figures who were either dead or not present in Vienna at the time.
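Per-figure counts like “Schlick appeared in 71 of 115 intros” are document frequencies: in how many generated texts does each name occur at least once? The sketch below reproduces that counting step on invented toy intros and an assumed name list; the study itself used Named Entity Recognition to find the figures, whereas this sketch simplifies to plain string matching.

```python
from collections import Counter

# Toy stand-ins for the 115 generated introduction texts.
intros = [
    "Vienna, 1936. Moritz Schlick walks up the university staircase.",
    "You arrive in Vienna; the city hums with political tension.",
    "Schlick and the Vienna Circle await your investigation.",
]

# Historical figures of interest (a plausible subset, assumed here).
names = ["Schlick", "Nelböck", "Wittgenstein"]

# Count in how many intros each name appears at least once.
doc_freq = Counter()
for text in intros:
    for name in names:
        if name in text:
            doc_freq[name] += 1

print(dict(doc_freq))  # e.g. Schlick appears in 2 of the 3 toy intros
```

Running the same count separately per model is what exposes gaps like a model never mentioning Schlick at all.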
Sentiment analysis using VADER scores indicated that while most intros were neutral, DeepSeek and Claude conveyed a negative sentiment, contrasting with the positive scores from Mistral-Large and GPT 4o. This suggests inherent differences in the emotional tone of the narratives generated by different LLMs.
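VADER’s “compound” score maps summed per-token valence into the range [-1, 1] via a fixed normalization (the reference implementation uses alpha = 15). The sketch below reproduces just that normalization step on hand-made valence lists, rather than calling the vaderSentiment package; the example word valences are invented for illustration.

```python
import math

ALPHA = 15  # normalization constant from VADER's reference implementation

def compound(valences: list[float]) -> float:
    """Normalize a sum of per-token valence scores into [-1, 1],
    mirroring VADER's final score_valence step."""
    s = sum(valences)
    return s / math.sqrt(s * s + ALPHA)

# Illustrative token valences (made up): a darker intro vs. an upbeat one.
dark = [-1.9, -0.5, -2.1]    # e.g. "murder", "fear", "threat"
upbeat = [1.5, 2.0]          # e.g. "welcome", "fascinating"

print(compound(dark), compound(upbeat))  # negative vs. positive compound
```

The square-root normalization saturates gradually, so a single strongly negative word can pull an otherwise neutral intro toward a negative compound score — one reason short generated texts can diverge sharply in measured sentiment.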
The study concludes that this artistic role-playing game effectively demonstrated significant differences in how various LLMs present historical events, both in terms of semantic content and sentiment. These findings challenge the common perception of LLMs as uniformly biased and highlight the importance of understanding their diverse outputs. The project also successfully engaged a broad audience in critical discussions about AI’s impact on our understanding of history and collective memory. The next phase of this research will explore the broader societal implications of AI reshaping collective memory. For more details, you can read the full paper here.


