TLDR: The research explores how different prompt designs (scaffolds) affect non-player character (NPC) dialogue in games powered by Large Language Models (LLMs). Through a detective game called “The Interview,” a usability study found players didn’t perceive significant differences between highly constrained and minimally constrained prompts, focusing instead on technical issues. A subsequent synthetic evaluation revealed that scaffolding effects are role-dependent: rigid prompts improved consistency for quest-giver NPCs but reduced improvisational believability for suspect NPCs. The paper introduces “Symbolically Scaffolded Play,” a framework that uses fuzzy, numerical boundaries to stabilize coherence where necessary while preserving improvisation for engaging player experiences.
Large Language Models (LLMs) are rapidly changing how we imagine interactive games, particularly by enabling non-player characters (NPCs) to engage in unscripted, dynamic conversations. This exciting prospect, however, comes with a core design challenge: how much structure should be embedded in the prompts that guide these LLMs to ensure a good player experience?
Researchers Vanessa Figueiredo and David Elumeze from ExplorAI and the Department of Computer Science at the University of Regina, Canada, delved into this question with their paper, Symbolically Scaffolded Play: Designing Role-Sensitive Prompts for Generative NPC Dialogue. Their work challenges the common assumption that more detailed and constrained prompts automatically lead to better gameplay.
The Interview: A Detective Game for Research
To investigate, the team developed “The Interview,” a voice-based detective game powered by three GPT-4o NPCs. Players take on the role of a detective candidate, interrogating two suspects (Sarah and Mark) while being observed by an Interviewer, who also acts as a quest-giver. This setup allowed the researchers to test different prompting strategies in a realistic game environment.
Usability Study: What Players Actually Notice
The first phase involved a usability study with 10 participants. Players experienced two versions of the game: one with High-Constraint Prompts (HCP), which included detailed symbolic scaffolds and explicit rules for NPC behavior, and another with Low-Constraint Prompts (LCP), offering minimal guidance and more room for improvisation. Surprisingly, the study found no significant experiential differences between the two prompt types. Players were more sensitive to surface-level issues like latency or technical breakdowns rather than the underlying sophistication of the prompts. This suggested that hidden refinements in prompt design often go unnoticed by players.
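To make the two conditions concrete, here is a minimal sketch of what a high-constraint versus a low-constraint system prompt for a suspect NPC might look like. These prompts are hypothetical illustrations, not the prompts used in the study; the character details and rules are invented for the example.

```python
# Illustrative contrast between a high-constraint (HCP) and a
# low-constraint (LCP) system prompt for a suspect NPC.
# NOTE: hypothetical examples, not the study's actual prompts.

HIGH_CONSTRAINT_PROMPT = """You are Sarah, a suspect in a theft investigation.
Rules:
1. Never admit guilt directly.
2. If asked about your whereabouts, always mention your alibi (the cafe).
3. Keep every reply under 40 words.
4. If the detective becomes aggressive, respond defensively but stay polite.
5. Do not volunteer information about Mark unless directly asked."""

LOW_CONSTRAINT_PROMPT = """You are Sarah, a suspect in a theft investigation.
You have an alibi, but you are nervous. Improvise your answers in character."""

def build_messages(system_prompt: str, player_utterance: str) -> list[dict]:
    """Assemble a chat-completion message list for either prompt condition."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": player_utterance},
    ]

msgs = build_messages(LOW_CONSTRAINT_PROMPT, "Where were you last night?")
```

The only difference between conditions is the system prompt; the surrounding game loop stays identical, which is what lets a study attribute experiential differences (or their absence) to the scaffold itself.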
Synthetic Evaluation: Role-Dependent Scaffolding
Guided by these findings, the researchers redesigned the HCP into a hybrid JSON+RAG (Retrieval-Augmented Generation) scaffold. This new architecture combined structured JSON schemas with a retrieval pipeline to ground dialogue in external knowledge. A synthetic evaluation, using an LLM judge, was then conducted to stress-test these scaffolding strategies at scale. This revealed a crucial insight: the effectiveness of scaffolding is highly dependent on the NPC’s role.
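The hybrid idea can be sketched as follows: retrieved facts ground the dialogue, and a JSON schema constrains the output format. This is a minimal illustration assuming a toy keyword-overlap retriever and an invented response schema; the paper's actual schema and retrieval pipeline are not reproduced here.

```python
# Minimal sketch of a hybrid JSON+RAG scaffold: retrieve grounding facts,
# then wrap them in a prompt that demands schema-conformant JSON output.
# The knowledge base, schema, and retriever are hypothetical examples.

import json

# Hypothetical case facts grounding the Interviewer's dialogue.
KNOWLEDGE_BASE = [
    "The theft occurred between 9 and 11 pm on Friday.",
    "Sarah claims she was at a cafe during the theft.",
    "Mark's fingerprints were found near the display case.",
]

RESPONSE_SCHEMA = {
    "type": "object",
    "properties": {
        "npc": {"type": "string"},
        "dialogue": {"type": "string"},
        "intent": {"enum": ["guide", "observe", "evaluate"]},
    },
    "required": ["npc", "dialogue", "intent"],
}

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank knowledge snippets by naive keyword overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(
        KNOWLEDGE_BASE,
        key=lambda s: len(q & set(s.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(player_utterance: str) -> str:
    """Combine retrieved facts with the JSON schema the model must follow."""
    facts = "\n".join(f"- {f}" for f in retrieve(player_utterance))
    return (
        "You are the Interviewer. Ground your reply in these facts:\n"
        f"{facts}\n\n"
        "Respond ONLY with JSON matching this schema:\n"
        f"{json.dumps(RESPONSE_SCHEMA)}"
    )

prompt = build_prompt("What time did the theft happen?")
```

In a production pipeline the keyword overlap would typically be replaced by embedding similarity, and the schema enforced via the model provider's structured-output mode rather than prompt text alone.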
For the Interviewer, who serves as a rule-enforcer and narrative anchor, the JSON+RAG scaffold proved beneficial, leading to more stable and predictable outputs. This consistency is vital for a quest-giver NPC, where contradictions could undermine trust and game progression. However, for the suspect NPCs (Sarah and Mark), who rely on improvisation and surprise to maintain believability in their alibis, the rigid JSON+RAG scaffold actually reduced variation and relevance, making their dialogue feel less spontaneous and believable.
Symbolically Scaffolded Play: A New Framework
These role-specific trade-offs led to the introduction of “Symbolically Scaffolded Play.” This framework extends fuzzy–symbolic scaffolding by proposing that symbolic structures should act as fuzzy, numerical boundaries. This means scaffolds should stabilize coherence precisely where breakdowns would disrupt believability (e.g., for a quest-giver) while preserving improvisational freedom where surprise and variation are essential for engagement (e.g., for suspects).
The framework suggests that NPC behavior can be defined by fuzzy-logic ranges (numerical values between 0.0 and 1.0) that dynamically adjust based on player input and are stored in shared memory. For instance, an Interviewer’s “guidance intensity” might increase if players struggle to gather evidence, while a Suspect’s “evasiveness” might decrease with rapport-building inputs.
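This dynamic could be sketched as follows. The trait names, starting values, and adjustment deltas here are hypothetical; the point is only that each behavioral trait lives in a clamped [0.0, 1.0] range inside a memory structure shared across NPCs, and shifts in response to player input.

```python
# Minimal sketch of fuzzy behavioral ranges, assuming hypothetical
# trait names and adjustment rules. Values are clamped to [0.0, 1.0]
# and held in a shared dict standing in for the "shared memory".

from dataclasses import dataclass, field

def clamp(x: float) -> float:
    """Keep a trait value inside the fuzzy range [0.0, 1.0]."""
    return max(0.0, min(1.0, x))

@dataclass
class NPCState:
    traits: dict[str, float] = field(default_factory=dict)

    def adjust(self, trait: str, delta: float) -> float:
        """Shift a trait in response to player input, staying in range."""
        self.traits[trait] = clamp(self.traits.get(trait, 0.5) + delta)
        return self.traits[trait]

# Shared memory visible to all NPCs in the scene.
shared_memory = {
    "interviewer": NPCState({"guidance_intensity": 0.3}),
    "sarah": NPCState({"evasiveness": 0.8}),
}

# Player struggles to gather evidence -> the Interviewer guides more.
shared_memory["interviewer"].adjust("guidance_intensity", +0.2)

# Player builds rapport -> Sarah becomes less evasive.
shared_memory["sarah"].adjust("evasiveness", -0.3)
```

At generation time, such values would be interpolated into each NPC's prompt (e.g., "your evasiveness is 0.5 on a 0-1 scale"), letting the same scaffold tighten or loosen per role without rewriting the prompt wholesale.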
Implications for Game Design and Beyond
The research offers three key design imperatives:
- Design for perceptibility: Prompt refinements only matter if players can feel their impact on interaction quality.
- Balance freedom and constraint: Overly rigid prompts can stifle improvisation, while too little structure risks incoherence. Hybrid, role-tuned scaffolds are key.
- Reposition usability testing: Combining player-centered usability studies with synthetic evaluations provides a comprehensive view of how scaffolds affect experience.
Ultimately, “Symbolically Scaffolded Play” reframes the evaluation of generative AI in games. It moves beyond simply asking if LLMs can produce coherent dialogue to exploring how scaffolding can be strategically designed to make coherence and creativity truly meaningful and engaging for players, not just in games but in other interactive AI systems like tutoring or social simulations.


