TLDR: A new research benchmark evaluates how Large Language Models (LLMs) can discover and utilize unspoken, or “latent,” user preferences through multi-turn conversations. The benchmark includes tasks like a 20 Questions game, personalized question answering, and text summarization. Findings show that while LLMs can infer these hidden preferences, their success varies significantly based on task complexity, topic, and the number of preferences, highlighting that effective personalization remains an open challenge for building truly adaptive AI systems.
Large Language Models (LLMs) have become incredibly adept at generating text that is broadly relevant across many fields, from healthcare to code generation. However, this very generality can become a hurdle when it comes to personalizing interactions. Imagine asking an LLM for restaurant recommendations or travel plans; users rarely spell out every single preference, leaving much of what they care about “latent” or hidden, waiting to be inferred.
This challenge forms the core of a new research paper titled “Do LLMs Recognize Your Latent Preferences? A Benchmark for Latent Information Discovery in Personalized Interaction” by Ioannis Tsaknakis, Bingqing Song, Shuyu Gan, Dongyeop Kang, Alfredo Garcia, Gaowen Liu, Charles Fleming, and Mingyi Hong. The paper introduces a unified benchmark designed to evaluate how well LLMs can uncover and utilize these hidden user attributes through multi-turn conversations.
The Challenge of Latent Information Discovery
The researchers highlight that while LLMs can produce accurate and coherent responses, these often lack the specific personalization that leads to true user satisfaction. User-specific information is frequently unstated, contextual, or even subconscious. For instance, in a game like 20 Questions, success depends on asking strategic questions to reveal hidden information. The paper draws an analogy: can an LLM play this game with a user’s unstated preferences, asking the right questions to infer what the user values and tailoring its final answer accordingly?
The paper proposes that effective personalization requires “latent information discovery,” moving beyond scenarios where users explicitly state all preferences upfront or the LLM simply requests them. Instead, the most natural and powerful approach is “interactive elicitation,” where the LLM actively asks targeted questions, interprets answers, and adapts its response over multiple turns.
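To make this concrete, here is a minimal sketch of how interactive elicitation could be prompted. The system prompt wording and the use of the OpenAI Python client are illustrative assumptions rather than the paper’s implementation; GPT-4o-mini is simply one of the models the benchmark evaluates.

```python
# Minimal sketch of prompting for interactive elicitation (illustrative,
# not the paper's actual prompt or code). Assumes the `openai` Python client.
from openai import OpenAI

client = OpenAI()

ELICITATION_PROMPT = (
    "You are a helpful assistant. The user has preferences they have not "
    "stated. Before giving a final answer, ask one short, targeted question "
    "per turn to uncover what they care about. When confident, reply with a "
    "final answer prefixed with 'FINAL:'."
)

def assistant_turn(history: list[dict]) -> str:
    """One assistant turn: either a clarifying question or a final answer."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # one of the models evaluated in the paper
        messages=[{"role": "system", "content": ELICITATION_PROMPT}] + history,
    )
    return response.choices[0].message.content
```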
A Unified Benchmark with Three Tasks
To systematically evaluate this capability, the benchmark employs a consistent tri-agent framework: a User (with hidden preferences), an Assistant (the LLM being evaluated), and a Judge (which assesses how well the Assistant’s response aligns with the User’s preferences). The benchmark spans three progressively realistic settings, with a minimal sketch of the evaluation loop after the list:
- 20 Questions Game: This foundational task isolates the pure reasoning process of latent information discovery. The Assistant must guess a hidden object by asking yes-or-no questions, mimicking the process of uncovering hidden preferences.
- Personalized Question Answering (PQA): This task involves goal-oriented dialogue where the Assistant must provide a personalized answer to a user’s question, aligning with one to three latent preferences (e.g., dietary restrictions for restaurant recommendations).
- Personalized Text Summarization (PTS): Here, the Assistant generates a summary of a given text, guided by the user’s specific summarization preferences (e.g., concise summary, focus on numerical results).
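Under the assumption that each agent is a callable wrapping an LLM with a role-specific prompt, the tri-agent loop can be sketched as follows. The stopping convention and agent internals here are illustrative, not the paper’s exact harness.

```python
# Hedged sketch of the tri-agent evaluation loop (User / Assistant / Judge).
# Each agent is a placeholder callable that would wrap a role-prompted LLM.
from typing import Callable

Agent = Callable[[list[dict]], str]  # maps dialogue history -> next message

def run_episode(user: Agent, assistant: Agent, judge: Agent,
                max_turns: int = 20) -> tuple[bool, int]:
    """Run one multi-turn episode; return (success, stop_turn)."""
    history: list[dict] = []
    for turn in range(1, max_turns + 1):
        # Assistant either asks a clarifying question or proposes an answer.
        history.append({"role": "assistant", "content": assistant(history)})
        # Passive user: responds only when prompted, volunteers nothing extra.
        history.append({"role": "user", "content": user(history)})
        # Judge checks whether ALL latent preferences are now satisfied.
        if judge(history) == "success":
            return True, turn
    return False, max_turns
```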
The evaluation measures two key metrics: Success Rate (the proportion of instances in which all latent preferences are met) and Average Stop Turn (the average number of turns the Assistant needs to reach success, a measure of efficiency). The benchmark focuses on a “passive-user” setting, where the user only responds when prompted, mirroring real-world interactions in which users don’t always volunteer all relevant context.
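Given episode outcomes of the form produced by the loop sketched above, both metrics reduce to simple aggregates. This assumes each episode yields a `(success, stop_turn)` pair.

```python
# Success Rate and Average Stop Turn over a batch of episodes, assuming
# each result is a (success, stop_turn) pair as in the sketch above.

def success_rate(results: list[tuple[bool, int]]) -> float:
    """Fraction of episodes in which all latent preferences were met."""
    return sum(ok for ok, _ in results) / len(results)

def average_stop_turn(results: list[tuple[bool, int]]) -> float:
    """Mean turn at which successful episodes stopped (efficiency)."""
    turns = [t for ok, t in results if ok]
    return sum(turns) / len(turns) if turns else float("nan")
```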
Key Findings and Insights
The experiments, involving both closed-source models (GPT-4o-mini, Claude-3.5-Haiku) and open-source models (Mistral-7B-Instruct, Qwen2.5-7B-Instruct), revealed significant variation in performance, with success rates ranging from 32% to 98%. Here are some main observations:
- Context Matters: Success varies dramatically with task complexity, topic, and the number of hidden attributes. For example, in the 20 Questions game, performance drops sharply when the topic is unknown.
- Topic-Specific Difficulty: In Personalized Question Answering, success rates can differ by as much as 70% across topics, indicating that some domains are inherently harder for LLMs to navigate for personalization.
- Efficiency: While models can elicit preferences, effective discovery often occurs within the first few exchanges, with average stop-turn values remaining modest in most settings.
- Model Performance: Closed-source models generally lead, but open-source models like Qwen2.5-7B-Instruct are often competitive, sometimes even more efficient in specific scenarios.
Understanding Errors
An error analysis categorized failures into “process errors” (during interaction) and “result errors” (in the final output). The most common mistakes were “Preference Reinforcement Error” (uncovering a preference but later neglecting it) and “Preference Dilution Error” (acknowledging preferences but applying them only partially). These highlight a need for LLMs to better maintain and operationalize user preferences throughout a conversation.
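One way a harness might tag such failures is with a small taxonomy like the following. The error names follow the paper; encoding them as enums split by phase is an illustrative assumption, not the paper’s implementation.

```python
# Illustrative encoding of the failure taxonomy; the names follow the paper,
# the enum structure is an assumption about how a harness might tag errors.
from enum import Enum

class ErrorPhase(Enum):
    PROCESS = "error during the interaction"
    RESULT = "error in the final output"

class ErrorType(Enum):
    PREFERENCE_REINFORCEMENT = "uncovered a preference but later neglected it"
    PREFERENCE_DILUTION = "acknowledged preferences but applied them partially"
```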
The Path Forward
The research concludes that while LLMs possess the ability to uncover latent preferences, there’s significant room for improvement, especially for weaker models and more challenging topics. This benchmark provides a crucial foundation for studying latent information discovery, a skill essential for building truly adaptive and user-centered AI systems. For more details, see the full research paper.