TLDR: A new method called Semantic Similarity Rating (SSR) allows Large Language Models (LLMs) to accurately simulate human purchase intent. Instead of giving direct numerical ratings, LLMs produce free-text responses that are then mapped to Likert scales via semantic similarity. The approach achieves 90% of human test-retest reliability and realistic response distributions, offering scalable, cost-effective consumer research with rich qualitative feedback, and it mirrors human response patterns particularly well across age and income demographics.
Consumer research is a cornerstone for companies developing new products, guiding crucial decisions before significant investments in production and launch. However, traditional surveys, which cost companies billions annually, grapple with limitations such as panel biases and difficulty scaling. The emergence of large language models (LLMs) has opened a new avenue, offering the potential to simulate synthetic consumers and change how companies gather insights.
Initially, using LLMs for consumer research presented a significant challenge: when directly asked for numerical ratings, like on a Likert scale (e.g., 1 to 5 for purchase intent), LLMs tended to produce unrealistic response distributions. These distributions were often too narrow, systematically skewed, or simply inconsistent with actual human survey data. This raised questions about the fundamental suitability of LLMs as survey respondents.
A recent research paper, titled “LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings,” introduces a novel method called Semantic Similarity Rating (SSR) that addresses this very issue. The authors argue that the problem isn’t with LLMs themselves, but with the method used to elicit their responses. Instead of asking for a direct number, SSR prompts LLMs to generate free-text statements expressing their purchase intent. These textual responses are then mapped to Likert distributions by comparing their semantic similarity to predefined reference statements using embedding technology.
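The mapping step described above can be sketched in a few lines. The sketch below is illustrative, not the paper's implementation: the five reference statements are hypothetical examples, and the `embed` function is a toy character-bigram stand-in for the pretrained sentence-embedding model a real SSR pipeline would use.

```python
import numpy as np

# Hypothetical reference statements anchoring Likert points 1-5
# (illustrative wording, not taken from the paper).
REFERENCES = [
    "I would definitely not buy this product.",
    "I would probably not buy this product.",
    "I might or might not buy this product.",
    "I would probably buy this product.",
    "I would definitely buy this product.",
]

def embed(text: str) -> np.ndarray:
    """Toy stand-in for a sentence-embedding model: a normalized
    character-bigram count vector. A real pipeline would call a
    pretrained embedding model here."""
    vec = np.zeros(26 * 26)
    letters = [c for c in text.lower() if c.isalpha()]
    for a, b in zip(letters, letters[1:]):
        vec[(ord(a) - 97) * 26 + (ord(b) - 97)] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def ssr_distribution(response: str, temperature: float = 20.0) -> np.ndarray:
    """Map a free-text response to a probability distribution over the
    five Likert points by cosine similarity to the reference statements,
    then a softmax (temperature sharpens or flattens the result)."""
    r = embed(response)
    sims = np.array([r @ embed(ref) for ref in REFERENCES])
    logits = temperature * sims
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

dist = ssr_distribution("I love the scent and would definitely buy this.")
expected_rating = float(np.dot(dist, np.arange(1, 6)))
```

Keeping the full distribution per respondent, rather than collapsing it to a single integer, is what lets the aggregated synthetic responses reproduce realistic Likert spreads instead of piling onto one or two values.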
The effectiveness of the SSR method was rigorously tested on an extensive dataset: 57 personal care product surveys, originally conducted by a leading corporation in that market, totaling 9,300 human responses. The results were highly encouraging. SSR achieved 90% of human test-retest reliability, meaning the synthetic consumers' responses were about as consistent as human responses would be if the same survey were repeated. The method also maintained realistic response distributions, with a Kolmogorov–Smirnov (KS) similarity greater than 0.85, indicating strong alignment with human data patterns.
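For intuition on the distribution-alignment metric, the sketch below computes a KS-style similarity between two discrete rating distributions. Interpreting "KS similarity" as 1 minus the Kolmogorov–Smirnov statistic (the maximum gap between cumulative distributions) is an assumption for illustration; the example distributions are made up.

```python
import numpy as np

def ks_similarity(p: np.ndarray, q: np.ndarray) -> float:
    """1 minus the Kolmogorov-Smirnov statistic, i.e. one minus the
    largest absolute gap between the two cumulative distributions.
    (Assumed interpretation of the paper's 'KS similarity' metric.)"""
    cdf_p = np.cumsum(p / p.sum())
    cdf_q = np.cumsum(q / q.sum())
    return 1.0 - float(np.max(np.abs(cdf_p - cdf_q)))

# Hypothetical human vs. synthetic shares of Likert ratings 1-5.
human = np.array([0.05, 0.15, 0.30, 0.35, 0.15])
synthetic = np.array([0.04, 0.16, 0.28, 0.37, 0.15])
sim = ks_similarity(human, synthetic)  # → 0.98
```

A value near 1.0 indicates the synthetic distribution closely tracks the human one; a narrow, skewed distribution of the kind direct numerical elicitation produces would score much lower.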
Beyond quantitative metrics, SSR offers an additional significant benefit: rich qualitative feedback. The free-text responses generated by the synthetic consumers provide detailed rationales for their ratings. This qualitative data can be invaluable for product development, offering insights into appealing features, potential concerns, and underlying value propositions, which traditional human surveys often capture only minimally, if at all.
The study also explored how well synthetic consumers mirrored human behavior across different demographic attributes and product characteristics. It found that LLMs, when conditioned on demographic personas, replicated human response patterns relatively well, particularly concerning age and income level. For instance, both younger and older synthetic participants tended to rate purchase intent lower than middle-aged cohorts, a behavior observed in real human data. Similarly, synthetic consumers prompted with budgetary concerns responded with lower purchase intent, consistent with human behavior. However, the replication was less consistent for factors like gender and dwelling region, suggesting areas for further refinement.
This framework represents a significant step forward for scalable consumer research simulations. It preserves traditional survey metrics and interpretability while overcoming previous limitations of LLMs in generating realistic numerical ratings. Importantly, the SSR method requires no training data or fine-tuning on consumer responses, making it a cost-effective and widely applicable plug-and-play tool. While the method relies on carefully designed reference statements and the performance can be influenced by the choice of embedding model, it establishes a credible foundation for augmenting and accelerating consumer insight generation. For more details, you can read the full paper here: LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings.


