LLMs Mimic Human Purchase Intent with Semantic Similarity Rating


The Engineer

15 Oct 2025

Researchers at PyMC Labs and Colgate-Palmolive have used large language models to build Semantic Similarity Rating, a method that estimates purchase intent from free-text answers and reaches 90% of human test–retest reliability.

Researchers from PyMC Labs and Colgate-Palmolive have developed Semantic Similarity Rating (SSR), a method for eliciting realistic purchase intent ratings from large language models (LLMs). The approach targets the limitations of traditional survey methods: instead of asking for a number, it has the LLM answer in free text and then maps that text to a distribution over Likert scale points. According to their recent paper, SSR achieves 90% of human test–retest reliability while maintaining realistic response distributions (KS similarity > 0.85).

Why It Matters

Consumer research is a costly but essential part of product development, with companies spending billions annually to gather insights from consumer panels. However, these surveys often suffer from biases such as satisficing (respondents providing quick and easy answers), acquiescence (tendency to agree with statements), and positivity bias (overly positive responses). SSR offers a promising alternative by using LLMs to simulate synthetic consumers that provide both quantitative ratings and qualitative feedback.

How SSR Works

Textual Responses from LLMs: Instead of directly asking the model for numerical ratings, which can lead to unrealistic distributions, SSR prompts the LLM to generate textual responses. For example, the model might be asked, "How likely are you to buy this product?" and respond with a sentence like "I would definitely consider it."
Embedding Similarity: These textual responses are then converted into numerical ratings using embedding similarity. The researchers created reference statements for each point on the Likert scale (e.g., "Definitely not," "Probably not," "Not sure," "Probably yes," "Definitely yes"). The model's response is embedded and compared to each reference statement, giving one similarity score per scale point.
Mapping to Likert Distributions: The similarity scores are then turned into a distribution over the scale points rather than a single nearest-match rating, so that the aggregate of many synthetic responses can closely match the distribution of human responses. Because each rating is derived from the model's own text, it also comes with qualitative feedback explaining it.
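The three steps above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: `toy_embed` is a hypothetical bag-of-words stand-in for a real sentence-embedding model, and the normalization in `likert_pmf` (subtract the smallest similarity, then rescale to sum to 1) is one plausible way to turn similarities into a distribution, which the article does not spell out.

```python
import math
from collections import Counter

# Reference statements for a 5-point Likert scale (the article's examples).
REFERENCES = [
    "Definitely not",
    "Probably not",
    "Not sure",
    "Probably yes",
    "Definitely yes",
]


def toy_embed(text):
    """Hypothetical stand-in for a sentence-embedding model: word counts."""
    cleaned = text.lower().replace(",", " ").replace(".", " ")
    return Counter(cleaned.split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def likert_pmf(response, references=REFERENCES, embed=toy_embed):
    """Map a free-text response to a distribution over Likert points.

    Assumed normalization: shift similarities so the smallest is 0,
    then divide by the total so the values sum to 1.
    """
    r = embed(response)
    sims = [cosine(r, embed(ref)) for ref in references]
    shifted = [s - min(sims) for s in sims]
    total = sum(shifted)
    if total == 0:  # all references equally similar: fall back to uniform
        return [1.0 / len(sims)] * len(sims)
    return [s / total for s in shifted]


def expected_rating(pmf):
    """Mean rating on a 1..K scale implied by the distribution."""
    return sum((i + 1) * p for i, p in enumerate(pmf))


pmf = likert_pmf("Probably yes, I would buy it")
print([round(p, 2) for p in pmf], round(expected_rating(pmf), 2))
```

With the toy embedding, the example response puts about half of its weight on "Probably yes" and splits the rest between the two neighbors that share a word with it. A real embedding model would capture meaning rather than word overlap, which is the point of using one.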

Testing SSR

The researchers tested SSR on an extensive dataset of 57 personal care product surveys conducted by Colgate-Palmolive, involving 9,300 human responses. The results were impressive:

Reliability: SSR achieved 90% of the test–retest reliability observed in human responses. In other words, the synthetic ratings are nearly as consistent as human panels are when the same survey is repeated.
Realism: The response distributions generated by SSR had a Kolmogorov-Smirnov (KS) similarity score greater than 0.85 compared to human data. This indicates that the synthetic ratings closely mimic the variability and distribution of actual human responses.
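To make the realism metric concrete, here is a minimal sketch of a KS-style similarity between two Likert distributions. It assumes that "KS similarity" means one minus the Kolmogorov-Smirnov distance, i.e. one minus the largest gap between the two cumulative distributions; the article does not define the term, and the two distributions below are made-up illustrative numbers, not data from the study.

```python
from itertools import accumulate


def ks_similarity(p, q):
    """1 - max |CDF_p - CDF_q| for two distributions over the same scale.

    Assumed definition: the article reports a 'KS similarity' but does
    not give its formula.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same scale points")
    cdf_p = accumulate(p)
    cdf_q = accumulate(q)
    return 1.0 - max(abs(a - b) for a, b in zip(cdf_p, cdf_q))


# Hypothetical shares of responses at ratings 1..5 (illustrative only).
human = [0.05, 0.10, 0.25, 0.40, 0.20]
synthetic = [0.02, 0.08, 0.30, 0.45, 0.15]

print(round(ks_similarity(human, synthetic), 2))
```

Under this definition a score of 1.0 means identical distributions, and a threshold of 0.85 means the cumulative shares never differ by more than 15 percentage points at any scale point.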

Implementation Details

Model Architecture: The researchers used state-of-the-art LLMs, such as GPT-4, to generate textual responses. These models are pre-trained on large datasets and fine-tuned for specific tasks.
Embedding Techniques: For embedding similarity, the researchers utilized techniques like BERT or Sentence-BERT, which are known for their effectiveness in capturing semantic meaning.
Benchmarking: The performance of SSR was benchmarked against traditional survey methods using metrics such as test–retest reliability and KS similarity. These benchmarks ensure that the synthetic ratings are not only reliable but also realistic.
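One way to read a figure like "90% of human test–retest reliability" is as a ratio of two correlations, sketched below. This is an assumption about how such a number could be computed, not the paper's stated procedure: the synthetic-versus-human correlation of per-survey mean ratings is divided by the correlation between two human runs of the same surveys. All values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def reliability_ratio(synthetic, human_test, human_retest):
    """Synthetic-vs-human correlation as a fraction of human test-retest.

    Assumed definition for illustration only.
    """
    return pearson(synthetic, human_test) / pearson(human_test, human_retest)


# Hypothetical mean purchase-intent ratings for four surveys.
human_test = [3.1, 3.5, 3.9, 4.2]
human_retest = [3.0, 3.6, 3.8, 4.3]
synthetic = [3.3, 3.4, 4.0, 4.1]

print(round(reliability_ratio(synthetic, human_test, human_retest), 2))
```

The human test–retest correlation acts as a ceiling here: a repeated human survey does not agree perfectly with itself, so a synthetic panel is judged against that noise level rather than against a perfect correlation of 1.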

Practical Implications

The development of SSR has several practical implications for consumer research:

Scalability: Companies can use LLMs to simulate large numbers of synthetic consumers, making it possible to conduct extensive surveys at a fraction of the cost and time required for traditional methods.
Rich Feedback: In addition to numerical ratings, SSR provides qualitative feedback that can offer deeper insights into consumer preferences and behaviors.
Bias Mitigation: Synthetic respondents may be less prone to the satisficing, acquiescence, and positivity biases that affect human panels, which could lead to more accurate and reliable data.

Conclusion

The Semantic Similarity Rating (SSR) method offers a scalable alternative to traditional survey methods that, on the reported benchmarks, is both reliable and realistic. The framework preserves the metrics and interpretability of human surveys and adds rich qualitative feedback to them. As companies continue to invest in product development, SSR could become a valuable tool for gathering accurate and actionable consumer insights.