TLDR: A new study introduces the “Creativity Benchmark,” an evaluation framework for LLMs in marketing creativity. It draws on pairwise preferences from 678 practicing creatives, covering 100 brands and three prompt types (Insights, Ideas, Wild Ideas). The findings show tightly clustered LLM performance with no single dominant model, weak correlation between LLM-as-judge and human rankings, and that conventional creativity tests don’t fully transfer to brand-constrained tasks. The research emphasizes the need for expert human evaluation and diversity-aware workflows, recommending LLMs as idea accelerators rather than arbiters of creative value.
A new study introduces Creativity Benchmark, a framework for evaluating how creative large language models (LLMs) are at marketing tasks. It aims to give a clearer picture of LLMs’ ability to generate marketing insights and ideas, a task that has traditionally been difficult to assess because of its open-ended nature.
The researchers behind this work, Ninad Bhat, Kieran Browne, and Pip Bingemann from Springboards.ai, recognized that existing benchmarks often don’t fully capture the nuances of marketing creativity. Many current evaluation methods focus on general creative writing or rely on LLMs to judge other LLMs, which can introduce biases and may not align with what human marketing experts truly value.
Understanding Marketing Creativity in LLMs
The Creativity Benchmark was designed with real-world marketing and advertising practices in mind. It covers 100 different brands across 12 categories and uses three distinct prompt types: Insights, Ideas, and Wild Ideas. Insights are concise, surprising observations; Ideas are platformable campaign concepts; and Wild Ideas are unconventional, provocative campaign concepts. This broad scope helps ensure the benchmark reflects the diverse challenges faced by marketing professionals.
To ground the evaluation in expert judgment, the study gathered preferences from 678 practicing creatives, who made 11,012 anonymous pairwise comparisons of LLM-generated outputs. These comparisons were then analyzed with a statistical model for pairwise preferences to rank the models.
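The summary doesn’t spell out which statistical model was used, but the standard choice for ranking items from pairwise preferences is a Bradley-Terry model. The sketch below assumes that approach; the model names and win counts are illustrative placeholders, not the study’s data.

```python
# A minimal Bradley-Terry fit on pairwise preference counts (illustrative data).
import numpy as np

models = ["model_a", "model_b", "model_c"]
# wins[i, j] = number of times model i's output was preferred over model j's
wins = np.array([
    [0, 12, 15],
    [10, 0, 11],
    [9, 13, 0],
], dtype=float)

strengths = np.ones(len(models))      # Bradley-Terry strength parameters
for _ in range(200):                  # simple iterative (MM) updates
    total = wins + wins.T             # comparisons played between each pair
    new = np.empty_like(strengths)
    for i in range(len(models)):
        denom = sum(
            total[i, j] / (strengths[i] + strengths[j])
            for j in range(len(models)) if j != i
        )
        new[i] = wins[i].sum() / denom
    strengths = new / new.sum()       # normalise to fix the overall scale

def win_prob(i, j):
    """Probability that model i beats model j in a head-to-head comparison."""
    return strengths[i] / (strengths[i] + strengths[j])

for i, name in enumerate(models):
    print(f"{name}: strength = {strengths[i]:.3f}")
print(f"P(model_a beats model_c) = {win_prob(0, 2):.3f}")
```

When the fitted strengths are tightly clustered, head-to-head win probabilities stay close to 50% even between the best- and worst-ranked models, which is exactly the pattern the study reports.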
Key Findings on Model Performance
One of the most striking findings is that LLM performance in marketing creativity is quite tightly clustered. No single model consistently dominated across all brands or prompt types. The difference between the highest-rated and lowest-rated models was relatively small, meaning the top model would only win about 61% of head-to-head comparisons against the lowest-ranked one. This suggests that even lower-ranked LLMs can produce competitive creative outputs.
Overall, DeepSeek Chat, Claude 3.7 Sonnet, and DeepSeek Reasoner emerged as strong performers. However, their rankings shifted depending on the prompt type. For instance, GPT-4o and GPT-4.5 Preview performed well on Insights, while Claude 3.7 Sonnet excelled in Wild Ideas. The study also noted some geographic variations in preferences, with Australian and UK respondents favoring DeepSeek Chat, and US respondents leaning towards OpenAI o3.
The implication for practitioners is clear: since the performance gaps are small, other factors like cost, latency, ease of integration, and how well a model can be controlled for brand voice might be more important than slight differences in average creative ranking.
The Importance of Diversity
Beyond just quality, the study also looked at model diversity – how varied and distinct the ideas generated by an LLM are for the same prompt. A model that produces many similar ideas, even if good, is less useful than one that offers a wide range of options.
Gemini 2.5 Pro Preview and Claude 3.7 Sonnet consistently showed the highest intra-model diversity, meaning they offered the broadest assortment of alternatives for a given prompt. Conversely, models like Grok 3 Beta and the LLaMA baselines tended to repeat themselves more often. The study also found that simply reframing a prompt, for example, asking for “Wild Ideas” instead of just “Ideas,” reliably increased the divergence and novelty of the outputs.
This highlights that combining models with high inter-model diversity (models that produce very different types of ideas from each other) can significantly expand the range of options without a large increase in cost. For example, pairing Mistral with Gemini 1.5 Pro could provide complementary idea sets.
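The summary doesn’t specify how diversity was scored. One common approach is to embed each generated idea and take the mean pairwise cosine distance among a model’s outputs for the same prompt; the sketch below assumes that approach, with a placeholder embedding model and made-up ideas rather than the study’s setup.

```python
# Illustrative intra-model diversity score: mean pairwise cosine distance
# between embeddings of one model's ideas for a single prompt.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding library

def intra_model_diversity(ideas: list[str], encoder: SentenceTransformer) -> float:
    """Higher values mean the ideas are more semantically spread out."""
    embs = encoder.encode(ideas, normalize_embeddings=True)
    sims = embs @ embs.T                   # cosine similarity matrix
    iu = np.triu_indices(len(ideas), k=1)  # unique unordered pairs
    return float(np.mean(1.0 - sims[iu]))  # average cosine distance

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model choice
ideas = [
    "A pop-up store that only opens during rainstorms.",
    "A loyalty card printed on seed paper that customers can plant.",
    "A rainy-day pop-up shop with free umbrella giveaways.",
]
print(f"intra-model diversity: {intra_model_diversity(ideas, encoder):.3f}")
```

The same distance, computed between ideas from two different models, gives a rough read on inter-model diversity when deciding which models to pair.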
LLMs as Judges: Not Quite There Yet
The research also investigated the effectiveness of using LLMs to judge other LLM outputs, a common practice to save time and cost. Three different LLM judges (GPT-4o, Claude 4 Sonnet, and Gemini 2.0 Flash) were tested with various judging prompts. The findings were cautionary: LLM-as-judge systems showed weak and inconsistent correlations with human expert judgments. They often displayed their own stable, judge-specific preferences and sometimes unwarranted confidence.
This means that automated judges cannot reliably substitute for human evaluation in marketing creativity. They can serve as a coarse filter, but critical decisions still require expert human oversight.
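One way to run the agreement check described above is to compare a judge’s model ranking with the human-derived ranking using a rank correlation. The sketch below uses SciPy’s Spearman and Kendall statistics on made-up rankings, purely to illustrate the mechanics.

```python
# Illustrative agreement check between an LLM judge and human rankings.
from scipy.stats import kendalltau, spearmanr

human_rank = [1, 2, 3, 4, 5]   # rank of each model from human pairwise preferences
judge_rank = [2, 1, 5, 3, 4]   # rank assigned by an LLM judge (hypothetical)

rho, rho_p = spearmanr(human_rank, judge_rank)
tau, tau_p = kendalltau(human_rank, judge_rank)
print(f"Spearman rho = {rho:.2f} (p = {rho_p:.2f})")
print(f"Kendall tau  = {tau:.2f} (p = {tau_p:.2f})")
# Weak or unstable correlations across judges and judging prompts are the
# signal that the automated judge is not a reliable stand-in for experts.
```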
Conventional Creativity Tests Don’t Fully Transfer
Finally, the study adapted standard human creativity tests, such as the Torrance Tests of Creative Thinking, to evaluate LLMs on general creative abilities outside the marketing domain. These tests typically measure fluency, flexibility, originality, and elaboration.
However, the results showed weak or inconsistent correlations between performance on these conventional tests and human preferences in brand-focused marketing tasks. This suggests that what constitutes “creativity” in an abstract test doesn’t always translate directly to perceived value in a specific marketing context. Domain-specific human ratings remain crucial for applied evaluation.
Conclusion and Recommendations
The Creativity Benchmark provides valuable insights into the current state of LLMs in marketing creativity. The key takeaways are:
- No single LLM is a clear winner; performance is tightly clustered.
- LLMs are not reliable judges of creative work and cannot replace human experts.
- Diversity in ideas is crucial, and some models excel at generating a wider range of options.
- Prompt framing can significantly influence the novelty of LLM outputs.
For marketing practitioners, the study offers practical advice: choose LLMs based on factors like brand-voice control, workflow integration, cost, and team preference rather than chasing small rank differences; use LLMs to expand the pool of ideas while keeping human judgment for the final selection; and experiment with different prompt framings to unlock more diverse and unconventional ideas. The full research paper can be found here.


