TLDR: A new study introduces the “Creativity Benchmark,” an evaluation framework for LLMs in marketing creativity. It draws on pairwise preferences from 678 practicing creatives, covering 100 brands and three prompt types (Insights, Ideas, Wild Ideas). The findings show tightly clustered LLM performance with no single dominant model, weak correlation between LLM-as-judge and human rankings, and that conventional creativity tests don’t fully transfer to brand-constrained tasks. The research emphasizes the need for expert human evaluation and diversity-aware workflows, recommending LLMs as idea accelerators rather than arbiters of creative value.
A new study introduces Creativity Benchmark, a framework for evaluating how creative large language models (LLMs) are at marketing tasks. It aims to give a clearer picture of LLMs’ ability to generate marketing insights and ideas, a task that has traditionally been difficult to assess because of its open-ended nature.
The researchers behind this work, Ninad Bhat, Kieran Browne, and Pip Bingemann from Springboards.ai, recognized that existing benchmarks often don’t fully capture the nuances of marketing creativity. Many current evaluation methods focus on general creative writing or rely on LLMs to judge other LLMs, which can introduce biases and may not align with what human marketing experts truly value.
Understanding Marketing Creativity in LLMs
The Creativity Benchmark was designed with real-world marketing and advertising practices in mind. It covers 100 different brands across 12 categories and uses three distinct prompt types: Insights, Ideas, and Wild Ideas. Insights are concise, surprising observations; Ideas are platformable campaign concepts; and Wild Ideas are unconventional, provocative campaign concepts. This broad scope helps ensure the benchmark reflects the diverse challenges faced by marketing professionals.
To ground the evaluation in expert judgment, the study gathered preferences from 678 practicing creatives, who made 11,012 anonymous pairwise comparisons of LLM-generated outputs. These comparisons were then analyzed with a statistical model for pairwise preferences to rank the models.
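The summary doesn’t spell out which statistical model was used, but the standard choice for ranking items from pairwise preferences is a Bradley-Terry model. The sketch below assumes that approach; the model names and win counts are illustrative placeholders, not the study’s data.

```python
# A minimal Bradley-Terry fit on pairwise preference counts (illustrative data).
import numpy as np

models = ["model_a", "model_b", "model_c"]
# wins[i, j] = number of times model i's output was preferred over model j's
wins = np.array([
    [0, 12, 15],
    [10, 0, 11],
    [9, 13, 0],
], dtype=float)

strengths = np.ones(len(models))      # Bradley-Terry strength parameters
for _ in range(200):                  # simple iterative (MM) updates
    total = wins + wins.T             # comparisons played between each pair
    new = np.empty_like(strengths)
    for i in range(len(models)):
        denom = sum(
            total[i, j] / (strengths[i] + strengths[j])
            for j in range(len(models)) if j != i
        )
        new[i] = wins[i].sum() / denom
    strengths = new / new.sum()       # normalise to fix the overall scale

def win_prob(i, j):
    """Probability that model i beats model j in a head-to-head comparison."""
    return strengths[i] / (strengths[i] + strengths[j])

for i, name in enumerate(models):
    print(f"{name}: strength = {strengths[i]:.3f}")
print(f"P(model_a beats model_c) = {win_prob(0, 2):.3f}")
```

When the fitted strengths are tightly clustered, head-to-head win probabilities stay close to 50% even between the best- and worst-ranked models, which is exactly the pattern the study reports.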
Key Findings on Model Performance
One of the most striking findings is that LLM performance in marketing creativity is quite tightly clustered. No single model consistently dominated across all brands or prompt types. The difference between the highest-rated and lowest-rated models was relatively small, meaning the top model would only win about 61% of head-to-head comparisons against the lowest-ranked one. This suggests that even lower-ranked LLMs can produce competitive creative outputs.
Overall, DeepSeek Chat, Claude 3.7 Sonnet, and DeepSeek Reasoner emerged as strong performers. However, their rankings shifted depending on the prompt type. For instance, GPT-4o and GPT-4.5 Preview performed well on Insights, while Claude 3.7 Sonnet excelled in Wild Ideas. The study also noted some geographic variations in preferences, with Australian and UK respondents favoring DeepSeek Chat, and US respondents leaning towards OpenAI o3.
The implication for practitioners is clear: since the performance gaps are small, other factors like cost, latency, ease of integration, and how well a model can be controlled for brand voice might be more important than slight differences in average creative ranking.
The Importance of Diversity
Beyond just quality, the study also looked at model diversity – how varied and distinct the ideas generated by an LLM are for the same prompt. A model that produces many similar ideas, even if good, is less useful than one that offers a wide range of options.
Gemini 2.5 Pro Preview and Claude 3.7 Sonnet consistently showed the highest intra-model diversity, meaning they offered the broadest assortment of alternatives for a given prompt. Conversely, models like Grok 3 Beta and the LLaMA baselines tended to repeat themselves more often. The study also found that simply reframing a prompt, for example, asking for “Wild Ideas” instead of just “Ideas,” reliably increased the divergence and novelty of the outputs.
This highlights that combining models with high inter-model diversity (models that produce very different types of ideas from each other) can significantly expand the range of options without a large increase in cost. For example, pairing Mistral with Gemini 1.5 Pro could provide complementary idea sets.
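The summary doesn’t specify how diversity was scored. One common approach is to embed each generated idea and take the mean pairwise cosine distance among a model’s outputs for the same prompt; the sketch below assumes that approach, with a placeholder embedding model and made-up ideas rather than the study’s setup.

```python
# Illustrative intra-model diversity score: mean pairwise cosine distance
# between embeddings of one model's ideas for a single prompt.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding library

def intra_model_diversity(ideas: list[str], encoder: SentenceTransformer) -> float:
    """Higher values mean the ideas are more semantically spread out."""
    embs = encoder.encode(ideas, normalize_embeddings=True)
    sims = embs @ embs.T                   # cosine similarity matrix
    iu = np.triu_indices(len(ideas), k=1)  # unique unordered pairs
    return float(np.mean(1.0 - sims[iu]))  # average cosine distance

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model choice
ideas = [
    "A pop-up store that only opens during rainstorms.",
    "A loyalty card printed on seed paper that customers can plant.",
    "A rainy-day pop-up shop with free umbrella giveaways.",
]
print(f"intra-model diversity: {intra_model_diversity(ideas, encoder):.3f}")
```

The same distance, computed between ideas from two different models, gives a rough read on inter-model diversity when deciding which models to pair.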
LLMs as Judges: Not Quite There Yet
The research also investigated the effectiveness of using LLMs to judge other LLM outputs, a common practice to save time and cost. Three different LLM judges (GPT-4o, Claude 4 Sonnet, and Gemini 2.0 Flash) were tested with various judging prompts. The findings were cautionary: LLM-as-judge systems showed weak and inconsistent correlations with human expert judgments. They often displayed their own stable, judge-specific preferences and sometimes unwarranted confidence.
This means that automated judges cannot reliably substitute for human evaluation in marketing creativity. They can serve as a coarse filter, but critical decisions still require expert human oversight.
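One way to run the agreement check described above is to compare a judge’s model ranking with the human-derived ranking using a rank correlation. The sketch below uses SciPy’s Spearman and Kendall statistics on made-up rankings, purely to illustrate the mechanics.

```python
# Illustrative agreement check between an LLM judge and human rankings.
from scipy.stats import kendalltau, spearmanr

human_rank = [1, 2, 3, 4, 5]   # rank of each model from human pairwise preferences
judge_rank = [2, 1, 5, 3, 4]   # rank assigned by an LLM judge (hypothetical)

rho, rho_p = spearmanr(human_rank, judge_rank)
tau, tau_p = kendalltau(human_rank, judge_rank)
print(f"Spearman rho = {rho:.2f} (p = {rho_p:.2f})")
print(f"Kendall tau  = {tau:.2f} (p = {tau_p:.2f})")
# Weak or unstable correlations across judges and judging prompts are the
# signal that the automated judge is not a reliable stand-in for experts.
```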
Conventional Creativity Tests Don’t Fully Transfer
Finally, the study adapted standard human creativity tests, such as the Torrance Tests of Creative Thinking, to evaluate LLMs on general creative abilities outside the marketing domain. These tests typically measure fluency, flexibility, originality, and elaboration.
However, the results showed weak or inconsistent correlations between performance on these conventional tests and human preferences in brand-focused marketing tasks. This suggests that what constitutes “creativity” in an abstract test doesn’t always translate directly to perceived value in a specific marketing context. Domain-specific human ratings remain crucial for applied evaluation.
Conclusion and Recommendations
The Creativity Benchmark provides valuable insights into the current state of LLMs in marketing creativity. The key takeaways are:
- No single LLM is a clear winner; performance is tightly clustered.
- LLMs are not reliable judges of creative work and cannot replace human experts.
- Diversity in ideas is crucial, and some models excel at generating a wider range of options.
- Prompt framing can significantly influence the novelty of LLM outputs.
For marketing practitioners, the study offers practical advice: choose LLMs based on factors like brand-voice control, workflow integration, cost, and team preference rather than chasing small rank differences; use LLMs to expand the pool of ideas while keeping human judgment for the final selection; and experiment with different prompt framings to unlock more diverse and unconventional ideas. The full research paper can be found here.


