New Benchmark Reveals Surprising Similarity in LLM Creative Outputs, Underscoring Human Role

TLDR: The first-ever benchmark for evaluating the creativity of Large Language Models (LLMs) in marketing, developed by Springboards in collaboration with leading industry bodies, has found that popular AI tools like ChatGPT, Gemini, and Claude exhibit remarkably similar creative outputs. The study challenges the notion of a ‘best’ AI tool for creative tasks and emphasizes the indispensable role of human creativity and judgment in achieving breakthrough outcomes.

NEW YORK, October 21, 2025 – A new study, dubbed the ‘Creativity Benchmark,’ has found a surprising uniformity in the creative outputs of leading Large Language Models (LLMs), suggesting that these tools are more alike in their creative capabilities than widely perceived. The research was conducted by Springboards, an AI platform dedicated to fostering creativity in advertising, in partnership with industry organizations including the 4As, ACA, APG, D&AD, IAA, IPA, and The One Club for Creativity. It marks the world’s first comprehensive benchmark for evaluating LLM creativity in a marketing context.

The study’s core finding indicates that AI tools such as ChatGPT, Gemini, and Claude perform with striking similarity across various creative tasks. This challenges the prevailing assumption that certain AI models significantly outperform others in generating novel ideas, suggesting instead that agencies and brands should focus less on finding a ‘superior’ AI and more on how these tools are integrated into creative workflows.

According to Pip Bingemann, CEO and co-founder of Springboards, the reason for this convergence lies in the fundamental nature of LLMs. “Everyone assumes some AI tools are way better than others for creative work,” Bingemann stated. “But our tests showed the results were pretty close. Why? Because these models are machines designed to recognize patterns and give you the most probable answer—and ‘probable’ has never been called ‘creative.’ Keeping humans in the loop and optimizing for a wider range of varied ideas is crucial.”

The benchmark evaluated LLMs across three critical types of creative challenges relevant to marketing: uncovering surprising consumer insights, developing expansive campaign ideas, and formulating bold, attention-grabbing concepts. The findings consistently showed that many AI tools tended to suggest similar ideas repeatedly, highlighting a potential limitation in generating true diversity without human intervention.

Further findings underscore the continued need for human judgment in the creative process. When AI systems were asked to evaluate creative ideas, their scores diverged significantly from those given by human experts. AI alone is therefore an unreliable judge of which creative concepts are best, which reinforces the value of human discernment in assessing creative quality and relevance.

Jeremy Lockhorn, SVP, Creative Technologies & Innovation at the 4As, commented on the implications of the research: “LLMs aren’t a one-size-fits-all solution—they’re general purpose tools that require human creativity to unlock breakthrough outcomes. These findings suggest agencies and brands should continue to evaluate which models are best suited for creative work – and that a multi-model approach may well be the best path forward.” Tony Hale, CEO of the Advertising Council Australia, echoed this sentiment, remarking, “This study highlights that creativity isn’t about which AI you use, it’s about how you use it.”

The research also found that traditional creativity tests, often used in psychological studies, do not reliably predict an AI’s performance on marketing-specific creative tasks, which points to a need for metrics tailored to brand work. The ‘Creativity Benchmark’ itself collected pairwise preferences from 678 practicing creatives across 11,012 anonymized comparisons and analyzed them with Bradley-Terry models. Performance was tightly clustered: the highest-rated model beat the lowest-rated one only about 61% of the time.
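For readers unfamiliar with Bradley-Terry models, the sketch below shows how a head-to-head win rate such as the reported 61% relates to model "strength" scores. It is an illustration only, not the study's actual analysis code: the function names and the win counts in the usage note are hypothetical, and the fitting routine is a standard minorization-maximization iteration chosen here for brevity.

```python
import math


def bt_win_probability(strength_a: float, strength_b: float) -> float:
    """Probability that A is preferred over B under a Bradley-Terry model.

    Strengths are positive numbers; only their ratio matters.
    """
    return strength_a / (strength_a + strength_b)


def log_strength_gap(win_probability: float) -> float:
    """Log-strength difference implied by a head-to-head win probability."""
    return math.log(win_probability / (1.0 - win_probability))


def fit_bradley_terry(wins, iterations=200):
    """Estimate strengths from a matrix of pairwise win counts.

    wins[i][j] is the number of comparisons in which item i was preferred
    over item j. Assumes every item has at least one win and one comparison.
    Uses the classic minorization-maximization update and rescales the
    strengths so that they average 1.
    """
    n = len(wins)
    strengths = [1.0] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denominator = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denominator)
        scale = n / sum(updated)
        strengths = [s * scale for s in updated]
    return strengths
```

As a usage example with made-up counts, `fit_bradley_terry([[0, 61], [39, 0]])` yields strengths whose implied win probability for the first item is 0.61. On the log scale, `log_strength_gap(0.61)` is roughly 0.45, a small gap compared with what a clearly dominant model would show, which is one way to read the "tightly clustered" result.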

Ultimately, the study advocates for expert human evaluation and diversity-aware workflows, positioning AI as a powerful assistant that amplifies human ingenuity rather than replacing it. The findings serve as a critical guide for the advertising and marketing industries as they navigate the evolving landscape of AI-driven creative tools.

Ananya Rao
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
