TLDR: A research paper introduces an LLM-based multimodal fusion system, called Gemma Fusion, for predicting commercial (brand) memorability. It integrates visual (ViT) and textual (E5) features, guided by LLM-generated rationale prompts, and uses LoRA for efficient adaptation. Compared to a gradient-boosted tree baseline, the LLM system shows greater robustness and generalization on the held-out test set, despite the small training dataset.
In today’s fast-paced digital world, capturing and retaining audience attention is crucial, especially for advertisers. A new research paper by Aleksandar Pramov from the Georgia Institute of Technology delves into the challenging task of predicting how memorable a commercial (brand) will be. This work, presented as part of the MediaEval 2025 workshop competition, introduces an innovative approach using Large Language Models (LLMs) to fuse various types of data for better prediction.
The core problem addressed is “commercial/ad memorability,” which aims to understand how well a brand is remembered from a video. This is a complex task because it involves integrating different data channels—visuals, audio, and text—to predict a subjective human characteristic. The study specifically focuses on commercial videos from the financial industry, utilizing a dataset of 424 curated YouTube commercials.
The Approach: Blending Data with LLMs
The research explores two main modeling architectures. The first is a traditional baseline model, a histogram-based gradient boosted tree (HGBT). This model takes numerical metadata, text embeddings (from subtitles, titles, and descriptions), and pre-computed video embeddings as input. While effective on the development data, the HGBT model overfit significantly when evaluated on new, unseen data, indicating poor generalization.
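To make the baseline concrete, here is a minimal sketch of such a pipeline using scikit-learn's HistGradientBoostingRegressor. The feature widths, hyperparameters, and random data below are purely illustrative, not taken from the paper:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Illustrative stand-ins for the three feature streams (424 commercials).
n_videos = 424
meta = rng.random((n_videos, 10))        # numerical metadata (hypothetical width)
text_emb = rng.random((n_videos, 768))   # E5-base-v2 embeddings of subtitles/titles
video_emb = rng.random((n_videos, 768))  # pre-computed ViT video embeddings
y = rng.random(n_videos)                 # memorability targets (placeholder)

# Concatenate all feature streams into one design matrix for the tree model.
X = np.concatenate([meta, text_emb, video_emb], axis=1)

model = HistGradientBoostingRegressor(max_iter=300, learning_rate=0.05)
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
print(f"cross-validated MAE: {-scores.mean():.3f}")
```

With high-dimensional embeddings and only 424 examples, strong development scores from a model like this can mask exactly the kind of overfitting the paper reports.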
The more advanced approach, termed “Gemma Fusion,” uses a Gemma-3 LLM as its central component. It integrates multi-modal features by projecting external data streams (such as visual and textual embeddings) into the LLM’s embedding space. A key innovation is the use of LLM-generated “rationale prompts”: prompts created by the LLM itself from expert-derived aspects of memorability (such as brand integration, clarity of messaging, and novelty) that guide the fusion model. The model also employs Low-Rank Adaptation (LoRA) to adapt the LLM to this task efficiently.
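The paper's fusion code isn't reproduced here, but the mechanism it describes, projecting external embeddings into the LLM's token-embedding space and prepending them to the prompt, can be sketched along these lines. The dimensions, module names, and LoRA settings below are assumptions for illustration:

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Maps an external embedding (e.g. ViT or E5) into the LLM's
    token-embedding space so it can be spliced into the prompt."""

    def __init__(self, in_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Hypothetical sizes: 768-d ViT/E5 features, 2560-d LLM hidden size.
vit_proj = ModalityProjector(768, 2560)
e5_proj = ModalityProjector(768, 2560)

vit_emb = torch.randn(1, 768)          # one video's ViT embedding
e5_emb = torch.randn(1, 768)           # one video's text embedding
prompt_emb = torch.randn(1, 32, 2560)  # embedded rationale prompt (placeholder)

# Prepend the projected modality "tokens" to the prompt embeddings;
# the LoRA-adapted LLM then consumes the fused sequence.
fused = torch.cat(
    [vit_proj(vit_emb)[:, None, :], e5_proj(e5_emb)[:, None, :], prompt_emb],
    dim=1,
)
print(fused.shape)  # torch.Size([1, 34, 2560])

# LoRA would typically be attached to the LLM's attention projections,
# e.g. via the peft library (settings are illustrative):
# from peft import LoraConfig, get_peft_model
# lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
# llm = get_peft_model(llm, lora_cfg)
```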
Key Features and Data
The models were fed with several types of features:
- Numerical metadata provided by the competition organizers.
- E5-base-v2 embeddings of subtitles, titles, and descriptions (see the embedding sketch after this list).
- Pre-computed video embeddings (specifically ViT embeddings).
- Subtitle summaries and memorability rationales generated by another LLM (gemma3-4b-it-qat), which were then used as embeddings or as part of the prompt text.
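As an illustration of the text side, E5 embeddings of this kind can be produced with the sentence-transformers library; the commercial text in this sketch is invented for the example:

```python
from sentence_transformers import SentenceTransformer

# E5 models expect a "query: " / "passage: " prefix on each input text.
encoder = SentenceTransformer("intfloat/e5-base-v2")

texts = [
    "passage: Open your first savings account in minutes.",  # hypothetical subtitle line
    "passage: Acme Bank - banking made simple.",             # hypothetical title
]
embeddings = encoder.encode(texts, normalize_embeddings=True)
print(embeddings.shape)  # (2, 768)
```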
Results and Insights
The results highlight a significant advantage of the LLM-based Gemma Fusion system. While the HGBT baseline overfit and performed poorly on the final test set, the Gemma Fusion model was more robust and generalized far better, predicting memorability more accurately for commercials it had never seen. LoRA adaptation also proved beneficial, further improving performance.
Interestingly, the effectiveness of the LLM-generated prompts varied depending on what was being predicted. For “Brand Memorability,” prompts based on expert-like rationales worked best. However, for predicting the general “Memorability Score,” prompts using subtitle summaries yielded better results. This suggests that tailoring the prompt content to the specific memorability aspect is important.
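To illustrate the idea (the exact prompt wording is not reproduced here), target-specific templates might look like this, with the rationale-style prompt used for Brand Memorability and the summary-style prompt for the overall Memorability Score:

```python
# Illustrative templates only, not the paper's actual prompts.
BRAND_PROMPT = (
    "You are an advertising expert. Considering brand integration, "
    "clarity of messaging, and novelty, rate how memorable the brand "
    "in this commercial is.\nRationale: {rationale}\nScore:"
)

SCORE_PROMPT = (
    "Given this summary of the commercial's subtitles, rate how "
    "memorable the commercial is overall.\nSummary: {summary}\nScore:"
)

print(BRAND_PROMPT.format(rationale="The logo recurs at every scene cut..."))
```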
Challenges and Future Directions
A major challenge faced by the researchers was the small size of the training dataset. Despite this, the LLM-multimodal fusion approach managed to improve baseline performance and system stability. Future work aims to address the data scarcity by incorporating additional datasets like memento10k, which could further enhance model stability and performance. Additionally, refining the expert-oriented prompts for memorability and exploring models fine-tuned on financial domain-specific textual data are promising avenues for future research.
This research marks a significant step towards more accurate and robust prediction of commercial memorability, offering valuable insights for marketers and content creators. The full paper and codebase are available here: LLM-based Fusion of Multi-modal Features for Commercial Memorability Prediction.


