TLDR: A research paper introduces an LLM-based multimodal fusion system, called Gemma Fusion, for predicting commercial (brand) memorability. It integrates visual (ViT) and textual (E5) features, guided by LLM-generated rationale prompts, and uses LoRA for efficient adaptation. Compared to a gradient-boosted tree baseline, the LLM system shows greater robustness and generalization on the held-out test set, despite the small training dataset.
In today’s fast-paced digital world, capturing and retaining audience attention is crucial, especially for advertisers. A new research paper by Aleksandar Pramov from the Georgia Institute of Technology delves into the challenging task of predicting how memorable a commercial (brand) will be. This work, presented as part of the MediaEval 2025 workshop competition, introduces an innovative approach using Large Language Models (LLMs) to fuse various types of data for better prediction.
The core problem addressed is “commercial/ad memorability,” which aims to understand how well a brand is remembered from a video. This is a complex task because it involves integrating different data channels—visuals, audio, and text—to predict a subjective human characteristic. The study specifically focuses on commercial videos from the financial industry, utilizing a dataset of 424 curated YouTube commercials.
The Approach: Blending Data with LLMs
The research explores two main modeling architectures. The first is a traditional baseline model, a histogram-based gradient boosted tree (HGBT). This model takes numerical metadata, text embeddings (from subtitles, titles, and descriptions), and pre-computed video embeddings as input. While effective on the development data, the HGBT model overfit significantly when evaluated on new, unseen data, indicating poor generalization.
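To make the baseline concrete, here is a minimal sketch of such a pipeline using scikit-learn's HistGradientBoostingRegressor. The feature widths, hyperparameters, and random data below are purely illustrative, not taken from the paper:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Illustrative stand-ins for the three feature streams (424 commercials).
n_videos = 424
meta = rng.random((n_videos, 10))        # numerical metadata (hypothetical width)
text_emb = rng.random((n_videos, 768))   # E5-base-v2 embeddings of subtitles/titles
video_emb = rng.random((n_videos, 768))  # pre-computed ViT video embeddings
y = rng.random(n_videos)                 # memorability targets (placeholder)

# Concatenate all feature streams into one design matrix for the tree model.
X = np.concatenate([meta, text_emb, video_emb], axis=1)

model = HistGradientBoostingRegressor(max_iter=300, learning_rate=0.05)
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
print(f"cross-validated MAE: {-scores.mean():.3f}")
```

With high-dimensional embeddings and only 424 examples, strong development scores from a model like this can mask exactly the kind of overfitting the paper reports.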
The more advanced approach, termed “Gemma Fusion,” uses a Gemma-3 LLM as its central component. It integrates multi-modal features by projecting external data streams (such as visual and textual embeddings) into the LLM’s embedding space. A key innovation is the use of LLM-generated “rationale prompts”: prompts created by the LLM itself from expert-derived aspects of memorability (such as brand integration, clarity of messaging, and novelty) that guide the fusion model. The model also employs Low-Rank Adaptation (LoRA) to adapt the LLM to this task efficiently.
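The paper's fusion code isn't reproduced here, but the mechanism it describes, projecting external embeddings into the LLM's token-embedding space and prepending them to the prompt, can be sketched along these lines. The dimensions, module names, and LoRA settings below are assumptions for illustration:

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Maps an external embedding (e.g. ViT or E5) into the LLM's
    token-embedding space so it can be spliced into the prompt."""

    def __init__(self, in_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Hypothetical sizes: 768-d ViT/E5 features, 2560-d LLM hidden size.
vit_proj = ModalityProjector(768, 2560)
e5_proj = ModalityProjector(768, 2560)

vit_emb = torch.randn(1, 768)          # one video's ViT embedding
e5_emb = torch.randn(1, 768)           # one video's text embedding
prompt_emb = torch.randn(1, 32, 2560)  # embedded rationale prompt (placeholder)

# Prepend the projected modality "tokens" to the prompt embeddings;
# the LoRA-adapted LLM then consumes the fused sequence.
fused = torch.cat(
    [vit_proj(vit_emb)[:, None, :], e5_proj(e5_emb)[:, None, :], prompt_emb],
    dim=1,
)
print(fused.shape)  # torch.Size([1, 34, 2560])

# LoRA would typically be attached to the LLM's attention projections,
# e.g. via the peft library (settings are illustrative):
# from peft import LoraConfig, get_peft_model
# lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
# llm = get_peft_model(llm, lora_cfg)
```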
Key Features and Data
The models were fed with several types of features:
- Numerical metadata provided by the competition organizers.
- E5-base-v2 embeddings of subtitles, titles, and descriptions (see the embedding sketch after this list).
- Pre-computed video embeddings (specifically ViT embeddings).
- Subtitle summaries and memorability rationales generated by another LLM (gemma3-4b-it-qat), which were then used as embeddings or as part of the prompt text.
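As an illustration of the text side, E5 embeddings of this kind can be produced with the sentence-transformers library; the commercial text in this sketch is invented for the example:

```python
from sentence_transformers import SentenceTransformer

# E5 models expect a "query: " / "passage: " prefix on each input text.
encoder = SentenceTransformer("intfloat/e5-base-v2")

texts = [
    "passage: Open your first savings account in minutes.",  # hypothetical subtitle line
    "passage: Acme Bank - banking made simple.",             # hypothetical title
]
embeddings = encoder.encode(texts, normalize_embeddings=True)
print(embeddings.shape)  # (2, 768)
```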
Results and Insights
The results highlight a significant advantage of the LLM-based Gemma Fusion system. While the HGBT baseline overfit and performed poorly on the final test set, the Gemma Fusion model was more robust and generalized far better, predicting memorability more accurately for commercials it had never seen. LoRA adaptation also proved beneficial, further improving performance.
Interestingly, the effectiveness of the LLM-generated prompts varied depending on what was being predicted. For “Brand Memorability,” prompts based on expert-like rationales worked best. However, for predicting the general “Memorability Score,” prompts using subtitle summaries yielded better results. This suggests that tailoring the prompt content to the specific memorability aspect is important.
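To illustrate the idea (the exact prompt wording is not reproduced here), target-specific templates might look like this, with the rationale-style prompt used for Brand Memorability and the summary-style prompt for the overall Memorability Score:

```python
# Illustrative templates only, not the paper's actual prompts.
BRAND_PROMPT = (
    "You are an advertising expert. Considering brand integration, "
    "clarity of messaging, and novelty, rate how memorable the brand "
    "in this commercial is.\nRationale: {rationale}\nScore:"
)

SCORE_PROMPT = (
    "Given this summary of the commercial's subtitles, rate how "
    "memorable the commercial is overall.\nSummary: {summary}\nScore:"
)

print(BRAND_PROMPT.format(rationale="The logo recurs at every scene cut..."))
```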
Challenges and Future Directions
A major challenge faced by the researchers was the small size of the training dataset. Despite this, the LLM-multimodal fusion approach managed to improve baseline performance and system stability. Future work aims to address the data scarcity by incorporating additional datasets like memento10k, which could further enhance model stability and performance. Additionally, refining the expert-oriented prompts for memorability and exploring models fine-tuned on financial domain-specific textual data are promising avenues for future research.
This research marks a significant step towards more accurate and robust prediction of commercial memorability, offering valuable insights for marketers and content creators. The full paper and codebase are available here: LLM-based Fusion of Multi-modal Features for Commercial Memorability Prediction.


