Understanding Digital Ad Performance: Multimodal Forecasting with Explanations

TLDR: This research introduces a novel multimodal forecasting framework for digital advertising that predicts click volumes and provides human-interpretable explanations. It combines traditional numerical click data with textual change logs from ad campaigns, using reinforcement learning to enhance text understanding and data fusion. The method, which employs a fine-tuned Large Language Model (LLM) and a Transformer-based time series model, significantly outperforms existing baselines in both prediction accuracy and the quality of its textual reasoning, offering advertisers deeper insights into evolving campaign dynamics.

In the fast-paced world of digital advertising, accurately predicting how many clicks an ad campaign will receive is crucial for both revenue generation and strategic planning. Traditionally, forecasting models have relied solely on numerical data, often missing out on the rich contextual information embedded in textual elements like keyword updates or budget adjustments. A new research paper introduces an innovative approach to tackle this challenge, combining diverse data types to offer more accurate predictions and, importantly, understandable explanations.

The paper, titled “Forecasting Clicks in Digital Advertising: Multimodal Inputs and Interpretable Outputs,” presents a multimodal forecasting framework. This framework integrates historical click data with textual logs from real-world advertising campaigns. The core innovation lies in its use of reinforcement learning (RL) to significantly improve how the system understands textual information and how it combines these different types of data. The result is not just a numerical prediction of future click volumes, but also human-interpretable explanations that shed light on why a particular trend is predicted.

Bridging the Gap: Numerical and Textual Data

Traditional time series forecasting (TSF) models, while effective for numerical data, often overlook the semantic insights hidden in text-based events. Imagine an ad campaign where a sudden drop in clicks occurs. A traditional model might just report the drop, but a multimodal approach could link it to a specific event, such as a major keyword removal or a change in bidding strategy, recorded in the campaign’s change logs. This paper aims to leverage such textual cues, which are often sparse but highly informative.

The researchers collected data from 46 real-world advertisement campaigns, encompassing both numerical time series data and corresponding textual change logs. These logs detail various configuration changes, including budget adjustments, keyword additions/deletions, ad headline modifications, and bid strategy changes. The challenge with this textual data is its sparsity – many days have “no changes” – making it difficult for standard models to utilize directly. To overcome this, the framework uses Large Language Model (LLM) summaries to extract meaningful signals from these sparse texts.

Reinforcement Learning for Smarter Explanations

A key aspect of this research is the application of reinforcement learning to fine-tune an LLM. The LLM is trained to not only predict click trends but also to generate concise, two-sentence textual reasonings for its predictions. A custom reward function guides this training, evaluating three critical components:

Format Compliance: Ensuring the LLM’s output adheres to a specified structure (e.g., using specific tags for reasoning and prediction).
Prediction Accuracy: Rewarding the model when its predicted click trend (increase/decrease) matches the actual outcome.
Reasoning Alignment: Checking if the sentiment inferred from the generated reasoning (e.g., positive for an increase, negative for a decrease) aligns with the actual trend. This prevents the model from generating explanations that contradict its own prediction.
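
The three reward components can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the `<reasoning>` and `<prediction>` tag names, the equal weighting of the three terms, and the keyword-based sentiment check are all hypothetical stand-ins chosen for this example.

```python
import re

# Hypothetical output format: these tag names are assumptions for illustration.
PATTERN = re.compile(
    r"^\s*<reasoning>(?P<reasoning>.+?)</reasoning>\s*"
    r"<prediction>(?P<prediction>increase|decrease)</prediction>\s*$",
    re.DOTALL,
)

# Toy keyword lists standing in for a real sentiment model (assumption).
POSITIVE = {"increase", "rise", "grow", "boost", "higher", "gain"}
NEGATIVE = {"decrease", "drop", "fall", "decline", "lower", "loss"}


def infer_sentiment(text: str) -> str:
    """Map reasoning text to 'increase', 'decrease', or 'neutral' by keyword count."""
    words = re.findall(r"[a-z]+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "increase"
    if score < 0:
        return "decrease"
    return "neutral"


def reward(output: str, actual_trend: str) -> float:
    """Sum of format, accuracy, and alignment rewards (equal weights assumed)."""
    match = PATTERN.match(output)
    if match is None:
        return 0.0  # malformed output earns nothing
    format_reward = 1.0
    accuracy_reward = 1.0 if match["prediction"] == actual_trend else 0.0
    sentiment = infer_sentiment(match["reasoning"])
    alignment_reward = 1.0 if sentiment == actual_trend else 0.0
    return format_reward + accuracy_reward + alignment_reward


sample = (
    "<reasoning>Many keywords were removed, so clicks will likely drop. "
    "Reach will be lower.</reasoning><prediction>decrease</prediction>"
)
print(reward(sample, "decrease"))
```

In this sketch a well-formed output that predicts the right direction and argues in the same direction earns the maximum score, while an output that ignores the format earns zero regardless of its content.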

This RL-based fine-tuning, using a method called Group Relative Policy Optimization (GRPO), helps the LLM produce more accurate and logically consistent explanations, even with limited computational resources.
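
As GRPO is commonly described, each sampled output is scored relative to the other outputs sampled for the same prompt, which removes the need for a separate value network and helps keep compute requirements down. The sketch below shows only that normalization step. The function name, the use of the population standard deviation, and the epsilon term are illustrative assumptions, and the full policy update (clipping, KL regularization) is omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    # eps avoids division by zero when every output in the group scores the same
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for four outputs sampled from the same prompt (made-up values).
advantages = group_relative_advantages([3.0, 1.0, 1.0, 3.0])
print(advantages)
```

Outputs that score above the group average receive a positive advantage and are reinforced; those below it are discouraged.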

The End-to-End Forecasting Pipeline

The complete multimodal click forecasting pipeline integrates several components:

A Transformer architecture processes the numerical time series data to identify temporal patterns.
The RL-fine-tuned LLM generates textual summaries and predictions from the change logs.
An open-source embedding model (XLM-Roberta) converts these textual summaries into fixed-length numerical embeddings.
A trainable projection layer maps these textual embeddings into a space compatible with the numerical features, amplifying the text’s influence.
Finally, the outputs from the numerical time series model and the textual component are linearly combined to produce the final numerical click forecast.
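
The last two steps, projection and linear combination, can be illustrated with a toy example. This is a sketch under assumptions rather than the authors' code: the two-dimensional embedding, the fixed weights, and the scalar mixing coefficients `w_ts` and `w_text` are invented for illustration. In the real pipeline the embedding would have hundreds of dimensions and the projection weights would be learned during training.

```python
def project(embedding, weights, bias):
    """Linear projection: one output value per row of `weights`."""
    return [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weights, bias)
    ]


def fuse(ts_forecast, text_features, w_ts, w_text):
    """Linearly combine the numerical forecast with per-step text features."""
    return [w_ts * y + w_text * t for y, t in zip(ts_forecast, text_features)]


# Toy inputs (all values are made up for illustration).
ts_forecast = [100.0, 110.0, 120.0]   # time series model output, 3-day horizon
text_embedding = [0.5, -1.0]          # stand-in for a text summary embedding
weights = [[2.0, 1.0], [0.0, 3.0], [-4.0, 0.0]]  # maps 2 dims to 3 horizon steps
bias = [0.0, 0.0, 0.0]

text_features = project(text_embedding, weights, bias)
forecast = fuse(ts_forecast, text_features, w_ts=1.0, w_text=10.0)
print(forecast)  # [100.0, 80.0, 100.0]
```

Here the text signal pulls the second and third forecast steps downward, the kind of adjustment one would expect after a change log reports a large keyword removal.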

Demonstrated Effectiveness

Empirical evaluations on a large-scale industry dataset showed significant improvements. The fine-tuned Qwen model, used as the LLM backbone, achieved an 18.38% improvement in prediction accuracy and a 6.69% increase in overall reward score compared to other models. A qualitative example highlighted the model’s ability to correctly interpret the impact of critical campaign changes, such as large-scale keyword removals, leading to accurate predictions that other models missed.

Furthermore, a human evaluation involving five domain experts rated the explanations generated by the fine-tuned Qwen model higher in terms of alignment with ground truth, factual accuracy, and coherence compared to other leading models. The full forecasting pipeline also demonstrated superior performance, achieving lower Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) than all tested baselines, including models using only numerical data or raw change logs.
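
For readers unfamiliar with the two error metrics, the following short sketch computes them using their standard definitions. The click counts are made-up numbers for illustration and are not results from the paper.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error: average size of the errors, ignoring sign."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Squared Error: penalizes large errors more heavily than MAE."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


actual_clicks = [100, 120, 80]      # illustrative values only
predicted_clicks = [110, 120, 60]

print(mae(actual_clicks, predicted_clicks))   # 10.0
print(rmse(actual_clicks, predicted_clicks))
```

Because RMSE squares each error before averaging, a single badly missed day raises it more than it raises MAE, which is why the two metrics are usually reported together.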

Looking Ahead

This research marks a significant step forward in multimodal click prediction for digital advertising, being the first to incorporate textual reasoning into time series forecasting. While the current results are promising, the authors note that there’s still potential to enhance the reward function design to generate even more logically consistent and outcome-aligned reasoning. Future work aims to explore more structured reasoning mechanisms within these multimodal frameworks.

For more in-depth technical details, see the full research paper, “Forecasting Clicks in Digital Advertising: Multimodal Inputs and Interpretable Outputs.”
