TLDR: The DiCap model introduces a novel framework for prompt learning that leverages diffusion-based counterfactual generation to align prompts with causal features in images. This approach addresses the issue of models learning spurious correlations, improving generalization, especially for unseen categories. By generating ‘what if’ scenarios (counterfactuals) and using them in a contrastive learning setup, DiCap guides prompt embeddings to focus on essential causal attributes, supported by rigorous theoretical guarantees. Experimental results demonstrate its superior performance across image classification, image-text retrieval, and visual question answering tasks.
In the rapidly evolving field of artificial intelligence, prompt learning has emerged as a highly efficient way to adapt large pre-trained models for specific tasks. Unlike traditional methods that require extensive retraining or fine-tuning, prompt learning uses optimized templates to guide models in generating more accurate and relevant outputs. This approach significantly speeds up the process of transferring knowledge from powerful foundation models to new applications.
However, current prompt learning methods face a significant hurdle: they often struggle to identify and focus on the true, underlying characteristics of data. Instead, they can pick up on what are called “spurious correlations”: features that frequently appear together but aren’t causally related to the task at hand. For instance, in an image classification task, a model might learn to associate a camel with a desert background or yurts, rather than focusing on the camel’s unique physical traits like its hump. Performance then drops when the model encounters images where these non-causal features are different, or when it needs to generalize to new, unseen categories.
To tackle this challenge, researchers have introduced a novel framework called DiCap: Diffusion-based Counterfactual prompt learning. This innovative model aims to make prompt learning more robust and generalizable by aligning prompts with the true causal features within images. At its core, DiCap leverages the power of counterfactual learning, which involves asking “what if” questions. For example, “What would this image look like if a cow were standing where the camel originally stood?” By generating such counterfactual images, the model can learn to distinguish between essential causal features and irrelevant background noise.
Generating high-quality counterfactual images, especially for complex visual data, has traditionally been difficult. Existing methods often rely on a lot of additional information, like semantic similarities or knowledge graphs, and frequently lack a strong theoretical basis to guarantee the accuracy of the generated counterfactuals. DiCap overcomes these limitations by utilizing advanced diffusion models.
Diffusion models are particularly well-suited for this task because they can preserve high-dimensional image features, minimizing information loss. Their iterative sampling process naturally allows for precise changes to specific causal features. Crucially, DiCap is built on rigorous mathematical principles, ensuring the reliability of its counterfactual generation. The process involves taking an original image, progressively adding noise (a process called abduction), and then, guided by an “anti-causal predictor,” reversing this process to generate a counterfactual image where a specific causal factor (like the animal’s identity) has been changed, but non-causal features remain largely similar. This ensures that the generated counterfactual image is the “smallest perturbation” needed to change the label, making it a highly effective negative example for learning.
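To make the two stages above more concrete, here is how they look in the standard notation for diffusion models with classifier guidance. This is a sketch of the general technique, not DiCap's exact formulation, which may differ. The symbols are assumptions introduced for illustration: $\bar{\alpha}_t$ is the cumulative noise schedule, $\epsilon_\theta$ the learned noise predictor, $p_\phi(y \mid x_t)$ a label predictor standing in for the anti-causal predictor, $y'$ the counterfactual label, and $s$ a guidance scale.

```latex
% Forward noising of the original image x_0 (the "abduction" direction)
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

% Bayes' rule splits the score of the label-conditioned distribution
\nabla_{x_t} \log p(x_t \mid y')
  = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(y' \mid x_t)

% Guided noise estimate used in each reverse (denoising) step
\hat{\epsilon}(x_t, t)
  = \epsilon_\theta(x_t, t)
  - s\, \sqrt{1 - \bar{\alpha}_t}\; \nabla_{x_t} \log p_\phi(y' \mid x_t)
```

In this reading, the gradient term nudges each denoising step toward the counterfactual label $y'$, while the noised starting point carries over the non-causal content of the original image.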
Once these counterfactual images are generated, DiCap employs a dual contrastive learning framework. In simple terms, it trains the prompt embeddings to be strongly associated with the original, factual images (positive samples) while simultaneously pushing them away from the generated counterfactual images (hard negative samples). This forces the prompts to focus on the stable, causal features that define the object, rather than superficial correlations. The model even selects the “hardest” negative samples by choosing counterfactual labels that are semantically closest to the original, like generating a tiger counterfactual for a cat image, rather than a dog.
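The two ideas in this step, picking the semantically closest counterfactual label and contrasting factual against counterfactual samples, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the toy 2-D embeddings, the temperature value, and the one-directional InfoNCE-style loss are all hypothetical, and DiCap's actual dual objective may be defined differently.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hardest_negative_label(true_label, label_embeddings):
    """Return the other label whose embedding is closest to the true label's."""
    anchor = label_embeddings[true_label]
    candidates = [name for name in label_embeddings if name != true_label]
    return max(candidates, key=lambda name: cosine(anchor, label_embeddings[name]))


def contrastive_loss(prompt, factual, counterfactuals, temperature=0.1):
    """InfoNCE-style loss: small when the prompt is close to the factual
    image embedding and far from every counterfactual embedding."""
    pos = math.exp(cosine(prompt, factual) / temperature)
    neg = sum(math.exp(cosine(prompt, c) / temperature) for c in counterfactuals)
    return -math.log(pos / (pos + neg))


# Toy label embeddings (made up for illustration only).
label_embeddings = {
    "cat": [1.0, 0.1],
    "tiger": [0.9, 0.3],
    "dog": [0.2, 1.0],
}
print(hardest_negative_label("cat", label_embeddings))  # tiger

# Toy image embeddings: a prompt aligned with the factual image is
# penalized far less than one aligned with the counterfactual image.
factual_image = [1.0, 0.1]
counterfactual_image = [0.2, 1.0]
print(contrastive_loss([1.0, 0.0], factual_image, [counterfactual_image]))
print(contrastive_loss([0.0, 1.0], factual_image, [counterfactual_image]))
```

Choosing the nearest label as the counterfactual target makes the negative sample harder to tell apart from the original, so the prompt has to rely on the features that actually separate the two classes.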
The effectiveness of DiCap has been demonstrated through extensive experiments across various visual tasks, including image classification, image-text retrieval, and visual question answering. The results show that DiCap consistently outperforms existing methods, especially in its ability to generalize to unseen categories. For instance, in image classification, DiCap showed an average improvement of 17.6% over the baseline CLIP model on seen classes and 3.87% on unseen classes. It also proved highly stable across different settings and robust to variations in its single hyperparameter.
This research marks a significant step forward in making AI models more robust and reliable by grounding prompt learning in causal principles. While currently focused on single-modal learning, the DiCap framework holds immense potential for adaptation to multi-modal settings in the future. For more in-depth information, you can read the full research paper here.


