TLDR: A new research method, FIXLIP, uses game theory and weighted Banzhaf interactions to explain how vision-language models like CLIP determine image-text similarity. Unlike previous methods, FIXLIP focuses on complex cross-modal interactions, offering more faithful and efficient explanations. It also introduces new evaluation metrics and proves useful for comparing different model architectures, helping developers debug and understand these advanced AI systems.
Vision-language models, such as CLIP and SigLIP, have transformed computer vision by enabling capabilities like zero-shot classification, multimodal retrieval, and semantic understanding. These powerful models learn to predict the similarity between images and text, becoming crucial components in advanced AI systems, including those used in sensitive areas like medical imaging.
However, understanding *why* these models make certain similarity predictions has been a challenge. Traditional explanation methods, often relying on ‘saliency maps,’ primarily focus on individual parts of an image or text. They fall short because they don’t capture the intricate, two-way interactions between visual elements and textual descriptions, which are fundamental to how these models operate.
Introducing FIXLIP: A Game-Theoretic Approach to Explanations
A new research paper introduces an approach called Faithful Interaction Explanations of LIP models (FIXLIP). This method offers a unified way to decompose the similarity predictions of vision-language encoders. FIXLIP is grounded in game theory, treating the input image and text tokens as ‘players’ in a cooperative game. The paper analyzes how the ‘weighted Banzhaf interaction index’ can provide more flexible and computationally efficient explanations than older frameworks.
One of the key innovations of FIXLIP is its ability to capture ‘second-order interactions’: it looks at how pairs of image and text elements influence the model’s output, not just individual elements. This is crucial because, for example, the word ‘black’ might not be important on its own, but its interaction with an image region containing a dog could be highly significant when the model scores similarity to the caption ‘black dog’.
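To make the ‘black dog’ intuition concrete, here is a minimal sketch that computes the textbook weighted Banzhaf value and pairwise interaction index by brute force on a four-token toy game. The token names, the `toy_similarity` scores, and the helper functions are all hypothetical, and the paper's exact formulation and estimator may differ from this exhaustive version.

```python
from itertools import combinations

# Hypothetical players: two image patches and two text tokens.
PLAYERS = ["img:dog_patch", "img:grass_patch", "txt:black", "txt:dog"]

def toy_similarity(coalition):
    """Made-up similarity score when only the tokens in `coalition` are unmasked."""
    kept = set(coalition)
    score = 0.0
    if "img:dog_patch" in kept and "txt:dog" in kept:
        score += 0.5   # the word 'dog' matches the dog patch
    if "img:dog_patch" in kept and "txt:black" in kept:
        score += 0.3   # 'black' only matters together with the dog patch
    if "img:grass_patch" in kept:
        score += 0.05  # small effect of a single token on its own
    return score

def subsets(items):
    """Yield every subset of `items` as a tuple."""
    for size in range(len(items) + 1):
        yield from combinations(items, size)

def banzhaf_value(game, players, i, p=0.5):
    """Weighted Banzhaf value: expected marginal contribution of player i
    when every other player is kept independently with probability p."""
    others = [q for q in players if q != i]
    total = 0.0
    for S in subsets(others):
        weight = p ** len(S) * (1 - p) ** (len(others) - len(S))
        total += weight * (game(S + (i,)) - game(S))
    return total

def banzhaf_interaction(game, players, i, j, p=0.5):
    """Pairwise weighted Banzhaf interaction: expected joint effect of i and j
    beyond the sum of their separate effects."""
    others = [q for q in players if q not in (i, j)]
    total = 0.0
    for S in subsets(others):
        weight = p ** len(S) * (1 - p) ** (len(others) - len(S))
        delta = game(S + (i, j)) - game(S + (i,)) - game(S + (j,)) + game(S)
        total += weight * delta
    return total

black_alone = toy_similarity(("txt:black",))
black_value = banzhaf_value(toy_similarity, PLAYERS, "txt:black")
black_with_dog_patch = banzhaf_interaction(
    toy_similarity, PLAYERS, "img:dog_patch", "txt:black")
black_with_grass = banzhaf_interaction(
    toy_similarity, PLAYERS, "img:grass_patch", "txt:black")
print(black_alone, black_value, black_with_dog_patch, black_with_grass)
```

In this toy game the word ‘black’ scores nothing by itself, yet its interaction with the dog patch is clearly positive while its interaction with the grass patch is zero, which is the kind of distinction a single per-token score cannot express.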
How FIXLIP Works
FIXLIP employs a ‘cross-modal sampling’ strategy. Instead of sampling one mask over the whole image-text input for each model query, it samples masked versions of the image and masked versions of the text independently, then combines them to query the model for similarity predictions. Because the image and text encoders run separately, each masked input only needs to be encoded once and can then be paired with every masked input from the other modality. According to the paper, this makes the computation 5 to 20 times faster than previous methods. FIXLIP also uses ‘p-weighted masking’, where the weight p controls how much of the input is kept, so that the model isn’t queried on inputs that are too distorted or ‘out-of-distribution’, which could lead to unreliable explanations.
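The sampling idea can be sketched as follows, under stated assumptions: `toy_encode` is a hypothetical stand-in for a real image or text encoder (it just sums the vectors of the kept tokens), the similarity is a plain dot product, and each token is kept independently with probability `p`. None of these names come from the paper.

```python
import random

def sample_masks(num_tokens, num_masks, p, rng):
    """Sample binary masks; each token is kept (1) with probability p."""
    return [[1 if rng.random() < p else 0 for _ in range(num_tokens)]
            for _ in range(num_masks)]

def toy_encode(token_vectors, mask):
    """Hypothetical encoder: sum the vectors of the tokens that are kept."""
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec, keep in zip(token_vectors, mask) if keep)
            for d in range(dim)]

def cross_modal_similarities(img_tokens, txt_tokens, m, n, p=0.7, seed=0):
    """Sample m image masks and n text masks independently, encode each
    masked input once, and score all m * n image-text pairs."""
    rng = random.Random(seed)
    img_masks = sample_masks(len(img_tokens), m, p, rng)
    txt_masks = sample_masks(len(txt_tokens), n, p, rng)
    img_emb = [toy_encode(img_tokens, mask) for mask in img_masks]  # m encoder calls
    txt_emb = [toy_encode(txt_tokens, mask) for mask in txt_masks]  # n encoder calls
    sims = [[sum(a * b for a, b in zip(u, v)) for v in txt_emb] for u in img_emb]
    return img_masks, txt_masks, sims

img_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 hypothetical patch vectors
txt_tokens = [[0.5, 0.5], [1.0, 0.0]]              # 2 hypothetical word vectors
img_masks, txt_masks, sims = cross_modal_similarities(img_tokens, txt_tokens, m=4, n=3)
print(len(sims), "x", len(sims[0]), "similarities from", 4 + 3, "encoder calls")
```

The point of the design is the cost: m + n encoder passes yield m × n similarity values, whereas masking the joint input one sample at a time would need a fresh pair of encoder passes for every value.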
The method then uses a regression-based approximation to assign ‘attribution scores’ to individual tokens and ‘interaction scores’ to pairs of tokens. These scores reveal which parts of the image and text, and their combinations, contribute positively or negatively to the model’s similarity prediction.
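As a rough illustration of the regression step, the sketch below fits a second-order surrogate with one coefficient per token and one per cross-modal token pair. It is an assumption-laden toy, not the paper's estimator: `toy_similarity` is a made-up model over two image tokens and two text tokens, all 16 masks are enumerated instead of sampled so the fit is exact, features are centered at the keep probability `p`, and a small pure-Python solver stands in for a numerical library.

```python
import itertools

def design_row(img_mask, txt_mask, p):
    """Features: intercept, centered token indicators, cross-modal products."""
    zi = [x - p for x in img_mask]
    zt = [x - p for x in txt_mask]
    cross = [a * b for a in zi for b in zt]
    return [1.0] + zi + zt + cross

def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= factor * M[col][c]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_second_order(samples, p):
    """Least-squares fit; `samples` holds (img_mask, txt_mask, similarity) triples."""
    X = [design_row(im, tm, p) for im, tm, _ in samples]
    y = [sim for _, _, sim in samples]
    k = len(X[0])
    XtX = [[sum(r[a] * r[b] for r in X) for b in range(k)] for a in range(k)]
    Xty = [sum(r[a] * yv for r, yv in zip(X, y)) for a in range(k)]
    beta = solve_linear(XtX, Xty)
    n_i, n_t = len(samples[0][0]), len(samples[0][1])
    img_attr = beta[1:1 + n_i]                    # attribution per image token
    txt_attr = beta[1 + n_i:1 + n_i + n_t]        # attribution per text token
    flat = beta[1 + n_i + n_t:]
    inter = [flat[i * n_t:(i + 1) * n_t] for i in range(n_i)]  # image x text
    return img_attr, txt_attr, inter

def toy_similarity(img_mask, txt_mask):
    """Made-up model: image tokens (dog_patch, grass_patch), text tokens (black, dog)."""
    dog_patch, grass_patch = img_mask
    black, dog = txt_mask
    return 0.5 * dog_patch * dog + 0.3 * dog_patch * black + 0.05 * grass_patch

p = 0.5
samples = [(im, tm, toy_similarity(im, tm))
           for im in itertools.product([0, 1], repeat=2)
           for tm in itertools.product([0, 1], repeat=2)]
img_attr, txt_attr, inter = fit_second_order(samples, p)
print("interactions (rows: patches, cols: words):", inter)
```

Here the fitted interaction matrix recovers the two cross-modal effects built into the toy model (dog patch with ‘dog’, dog patch with ‘black’) and assigns zero to the grass patch pairs. With sampled rather than enumerated masks, the same fit would only approximate these values.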
Evaluating Explanations
The researchers also propose new ways to evaluate the quality of these second-order explanations. They extend existing metrics like the ‘pointing game’ and ‘area between insertion/deletion curves’ to assess how faithfully FIXLIP can identify important image-text interactions. For instance, the pointing game evaluates whether the explanation correctly highlights the relevant image regions for specific text objects, even when multiple objects are present.
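A simplified version of a second-order pointing game could look like the sketch below. It assumes an interaction matrix with one row per image patch and one column per text token, plus hypothetical ground-truth sets of patches covered by each named object; the paper's actual protocol and scoring may differ.

```python
def pointing_game_hit(interactions, txt_index, object_patches):
    """True if the patch with the strongest interaction for this text token
    lies inside the ground-truth region of the object the token names."""
    scores = [row[txt_index] for row in interactions]
    top_patch = max(range(len(scores)), key=scores.__getitem__)
    return top_patch in object_patches

def pointing_game_accuracy(interactions, ground_truth):
    """Fraction of text tokens whose top-interacting patch hits its object.
    `ground_truth` maps a text token index to a set of patch indices."""
    hits = [pointing_game_hit(interactions, t, patches)
            for t, patches in ground_truth.items()]
    return sum(hits) / len(hits)

# Hypothetical scores: rows are 3 image patches, columns are the words (cat, dog).
interactions = [
    [0.02, 0.45],  # patch 0 shows the dog
    [0.40, 0.05],  # patch 1 shows the cat
    [0.01, 0.03],  # patch 2 is background
]
ground_truth = {0: {1}, 1: {0}}  # 'cat' covers patch 1, 'dog' covers patch 0
accuracy = pointing_game_accuracy(interactions, ground_truth)
print(accuracy)
```

Because each word gets its own column of patch scores, the check can tell the cat region from the dog region, whereas a single saliency map over the image would highlight both objects for either word.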
Key Findings and Utility
Experiments on standard benchmarks like MS COCO and ImageNet-1k demonstrate that FIXLIP outperforms first-order attribution methods. It provides more faithful explanations, meaning its explanations better reflect how the model actually makes predictions. The pointing game results show that FIXLIP can successfully distinguish between multiple objects in an image based on the text, a task where first-order methods often fail.
Beyond delivering high-quality explanations, FIXLIP proves useful for comparing different vision-language models, such as CLIP versus SigLIP-2, and different versions of the same model. This allows developers to understand the strengths and weaknesses of various model architectures in a more granular way.
The paper also highlights the practical utility of FIXLIP through visual examples. It can show, for instance, whether a model is getting the ‘right answer for the wrong reasons’, such as associating the text ‘doll’ with an image of a ‘dollar’ sign because of visual similarity rather than semantic meaning. It also allows for ‘conditioning on tokens’, where you can see how a specific word or image patch interacts with all other elements, and for ‘visualizing subsets’ of tokens that contribute most or least to similarity.
While FIXLIP represents a significant step forward, the authors acknowledge limitations, such as room for further computational optimization and the need for more human-computer interaction studies to assess its usability. Future work could also explore extending it to higher-order interactions or to multimodal settings beyond image and text.
Ultimately, FIXLIP aims to empower model developers in debugging vision-language encoders, understanding their predictions, and identifying unwanted biases in image-text data, especially as these models are increasingly applied in high-stakes decision-making scenarios.


