TLDR: This study evaluates the Llama 3.2 Vision 11B model on the Visual Entailment (VE) task, which involves determining the relationship between an image and a text hypothesis. Experiments in zero-shot, few-shot, and fine-tuning settings reveal that while fine-tuning significantly improves performance (outperforming state-of-the-art), the model exhibits inconsistencies, sensitivity to prompt variations, and a tendency to hallucinate when visual information is limited. The research highlights both the potential and limitations of VE as a diagnostic tool for vision-language understanding, emphasizing the need for robust evaluation methods and dataset quality.
Recent advances in Artificial Intelligence have brought significant improvements to both Natural Language Processing and Computer Vision. Multimodal learning aims to unify these domains, enabling AI systems to interpret, reason over, and generate meaning from a combination of textual and visual inputs. This study examines how well a vision-language model actually combines information from these modalities, focusing on a task called Visual Entailment (VE).
Visual Entailment is a multimodal task that builds upon the traditional Textual Entailment (TE) task. In TE, you’re given a text premise and a text hypothesis, and your goal is to determine if the premise implies the hypothesis. The outcome is one of three labels: Entailment, Contradiction, or Neutral. The key difference in VE is that the text premise is replaced by an image. So, a model must combine a visual premise with a textual hypothesis to make a prediction.
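As a rough illustration of the task format described above, a VE item can be thought of as an image, a hypothesis, and one of three labels. The sketch below is only an assumption about how such items might be represented; the class names, field names, file name, and sentences are invented for illustration and are not taken from the paper or from any dataset.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    ENTAILMENT = "entailment"
    CONTRADICTION = "contradiction"
    NEUTRAL = "neutral"


@dataclass(frozen=True)
class VEExample:
    image_path: str  # the visual premise (hypothetical file name)
    hypothesis: str  # the textual hypothesis
    label: Label     # the gold label


# Made-up items that share one image premise; not real dataset entries.
examples = [
    VEExample("dog_on_beach.jpg", "An animal is outdoors.", Label.ENTAILMENT),
    VEExample("dog_on_beach.jpg", "A cat sleeps on a sofa.", Label.CONTRADICTION),
    VEExample("dog_on_beach.jpg", "The dog belongs to a child.", Label.NEUTRAL),
]

for example in examples:
    print(f"{example.image_path} | {example.hypothesis} -> {example.label.value}")
```

The same image can support, contradict, or leave open different hypotheses, which is why the label depends on the image-hypothesis pair rather than on either input alone.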
The research paper, titled “Probing Vision-Language Understanding through the Visual Entailment Task: promises and pitfalls,” explores the capabilities and limitations of multimodal language models, using the Llama 3.2 Vision 11B model as a test case. The authors, Elena Pitta, Tom Kouwenhoven, and Tessa Verhoef, ran a series of experiments to understand how prompt design, the number and order of in-context examples, and access to visual information influence VE performance.
The Llama 3.2 Vision Model and Dataset
The Llama 3.2 Vision 11B model is a powerful multimodal large language model that combines the Llama 3.1 8B text model with a separately trained vision adapter. It was trained on billions of image-text pairs. For this study, the e-SNLI-VE dataset was used, which is a refined version of the SNLI-VE dataset. The e-SNLI-VE dataset provides higher quality annotations and includes natural language explanations, making it suitable for evaluating both classification and reasoning.
Experimental Findings: Zero-shot, Few-shot, and Fine-tuning
The study explored three main experimental settings:
Zero-shot Inference: In this setting, the model was tested without any additional training. The results showed that Llama 3.2 Vision performed only slightly better than chance, indicating limited capabilities in a zero-shot VE scenario. A surprising finding was the model’s sensitivity to the order of class labels in the prompt, often changing its predictions for the same item based on minor variations. Furthermore, when visual information was limited (e.g., randomly cropped images) or entirely absent (black images), the model’s performance decreased, but not as drastically as one might expect. This suggests a limited reliance on visual input and a strong tendency for the model to “hallucinate” or imagine content to support hypotheses when visual information is missing.
Few-shot Inference: This involved providing the model with a few in-context examples from the training set. Three-shot inference showed a slight improvement over zero-shot, suggesting that a small number of examples can be beneficial. However, increasing the examples to six shots actually led to a decrease in performance. This indicates that more examples don’t always lead to better understanding and can introduce noise or biases related to example ordering.
Fine-tuning: The most significant improvement was observed after fine-tuning the Llama 3.2 Vision model on the VE task. The fine-tuned model achieved an impressive accuracy of 83.3%, outperforming the previous state-of-the-art OFA-X model. Additionally, the fine-tuned model produced semantically meaningful explanations, similar to human-generated ones, as measured by a BERTScore F1-score of 89.16%. However, the study also found comparable BERTScore results in experiments with limited vision, raising questions about whether high explanation quality always reflects true visual grounding.
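To make the explanation metric above more concrete: BERTScore compares the token embeddings of a generated text with those of a reference by greedy cosine-similarity matching, then combines the resulting precision and recall into an F1 score. The sketch below shows only that matching step in plain Python. The 2-D vectors are invented stand-ins; the real metric uses contextual embeddings from a pretrained language model, and the configuration used in the study is not stated here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bertscore_f1(candidate, reference):
    """Greedy-matching precision, recall, and F1 over token embeddings."""
    # Precision: each candidate token is matched to its closest reference token.
    precision = sum(
        max(cosine(c, r) for r in reference) for c in candidate
    ) / len(candidate)
    # Recall: each reference token is matched to its closest candidate token.
    recall = sum(
        max(cosine(r, c) for c in candidate) for r in reference
    ) / len(reference)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy "embeddings", invented for illustration only.
reference = [[1.0, 0.0], [0.0, 1.0]]
candidate = [[1.0, 0.0], [1.0, 1.0]]

p, r, f1 = bertscore_f1(candidate, reference)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

Because the score depends only on how similar the two texts are, a fluent explanation can score well without being grounded in the image, which is consistent with the comparable scores reported under limited vision.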
Implications and Future Directions
The findings highlight both the utility and limitations of the Visual Entailment task as a diagnostic tool for vision-language understanding. While fine-tuning significantly boosts performance, the inconsistencies, sensitivity to prompt variations, and hallucination tendencies observed in zero-shot and few-shot settings point to underlying challenges in how these models process and integrate multimodal information. The study also identified issues with the e-SNLI-VE dataset itself, noting instances of incorrect labels or ambiguous examples.
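One way to quantify the sensitivity to prompt variations noted above is to present the same item under every ordering of the class labels and measure how often the predictions agree. The harness below is a minimal sketch under stated assumptions: the prompt wording is hypothetical and is not the prompt used in the study, and `predict` is a placeholder for whatever function calls the model.

```python
from itertools import permutations

LABELS = ("entailment", "contradiction", "neutral")


def build_prompt(hypothesis, label_order):
    """Hypothetical prompt template; only the order of the options varies."""
    options = ", ".join(label_order)
    return (
        f"Hypothesis: {hypothesis}\n"
        f"Given the image, answer with one of: {options}."
    )


def order_consistency(predict, image, hypothesis):
    """Share of label orderings that agree with the most common prediction."""
    predictions = [
        predict(image, build_prompt(hypothesis, order))
        for order in permutations(LABELS)
    ]
    most_common = max(set(predictions), key=predictions.count)
    return predictions.count(most_common) / len(predictions)


def first_option_model(image, prompt):
    """Stand-in model that always picks the first listed option (pure order bias)."""
    return prompt.split("one of: ")[1].rstrip(".").split(", ")[0]


def constant_model(image, prompt):
    """Stand-in model that ignores the prompt entirely."""
    return "neutral"


biased = order_consistency(first_option_model, None, "A dog runs on a beach")
stable = order_consistency(constant_model, None, "A dog runs on a beach")
print(f"order-biased model: {biased:.2f}, constant model: {stable:.2f}")
```

A score of 1.0 means the prediction never changes with label order. Note that a high score says nothing about correctness, so this kind of check complements accuracy rather than replacing it.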
This research underscores that even advanced multimodal language models like Llama 3.2 Vision may require specific adaptation for complex reasoning tasks like VE. It emphasizes the need for more robust evaluation methods that go beyond simple accuracy metrics to truly probe a model’s understanding. Future work could involve further investigation into few-shot inference, exploring different example sets and orderings, and integrating techniques like Chain-of-Thought prompting to encourage more coherent reasoning. Additionally, systematic prompt engineering and evaluating larger models are crucial steps for advancing the field. For more detail, refer to the full research paper.


