TLDR: Researchers have developed SElf-Evolving Distillation (SEED), a novel method to combat ‘hallucinations’ in Large Vision-Language Models (LVLMs). SEED identifies and removes inaccurate information within the model’s internal knowledge, then distills the corrected knowledge back, allowing the model to improve itself without needing external tools or increasing processing time. This approach significantly enhances the reliability of LVLMs like LLaVA-1.5 and InternVL2 across various evaluation benchmarks.
Large Vision-Language Models (LVLMs) have made incredible strides in understanding and generating content from both images and text. They power many advanced applications, from answering questions about images to creating detailed captions. However, a significant challenge that limits their trustworthiness and real-world use is ‘hallucination’ – where the model generates information that isn’t supported by the visual input. For example, an LVLM might describe a white strawberry in a picture as red, simply because its internal knowledge biases it towards the common color of strawberries.
Traditional methods to fix these hallucinations often involve using external tools or comparing multiple rounds of the model’s responses. While somewhat effective, these approaches can drastically slow down the model’s inference time, making them less practical for real-time applications. Another method involves collecting vast amounts of new, diverse training data, which is incredibly expensive and time-consuming.
Introducing Self-Evolving Distillation (SEED)
A new research paper, titled “Identify, Isolate, and Purge: Mitigating Hallucinations in LVLMs via Self-Evolving Distillation,” introduces a method called SElf-Evolving Distillation (SEED) that tackles hallucinations from within the LVLM’s own knowledge. Instead of relying on external help or costly data collection, SEED identifies the parts of the model’s internal knowledge that lead to hallucinations, isolates them, purges them, and then distills the purified knowledge back into the model. This allows the LVLM to ‘self-evolve’ and correct its own biases.
The core idea behind SEED is to make the model learn from its own mistakes and refine its understanding. It does this in several clever steps:
- Hallucination Identification: SEED first identifies potential hallucinations by assessing the model’s confidence in its own outputs. If the model shows low confidence in a particular piece of information, it’s flagged as potentially hallucinatory.
- Hallucination Isolation: To pinpoint the exact hallucinated knowledge, SEED subtly alters the visual input by adding noise. This makes the LVLM rely more on its learned textual biases, effectively bringing out the hallucinated parts of its knowledge.
- Hallucination Purification: Once identified and isolated, the hallucinatory knowledge is then ‘subtracted’ from the model’s original knowledge, resulting in a purified version. The degree of purification is adjusted based on how confident the model was initially.
- Distilling Knowledge Back: The purified knowledge is then distilled back into the LVLM. This is like teaching the model the correct information, replacing its biased understanding. Crucially, this process doesn’t add any extra steps or time during the model’s normal operation. A rough sketch of these steps follows the list.
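The paper’s implementation is not reproduced here, but the identify-isolate-purify pipeline described above can be sketched roughly as follows. This is a minimal, illustrative sketch assuming a PyTorch-style LVLM that returns per-token logits; the interface (`model(image, text_ids)`), the Gaussian-noise isolation, and the confidence-based scaling `alpha` are assumptions based on the description above, not the authors’ released code.

```python
import torch
import torch.nn.functional as F

def purify_logits(model, image, text_ids, noise_std=0.1, conf_threshold=0.5):
    """Illustrative sketch of SEED's identify / isolate / purify steps (hypothetical API)."""
    # Identification: run the model on the clean image and measure its
    # confidence in each predicted token.
    clean_logits = model(image, text_ids)        # [seq_len, vocab_size], assumed interface
    probs = F.softmax(clean_logits, dim=-1)
    confidence = probs.max(dim=-1).values        # per-token confidence

    # Isolation: add noise to the visual input so the model leans on its
    # textual priors, surfacing the hallucinated knowledge.
    noisy_image = image + noise_std * torch.randn_like(image)
    hallucinated_logits = model(noisy_image, text_ids)

    # Purification: subtract the hallucinated component from the original
    # knowledge, correcting low-confidence tokens more strongly.
    alpha = (1.0 - confidence).unsqueeze(-1)     # assumed confidence-based scaling
    purified_logits = clean_logits - alpha * hallucinated_logits

    # Only tokens flagged as potentially hallucinatory are replaced.
    mask = (confidence < conf_threshold).unsqueeze(-1)
    return torch.where(mask, purified_logits, clean_logits)
```

The purified logits would then serve as the teacher signal when the corrected knowledge is distilled back into the model, which is why no extra computation is needed at inference time.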
Enhancements for Stability and Accuracy
The researchers also introduced two key enhancements to SEED:
- Mode-Seeking Evolving: Traditional knowledge distillation methods can sometimes lead to the model assigning probabilities to “void spaces” – regions of the output that don’t correspond to meaningful information. Mode-Seeking Evolving ensures that the distillation process focuses on the most dominant and accurate patterns in the purified knowledge, preventing chaotic or nonsensical outputs.
- Hallucination Elimination Adapter: To keep this self-evolution process stable, a dedicated adapter is used. The original LVLM remains frozen as the knowledge source, while only the adapter itself is updated during the purification process. This not only makes training more robust but also saves significant memory. A sketch of both enhancements appears below.
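Mode-seeking behaviour is commonly obtained by distilling with the reverse KL divergence rather than the forward KL, and freezing the base model while training only an adapter is a standard way to stabilise such updates. The sketch below illustrates both ideas; the reverse-KL loss form and the LoRA-style adapter setup are assumptions drawn from the description above, not the paper’s exact formulation.

```python
import torch
import torch.nn.functional as F

def mode_seeking_distill_loss(student_logits, purified_teacher_logits, temperature=1.0):
    """Reverse-KL distillation, KL(student || teacher): the student concentrates on the
    teacher's dominant modes instead of spreading probability over 'void' regions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_log_probs = F.log_softmax(purified_teacher_logits / temperature, dim=-1)
    student_probs = student_log_probs.exp()
    # KL(p_student || p_teacher) = sum_x p_student(x) * (log p_student(x) - log p_teacher(x))
    return (student_probs * (student_log_probs - teacher_log_probs)).sum(dim=-1).mean()

# Stability sketch: freeze the base LVLM so it remains a fixed knowledge source,
# and update only a lightweight adapter (e.g. LoRA-style low-rank layers).
# for param in base_model.parameters():
#     param.requires_grad = False
# optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)
```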
Impressive Results Across Benchmarks
Extensive experiments were conducted on popular LVLMs such as LLaVA-1.5 and InternVL2, using widely recognized benchmarks including POPE, MME, and MM-Vet, which cover both hallucination and broader multimodal capability. The results were highly promising: for instance, LLaVA-1.5’s F1 score on the POPE-Random setting improved from 81.3 to 88.3. Across the board, SEED outperformed existing methods, often achieving better results at roughly half the inference cost of some multi-round approaches.
The paper highlights practical examples where SEED corrected common LVLM biases. For example, an original LLaVA model might incorrectly assume Cristiano Ronaldo is always associated with a soccer ball, even when shown a picture of him playing table tennis. After SEED, the model accurately described the table tennis scene, demonstrating its ability to align responses with visual input. This research marks a significant step towards making LVLMs more reliable and trustworthy for a wide range of applications. You can read the full research paper here.


