TL;DR: Vision-Language Models (VLMs) struggle to understand negation, a problem called ‘affirmative bias’ that is especially acute in object detection. It stems from a lack of negation data in training corpora and from models ignoring negation cues. Researchers propose two solutions: COVAND, a dataset pipeline that combines structured reasoning with VQA verification to produce high-quality negation data, and NEGTOME, a text token merging module that structurally preserves negation cues and amplifies their signal. Combined with parameter-efficient fine-tuning, the approach significantly improves VLM performance on negation benchmarks, reducing false positives and enhancing semantic understanding across various models.
Vision-Language Models (VLMs), despite their advanced capabilities, often struggle with a fundamental aspect of human language: negation. This limitation, frequently termed ‘affirmative bias,’ means these models tend to prioritize nouns while overlooking crucial negation cues like ‘not’ or ‘without.’ This issue is particularly severe in tasks like described object detection (DOD), where distinguishing between ‘a person with a skateboard’ and ‘a person without a skateboard’ is critical. Such failures can have serious implications, especially in safety-critical fields like medical imaging, where misinterpreting ‘a tumor that is not malignant’ could lead to dangerous misdiagnoses.
The core reasons for this ‘negation blindness’ are twofold. First, existing large-scale pre-training datasets for VLMs contain a remarkably low frequency of negation words: in datasets like LAION-400M and Flickr30k, fewer than 0.1% of words are negation words, a stark contrast to real-world language, where negation is far more prevalent. Second, even when negation cues are present, current VLM architectures often assign them notably low attention weights, effectively ignoring their semantic importance during processing.
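To make the statistic concrete, here is a minimal sketch of how such a negation-word ratio could be measured over a caption corpus. The cue list and toy captions are illustrative assumptions, not the datasets' actual vocabulary or tokenization:

```python
import re

# Common English negation cues; an illustrative, non-exhaustive list.
NEGATION_CUES = {"not", "no", "without", "never", "none", "nobody",
                 "nothing", "neither", "nor", "lacking", "absent"}

def negation_word_ratio(captions):
    """Fraction of word tokens that are negation cues or n't contractions."""
    total, negated = 0, 0
    for caption in captions:
        tokens = re.findall(r"[a-z']+", caption.lower())
        total += len(tokens)
        negated += sum(t in NEGATION_CUES or t.endswith("n't") for t in tokens)
    return negated / total if total else 0.0

# Toy corpus: one affirmative and one negated caption.
captions = ["a person with a skateboard",
            "a person without a skateboard who isn't smiling"]
print(f"negation-word ratio: {negation_word_ratio(captions):.2%}")
```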
A Two-Pronged Approach to Negation Understanding
To tackle this critical challenge, researchers Inha Kang, Youngsun Lim, Seonho Lee, Jiho Choi, Junsuk Choe, and Hyunjung Shim have introduced a novel solution with two primary contributions: a new dataset pipeline called COVAND and a lightweight adaptation recipe featuring a module named NEGTOME. Their work, detailed in the paper What “Not” to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging, marks a significant step towards building more robust and reliable VLM-based detection systems.
COVAND: A Dataset Focused on Negation
To address the scarcity of negation data, the team developed COVAND (Chain-of-Thought with VQA Alignment for Negation Detection). This dataset is constructed through a systematic pipeline that combines chain-of-thought (CoT) reasoning with VQA-based caption alignment to generate high-quality, instance-grounded negation data. The process extracts both present and absent attributes from object regions in images; for each region, matched positive and negative captions are generated with a CoT approach and then semantically verified by a VQA module. This careful filtering ensures that each caption accurately reflects the presence or absence of key attributes, yielding a dataset in which approximately 9.29% of words are negation words, a frequency more than 100 times higher than in typical pre-training datasets.
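The shape of this pipeline can be summarized in a short sketch. The three callables below (attribute extraction, CoT caption generation, VQA verification) are hypothetical placeholders standing in for the paper's LLM and VQA components, not actual APIs from the work:

```python
# Hypothetical sketch of a COVAND-style generation loop; all helper
# callables are assumed stand-ins for the paper's LLM/VQA components.

def build_negation_pairs(image, regions, extract_attributes,
                         generate_captions_cot, vqa_agrees):
    """Collect VQA-verified (positive, negative) caption pairs per region."""
    samples = []
    for region in regions:
        # 1. Attributes present in / absent from this object region.
        present, absent = extract_attributes(image, region)

        # 2. Chain-of-thought generation of a matched caption pair, e.g.
        #    "a person with a skateboard" / "a person without a skateboard".
        pos_cap, neg_cap = generate_captions_cot(region, present, absent)

        # 3. Semantic verification: keep the pair only if the VQA module
        #    confirms that each caption is consistent with the region.
        if vqa_agrees(image, region, pos_cap) and vqa_agrees(image, region, neg_cap):
            samples.append({"region": region,
                            "positive": pos_cap,
                            "negative": neg_cap})
    return samples
```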
NEGTOME: Structurally Preserving Negation Cues
Beyond data limitations, the researchers observed that negation tokens often receive low attention weights because of how standard tokenizers fragment phrases. For example, ‘not’ and ‘girl’ may be treated as separate tokens, leading the model to simply detect a ‘girl’ and ignore the ‘not.’ NEGTOME (Negation-aware Text Token Merging) directly addresses this architectural flaw: it merges fragmented tokens, grouping negation cues with the attributes they modify into coherent semantic phrases. For instance, ‘not’ and ‘girl’ are bound into a single token representing ‘not girl,’ whose meaning is distinctly different from ‘girl’ alone. The merged representation is then enhanced with a ‘negation-aware boost’ that explicitly amplifies the negated signal so its polarity is preserved. The module is integrated with parameter-efficient LoRA fine-tuning, applied strategically to deep cross-attention layers to strengthen multimodal compositional understanding.
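A toy illustration of the merging idea in PyTorch: bind each negation cue to the token it modifies and amplify the fused embedding. The simplifications here (merging with only the next token, mean pooling, a fixed scalar boost) are assumptions for clarity; the actual NEGTOME module groups cues with full attribute phrases inside the model's text encoder:

```python
import torch

def negtome_merge(embeddings, cue_mask, boost=1.5):
    """Toy NEGTOME-style merge (illustrative, not the paper's exact method).

    embeddings: (seq_len, dim) text token embeddings
    cue_mask:   (seq_len,) bool, True at negation-cue positions
    boost:      >1 scaling that amplifies the negated signal
    """
    merged, skip_next = [], False
    for i in range(embeddings.size(0)):
        if skip_next:
            skip_next = False
            continue
        if cue_mask[i] and i + 1 < embeddings.size(0):
            # Bind e.g. "not" + "girl" into one token and boost its polarity.
            fused = 0.5 * (embeddings[i] + embeddings[i + 1])
            merged.append(boost * fused)
            skip_next = True
        else:
            merged.append(embeddings[i])
    return torch.stack(merged)

# Example: 5 tokens with a negation cue at position 2 ("a", "photo", "not", "girl", ".").
emb = torch.randn(5, 16)
mask = torch.tensor([False, False, True, False, False])
print(negtome_merge(emb, mask).shape)  # torch.Size([4, 16]): one pair merged
```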
Significant Improvements Across Benchmarks
The proposed method has demonstrated substantial improvements on challenging negation benchmarks. It boosted NMS-AP (a stricter metric that penalizes affirmative bias) by up to +10.8 points on OVDEval and cut the false positive rate by 19.1%. The gains held across varying description lengths and generalized to state-of-the-art detectors such as Grounding-DINO and APE-Ti, as well as to Multimodal Large Language Models (MLLMs) like Qwen-2.5-VL. A scalability analysis showed that increasing the size of the COVAND dataset yields further improvements in negation understanding, and zero-shot evaluations on the NegBench Multiple Choice Question benchmark confirmed that the method enhances semantic comprehension of negation beyond detection tasks alone.
This research offers a comprehensive solution to the long-standing problem of affirmative bias in VLMs, paving the way for more reliable and context-aware AI systems that can accurately understand not only what is present, but also what is explicitly absent.