TL;DR: AttentionViG is a new Vision Graph Neural Network (ViG) architecture that uses a cross-attention mechanism to dynamically weigh the importance of neighboring nodes. This improves feature aggregation, letting the model reach state-of-the-art image classification accuracy on ImageNet-1K and strong results in object detection and instance segmentation (MS COCO) and semantic segmentation (ADE20K) while maintaining efficiency. It addresses the limitations of fixed graph constructions by learning to focus on semantically relevant neighbors.
The field of computer vision has seen significant advancements with various neural network architectures, including Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). More recently, Vision Graph Neural Networks (ViGs) have emerged as a promising alternative, offering unique ways to process image data. A critical component of any ViG is how it aggregates features from a node’s neighbors within the graph structure.
Traditional graph convolution methods used in ViGs, such as Max-Relative, EdgeConv, GIN, and GraphSAGE, often struggle to effectively capture complex relationships between a node and its neighbors. These methods typically lack a mechanism to assign varying importance to different neighbors, treating them all equally. This can lead to suboptimal performance, especially when the graph construction method (how neighbors are chosen) is not perfect or is fixed.
To address this challenge, researchers have introduced a novel approach called AttentionViG. This new architecture proposes a cross-attention-based aggregation method. In this scheme, the “query” for attention comes from the central node, while the “keys” come from its surrounding neighbors. This allows the model to dynamically learn the relevance of each neighbor to the central node, effectively weighing their contributions. Instead of enforcing competition among neighbors (as softmax attention often does), AttentionViG uses an exponential kernel to convert similarity scores into attention weights, allowing for more flexible aggregation.
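To make the idea concrete, here is a minimal sketch of cross-attention neighbor aggregation with an exponential kernel. The function name, projection matrices, and scaling are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def attention_aggregate(center, neighbors, w_q, w_k, tau=1.0):
    """Hypothetical sketch of cross-attention neighbor aggregation.

    center:    (d,)   feature of the central node (acts as the query)
    neighbors: (k, d) features of its k neighbors (keys and values)
    w_q, w_k:  (d, d) learned projection matrices (assumed shapes)
    """
    q = center @ w_q                      # project the center into query space
    keys = neighbors @ w_k                # project each neighbor into key space
    scores = keys @ q / np.sqrt(len(q))   # scaled similarity per neighbor
    # Exponential kernel: each weight depends only on its own score, so
    # neighbors do not compete for a fixed probability budget as they
    # would under softmax normalization.
    weights = np.exp(scores / tau)
    # Weighted sum of neighbor features.
    return (weights[:, None] * neighbors).sum(axis=0)
```

An irrelevant neighbor gets a low similarity score and hence a small weight, without forcing the remaining weights to renormalize around it.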
AttentionViG is designed as a multi-scale hybrid network, combining the strengths of CNNs and GNNs. It uses inverted residual blocks for local processing and “Grapher” layers, which implement the new cross-attention aggregation, for non-local message passing. For graph construction, it adopts a computationally efficient method called Sparse Vision Graph Attention (SVGA), which uses a fixed criss-cross pattern for connecting nodes. The beauty of AttentionViG is that its cross-attention mechanism can mitigate the limitations of this fixed graph construction by intelligently filtering out semantically irrelevant neighbors.
The performance of AttentionViG was rigorously evaluated across several benchmarks. On the ImageNet-1K dataset for image classification, AttentionViG achieved state-of-the-art performance, outperforming many existing CNNs, ViTs, and other ViGs. For instance, its smallest model achieved 81.3% top-1 accuracy, and its largest model reached 83.9%.
Beyond classification, AttentionViG demonstrated strong transferability to downstream tasks. In object detection and instance segmentation on the MS COCO 2017 dataset, and semantic segmentation on the ADE20K dataset, AttentionViG models consistently delivered competitive accuracy. This indicates that the proposed method is not only effective but also maintains efficiency, offering strong performance with comparable computational costs (FLOPs) to prior vision GNN architectures.
A key insight from the research is that the cross-attention mechanism learns to amplify neighbors that are semantically related to the query location while suppressing unrelated regions. This dynamic weighting helps compensate for imperfections in how the graph is initially constructed. The researchers also found that their exponential affinity function for attention weights empirically outperforms traditional softmax, suggesting that allowing for more flexible, non-competitive attention among neighbors can be beneficial in visual tasks.
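The non-competitive property can be demonstrated with a toy comparison (scores and temperature chosen purely for illustration): under softmax, adding a second relevant neighbor shrinks the first neighbor's weight, whereas the exponential kernel leaves it unchanged:

```python
import math

def softmax_weights(scores):
    # Shared normalizer: weights sum to 1, so neighbors compete.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def exp_kernel_weights(scores, tau=1.0):
    # Each weight depends only on its own score; no shared normalizer.
    return [math.exp(s / tau) for s in scores]

scores_one_relevant = [2.0, 0.1]   # one strongly relevant neighbor
scores_two_relevant = [2.0, 2.0]   # a second relevant neighbor appears
# Softmax: the first neighbor's weight drops from ~0.87 to 0.5.
print(softmax_weights(scores_one_relevant)[0],
      softmax_weights(scores_two_relevant)[0])
# Exponential kernel: the first neighbor's weight is unchanged.
print(exp_kernel_weights(scores_one_relevant)[0],
      exp_kernel_weights(scores_two_relevant)[0])
```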
This research marks a significant step forward in Vision GNNs by providing a versatile aggregation method that can effectively capture complex node-neighbor relationships. The paper is titled “AttentionViG: Cross-Attention-Based Dynamic Neighbor Aggregation in Vision GNNs.” While the current focus is on image recognition, the proposed aggregation method has broad applicability and could be extended to other graph-based learning tasks such as video understanding, point cloud processing, and biological networks.


