
Unpacking How AI Sees: New Study Challenges Texture Bias in CNNs

TLDR: A new study challenges the long-held belief that ImageNet-trained CNNs are inherently texture-biased. Using a novel feature suppression framework, researchers found that CNNs primarily rely on local shape features, a reliance that can be mitigated by modern training. The study also shows that feature reliance varies across domains: computer vision models prioritize shape, medical imaging models emphasize color, and remote sensing models depend heavily on texture.

For a long time, the prevailing idea in the field of artificial intelligence has been that Convolutional Neural Networks, or CNNs, tend to focus on textures rather than shapes when they interpret images. This contrasts with how humans typically perceive objects, where shape is often the dominant cue. This hypothesis gained significant traction from the “cue-conflict experiment” conducted by Geirhos et al., which suggested that CNNs trained on large datasets like ImageNet had an inherent bias towards texture.

However, a new study by Tom Burgert, Oliver Stoll, Paolo Rota, and Begüm Demir questions this long-held belief. Their research proposes a new framework for re-evaluating which visual features these models actually rely on.

The original cue-conflict experiment involved creating unique images that combined the shape of one object with the texture of another. When presented with these hybrid images, CNNs frequently classified them based on the texture, while human observers consistently relied on the shape. This divergence led to the widely accepted narrative that there was a fundamental difference in visual processing between AI and human perception.
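To make the cue-conflict measurement concrete, here is a minimal sketch of how a "shape bias" score is commonly computed from such hybrid images: among the trials where the model picked either the shape class or the texture class, it is the fraction decided by shape. The function name and arguments are illustrative choices made here, not taken from either paper's code.

```python
from typing import Sequence


def shape_bias(predictions: Sequence[str],
               shape_labels: Sequence[str],
               texture_labels: Sequence[str]) -> float:
    """Fraction of shape decisions among trials decided by shape or texture.

    Each trial is one hybrid image: `shape_labels[i]` is the class of its
    shape, `texture_labels[i]` the class of its texture, and
    `predictions[i]` the class the model (or a human) chose.
    """
    shape_hits = 0
    texture_hits = 0
    for pred, shape, texture in zip(predictions, shape_labels, texture_labels):
        if shape == texture:
            continue  # no conflict between the cues, so the trial is uninformative
        if pred == shape:
            shape_hits += 1
        elif pred == texture:
            texture_hits += 1
        # predictions matching neither cue are ignored

    decided = shape_hits + texture_hits
    if decided == 0:
        raise ValueError("no prediction matched either cue")
    return shape_hits / decided
```

A score near 1 means decisions mostly follow shape and a score near 0 means they mostly follow texture. The score only has room for those two cues, which is the binary framing the new study criticizes.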

Burgert and his team, however, pointed out several limitations in this traditional cue-conflict setup. They argued that the experiment oversimplified feature reliance into a binary choice between shape and texture, potentially overlooking other crucial visual cues like color. From a methodological standpoint, the stimuli used in the experiment might have unintentionally mixed multiple features, distributed texture cues unevenly across images, and even influenced human judgments through response interfaces that favored shape. These factors, they suggested, could have distorted the conclusions about how both models and humans truly utilize visual features.

To address these issues, the researchers developed an innovative, domain-agnostic evaluation framework. Instead of forcing models to choose between shape and texture, their method quantifies feature reliance by systematically suppressing individual visual cues (shape, texture, and color) and then measuring the resulting impact on classification performance. This approach employs direct feature-suppressing transformations, avoiding the complexities of adversarial inputs or neural style transfer, which allows for a clearer assessment of a model’s dependence on specific visual information.
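The suppress-and-measure idea can be sketched in a few lines of Python with NumPy. This is an illustration under stated assumptions, not the authors' implementation: the article does not say which transformations the framework uses, so the ones below (grayscale for color, a box blur for texture, patch shuffling for shape) are stand-ins chosen here, and the reliance score (relative accuracy drop) is likewise a hypothetical definition. `classify` stands for any function that maps an image array to a label.

```python
import numpy as np


def suppress_color(img):
    """Remove color by replacing every pixel with its channel mean."""
    gray = img.mean(axis=2, keepdims=True)
    return np.repeat(gray, 3, axis=2)


def suppress_texture(img, k=5):
    """Smooth fine surface detail with a k x k box blur (k odd).

    Assumption: a plain blur is only a stand-in; it also softens edges.
    """
    pad = k // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    h, w = img.shape[:2]
    out = np.zeros(img.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)


def suppress_shape(img, grid=4, rng=None):
    """Destroy spatial layout by shuffling a grid x grid set of patches."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape[:2]
    ph, pw = h // grid, w // grid
    img = img[:ph * grid, :pw * grid]  # crop so patches tile exactly
    patches = [img[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
               for r in range(grid) for c in range(grid)]
    order = rng.permutation(len(patches))
    rows = [np.concatenate([patches[order[r * grid + c]] for c in range(grid)],
                           axis=1)
            for r in range(grid)]
    return np.concatenate(rows, axis=0)


def accuracy(classify, images, labels, transform=None):
    """Share of images classified correctly, optionally after a transform."""
    correct = 0
    for img, label in zip(images, labels):
        x = img if transform is None else transform(img)
        correct += int(classify(x) == label)
    return correct / len(labels)


def reliance(classify, images, labels, transform):
    """Relative accuracy drop when one cue is suppressed (hypothetical score)."""
    base = accuracy(classify, images, labels)
    if base == 0:
        raise ValueError("baseline accuracy is zero; relative drop is undefined")
    return (base - accuracy(classify, images, labels, transform)) / base
```

Calling `reliance` once per transformation yields a three-number profile (shape, texture, color) for a model, rather than a single shape-versus-texture ratio. A larger value means the model loses more accuracy when that cue is removed.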

Through their extensive experiments, the team discovered that CNNs are not inherently biased towards texture. Instead, they primarily rely on local shape features. For example, a standard ResNet50 model experienced significant performance drops when local shape information was removed, yet it maintained much of its accuracy when texture was suppressed. This reliance on local shape, however, can be substantially reduced through modern training strategies and more recent architectures such as ConvNeXt and Vision Transformers (ViTs). Interestingly, models trained with vision-language supervision, like CLIP-ViT, demonstrated feature reliance patterns that most closely mirrored human behavior. This indicates that the choice of training method can push models toward representations that are more aligned with human perception.

The study further expanded its analysis to various visual domains, including computer vision (CV), medical imaging (MI), and remote sensing (RS). The findings revealed that feature reliance patterns systematically differ across these domains. Computer vision models, especially when trained on natural images, predominantly prioritize shape. Medical imaging models, conversely, showed a stronger emphasis on color, which is often a critical diagnostic indicator in medical tasks. Remote sensing models exhibited a pronounced dependence on both texture and color, reflecting the nature of aerial imagery where land cover categories are frequently defined by fine-grained surface patterns and chromatic cues.

These findings challenge the long-standing texture bias hypothesis, suggesting that feature reliance in deep learning models is not a fixed architectural bias but rather a flexible characteristic shaped by training objectives and the specific properties of the data domain. This new understanding opens up possibilities for designing AI models that better align with human perceptual strategies, potentially leading to more robust, interpretable, and effective systems. The code for the framework is publicly available for further research, and the full paper provides more details.

Nikhil Patel
Nikhil Patel is a tech analyst and AI news reporter who brings a practitioner's perspective to every article. With prior experience working at an AI startup, he decodes the business mechanics behind product innovations, funding trends, and partnerships in the GenAI space.
