Evaluating AI Models for Detailed Fashion Product Tagging

TLDR: A study evaluated GPT-4o-mini and Gemini 2.0 Flash for automatically identifying fine-grained fashion attributes from images in a zero-shot setting. Gemini 2.0 Flash showed superior accuracy (56.79% F1 score) and was also more cost-effective and faster than GPT-4o-mini. While promising for e-commerce product attribution, the models performed better on prominent attributes and struggled with subtle details, indicating a need for further domain-specific refinement or human-in-the-loop systems.

The world of online fashion retail thrives on understanding its products. Imagine browsing a clothing website and being able to filter by very specific details like sleeve length, fabric type, or even the style of a neckline. This ability to accurately categorize and tag products with detailed attributes is called product attribution, and it’s crucial for a smooth customer experience and efficient inventory management.

Traditionally, product attribution has been a labor-intensive process, often relying on human annotators or seller-provided information. As fashion marketplaces grow to include millions of items, this manual approach becomes slow, prone to errors, and difficult to scale. This challenge has led industry experts to explore whether advanced artificial intelligence, specifically large language models (LLMs) with multimodal capabilities (meaning they can understand both text and images), could automate this complex task.

A recent research paper, titled “Can GPT-4o mini and Gemini 2.0 Flash Predict Fine-Grained Fashion Product Attributes? A Zero-Shot Analysis,” delves into this very question. Authored by Shubbham Shukla and Kunal Sonalkar, the study evaluates the performance of two state-of-the-art, cost-efficient LLMs: GPT-4o-mini and Gemini 2.0 Flash. The goal was to see how well these models could identify detailed fashion attributes directly from images, without any prior specific training on fashion data – a method known as “zero-shot” analysis.

The researchers used the DeepFashion-MultiModal dataset, which contains high-quality, human-annotated labels for 18 different fashion attribute categories. These categories are grouped into three main classes: Shape Attributes (like sleeve length, neckline, and accessories), Fabric Type Attributes (such as denim, cotton, or leather), and Color Pattern Attributes (like floral, striped, or pure color). The models were given only an image as input, making it a pure test of their visual understanding.

The methodology involved feeding an input image to a “Prompt generation module” which created a query for the LLM. This query, along with the image, was sent to a “Prediction Engine” that interfaced with OpenRouter to call either Gemini 2.0 Flash or GPT-4o-mini. The model’s raw output was then parsed into a standardized format, and an “Evaluation Engine” compared these predictions against the true labels to calculate performance metrics like precision, recall, and F1-score.

The study conducted two main experiments. In the first, a “high-creativity” setting (with higher temperature and top p values), Gemini 2.0 Flash significantly outperformed GPT-4o-mini, achieving a macro F1-score of 49.72% compared to GPT-4o-mini’s 37.31%.

The second experiment used a more “deterministic” setting (with lower temperature and top p values), which encourages more focused and predictable outputs. In this setup, both models improved their performance. Gemini 2.0 Flash again demonstrated superior results with a macro F1-score of 56.79%, while GPT-4o-mini reached 43.28%. This highlights that for structured classification tasks like product attribution, reducing the model’s creative freedom leads to more accurate predictions.

Beyond just accuracy, the researchers also analyzed the practical aspects of using these models: cost and speed. Gemini 2.0 Flash proved to be not only more accurate but also more efficient. It was approximately 12.5% cheaper and 24% faster than GPT-4o-mini for processing 1000 images, making it a more economically viable option for large-scale deployment in e-commerce.

The findings indicate that while both models can perform zero-shot attribute extraction, Gemini 2.0 Flash is the clear leader in terms of accuracy, speed, and cost-efficiency. However, the study also revealed that both models performed better on visually prominent attributes like “Hat” and “Sleeve Length” but struggled with more subtle details such as “Neckline” and “Waist Accessories.” This suggests that while these LLMs have strong general visual recognition, they may still lack the specialized knowledge needed for consistently identifying very nuanced fashion details.

In conclusion, this research suggests that lightweight LLMs like Gemini 2.0 Flash can be powerful tools for e-commerce platforms. They can help reduce manual labor, speed up the process of adding new products, and enrich product catalogs, ultimately improving the customer’s shopping experience. While they might not yet fully replace human annotators for every subtle detail, they offer a promising solution for integrating AI-driven attribution into existing workflows. For more details, you can read the full research paper here.

Also Read:

Future work in this area includes exploring more advanced prompt engineering techniques, benchmarking these LLMs against specialized, fine-tuned computer vision models, and expanding the scope to include more subjective fashion attributes and diverse datasets.

Financial Sector Fortifies Against Surging AI-Powered Scams

Deloitte’s 2025 Outlook: Navigating Escalating AI Challenges in Human Capital

Salesforce Study Reveals Data Quality is Pivotal for Employee Trust in AI Adoption

Top Executives Sidestep Company AI Guidelines, Fueling Shadow AI Risks

Intel’s Evolving IP Strategy: A Calculated Shift Towards Core AI Innovation

Generative AI Prompts Increased Workforce Surveillance in Indian IT Sector

Financial Sector Fortifies Against Surging AI-Powered Scams

Deloitte’s 2025 Outlook: Navigating Escalating AI Challenges in Human Capital

Salesforce Study Reveals Data Quality is Pivotal for Employee Trust in AI Adoption

Top Executives Sidestep Company AI Guidelines, Fueling Shadow AI Risks

Intel’s Evolving IP Strategy: A Calculated Shift Towards Core AI Innovation

Generative AI Prompts Increased Workforce Surveillance in Indian IT Sector

Evaluating AI Models for Detailed Fashion Product Tagging

Gen AI News and Updates

Consumers Increasingly Embrace AI for Holiday Shopping, Applause Survey Finds

AI Agents, Stablecoins, and Biometrics Poised to Revolutionize Global Payment Systems by 2026

Alby’s AI Shopping Agents Surpass 20 Million Consumer Conversations, Driving Significant Revenue and Sales Conversion for E-commerce Brands

Boosting Business Efficiency: A New AI and Big Data Model for Process Optimization

AlphaCast: A New Approach to Time Series Prediction Through Human-AI Collaboration

New Graph Neural Networks Improve Reasoning in Assumption-Based Argumentation

Enhancing AI Reasoning: How Recursive Refinement and Multi-Agent Systems Improve Language Model Performance

ARGUS: A Proactive Framework for Enhancing Autonomous Driving Safety

Generative AI Powers Next-Gen Autonomous Emergency Response

OR-R1: Advancing Automated Optimization with Smart, Data-Efficient AI

Enhancing GUI Agents with Memory: A New Framework for History-Aware Reasoning

ProBench: A Deeper Look into How We Evaluate AI Agents for Mobile Apps

Enhancing Large Language Model Reasoning with Concise Outputs

Ensuring Trust in Autonomous AI: A Two-Layered Monitoring Approach for Agentic Systems

MedFuse: A Multiplicative Approach to Understanding Irregular Clinical Time Series Data

HyperD: A New Framework for More Accurate and Robust Traffic Predictions

Beyond Training: Researchers Propose ‘Model Raising’ for AI with Intrinsic Values

Bridging the Divide: Why AI Needs a Qualitative Revolution

Language Models Enhance Safety Certificate Synthesis for Dynamic Systems

Subscribe to get the latest news and updates