Mercari's New Visual Search System Drives User Engagement and Sales

TLDR: Mercari has successfully deployed a scalable visual search system in its C2C marketplace, leveraging zero-shot vision-language models. The multilingual SigLIP model significantly outperformed existing baselines in offline evaluations and led to substantial increases in user engagement, conversion rates, and transactions during online A/B testing. The system uses dimensionality reduction for efficiency and highlights the practicality of zero-shot models for real-world visual search applications, despite ongoing challenges with precise identity matching.

In consumer-to-consumer (C2C) marketplaces, where individuals list a wide range of second-hand or surplus items, finding a specific item can be difficult. Unlike traditional retail platforms with structured product catalogs, C2C listings often lack consistent naming conventions, category assignments, and uniform visual quality. This makes text-based search less effective, especially for items that are identified mainly by their visual characteristics, such as fashion, character goods, or collectibles.

Mercari, a prominent C2C marketplace in Japan with over 20 million monthly active users, recognized this challenge. They sought to enhance product discovery for both buyers and sellers by implementing a scalable visual search system. This system allows users to upload an image and find visually similar items, offering an intuitive alternative to text-based searches. It also helps sellers research market values by looking up similar items before listing their own.

The core of Mercari’s new system lies in its adoption of advanced vision-language models, particularly those capable of ‘zero-shot’ retrieval. Zero-shot models are pre-trained on massive datasets of image-text pairs and can generalize well to new domains without requiring extensive fine-tuning on specific marketplace data. This is a significant advantage in a rapidly evolving C2C environment, where traditional fine-tuned models can be costly to maintain and less robust to changes in product listings.

Mercari evaluated several models: its existing fine-tuned ‘baseline’ model, a Japanese CLIP-based model, DINOv2, and the multilingual SigLIP model. In offline evaluations built from user interaction logs, the multilingual SigLIP model was the top performer. It achieved a 13.3% increase over the baseline in nDCG@5 (normalized discounted cumulative gain over the top five results, a standard measure of ranking quality), and it led on every other reported metric, including precision and recall. SigLIP’s computational cost was also comparable to that of the other models, which made it a practical choice for production deployment.
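To make the nDCG@5 metric concrete, here is a minimal sketch of how it can be computed for a single query. The article does not say how relevance labels were derived from the interaction logs, so the binary labels below (1 if the user interacted with the result, 0 otherwise) and both example rankings are assumptions for illustration only.

```python
import math

def dcg_at_k(relevances, k=5):
    """Discounted cumulative gain over the top-k results (log2 position discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=5):
    """DCG@k divided by the DCG@k of the ideal (relevance-descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance labels for the top five results of one query.
baseline_ranking = [0, 1, 0, 0, 1]   # relevant items at positions 2 and 5
candidate_ranking = [1, 1, 0, 0, 0]  # same relevant items, ranked first

baseline_score = ndcg_at_k(baseline_ranking)
candidate_score = ndcg_at_k(candidate_ranking)
relative_gain = (candidate_score - baseline_score) / baseline_score
print(f"baseline={baseline_score:.3f} candidate={candidate_score:.3f} gain={relative_gain:+.1%}")
```

In an offline evaluation, a per-query score like this would typically be averaged over many logged queries before two models are compared.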

Qualitative assessments further confirmed SigLIP’s strength. It consistently produced more semantically relevant and contextually accurate results, even with noisy, user-uploaded images. For instance, it could accurately identify specific characters in images, a task where the baseline model often struggled to differentiate similar-looking objects. This robustness to image noise and ability to generalize to nuanced visual contexts highlighted SigLIP’s potential for a more intuitive and effective image search experience.

The visual search system handles both real-time image-based retrieval and continuous background indexing of the catalog. When a user uploads an image, an image embedding generator converts it into a compact 128-dimensional embedding, which is then used to find the most similar items in the catalog. SigLIP’s original embeddings are 768-dimensional; they are reduced to 128 dimensions using Principal Component Analysis (PCA). This reduction cut query latency by approximately 40% and memory usage and index size by 83%, without compromising search quality.
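The reduce-then-search flow can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not Mercari’s implementation: random vectors stand in for real SigLIP embeddings, the helper names (`fit_pca`, `reduce_embeddings`, `top_k`) are hypothetical, and a brute-force cosine search stands in for whatever index the production system uses, which the article does not specify.

```python
import numpy as np

def fit_pca(embeddings, n_components):
    """Fit PCA via SVD; returns the mean and the top principal directions."""
    mean = embeddings.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(embeddings - mean, full_matrices=False)
    return mean, vt[:n_components]

def reduce_embeddings(embeddings, mean, components):
    """Project onto the principal directions and L2-normalize each row."""
    reduced = (embeddings - mean) @ components.T
    norms = np.linalg.norm(reduced, axis=1, keepdims=True)
    return reduced / np.clip(norms, 1e-12, None)

def top_k(query_vec, index, k=5):
    """Brute-force cosine similarity search over unit-length vectors."""
    scores = index @ query_vec
    return np.argsort(-scores)[:k]

# Assumption: random data in place of 768-dimensional SigLIP image embeddings.
rng = np.random.default_rng(0)
catalog = rng.normal(size=(2000, 768)).astype(np.float32)

mean, components = fit_pca(catalog, n_components=128)
index = reduce_embeddings(catalog, mean, components)         # catalog index, 128-d
query = reduce_embeddings(catalog[42:43], mean, components)[0]  # reuse item 42 as the query

hits = top_k(query, index)
size_reduction = 1 - index.shape[1] / catalog.shape[1]
print(hits, f"{size_reduction:.1%}")
```

With the same numeric type per value, going from 768 to 128 dimensions shrinks each stored vector by 1 - 128/768, or about 83.3%, which is consistent with the 83% reduction in memory usage and index size reported above.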

To validate the system’s real-world impact, Mercari conducted a one-week online A/B test. The results were compelling: the group using the multilingual SigLIP model showed substantial gains in engagement and conversion. There was a 40.9% increase in average transactions per user via image search, a 34.1% increase in buyer conversion rate via image search, and a 46.6% increase in item view count per user via image search. These figures underscore the significant positive impact on user behavior and purchase rates.
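Percentages like these are usually relative lifts, meaning the treatment group's per-user metric compared against the control group's. The sketch below shows that arithmetic. The counts are hypothetical and chosen only for illustration; they are not Mercari’s data, and the article does not describe the exact computation used.

```python
def relative_lift(control_value, treatment_value):
    """Relative change of treatment over control, as a fraction (0.25 means +25%)."""
    if control_value == 0:
        raise ValueError("control value must be non-zero")
    return (treatment_value - control_value) / control_value

# Hypothetical counts for illustration only; these are NOT Mercari's data.
control = {"users": 1000, "transactions": 200}
treatment = {"users": 1000, "transactions": 250}

control_rate = control["transactions"] / control["users"]
treatment_rate = treatment["transactions"] / treatment["users"]
lift = relative_lift(control_rate, treatment_rate)
print(f"transactions per user: {lift:+.1%}")
```

A relative lift says nothing about statistical significance on its own; a real analysis would also report confidence intervals or a significance test for each metric.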

Mercari’s image search is currently used by approximately 1.5 million users per month, contributing to increased purchases and new matching experiences across categories like fashion, talent, and character goods. It also aids sellers in market price research. However, the system still struggles with precise identity matching, where users expect exact matches for specific people or animated characters. Future work aims to address this by exploring finer-grained retrieval, personalization strategies, and deeper analysis of long-term user behavior.

This work demonstrates that zero-shot vision-language models can serve as a strong, practical foundation for effective visual search in large-scale C2C marketplaces with minimal overhead, while retaining flexibility for future enhancements. The full research paper is titled “Zero-Shot Retrieval for Scalable Visual Search in a Two-Sided Marketplace.”
