MMReID-Bench: A New Standard for Multi-modal Person Re-identification

TLDR: A new benchmark, MMReID-Bench, has been introduced to evaluate how well Multi-modal Large Language Models (MLLMs) can perform person re-identification across 10 diverse tasks, including various image types and text. The research shows that while MLLMs excel in many scenarios, they face significant challenges with thermal and infrared data, highlighting areas for future improvement in cross-modal understanding.

Person re-identification, often abbreviated as ReID, is a crucial technology that aims to find images of a specific person from a collection of gallery images. This technology has wide-ranging applications, from medical rehabilitation and detecting unusual behavior to enhancing public security. Traditionally, ReID models have been limited to handling only one type of data at a time, like standard RGB images. This uni-modal approach means they struggle to generalize effectively when faced with diverse data types such as thermal images, infrared images, sketches, or even textual descriptions of a person.

The recent rise of multi-modal large language models (MLLMs) has opened up a promising new path to overcome this limitation. However, existing methods that use MLLMs for ReID have not fully leveraged their advanced capabilities. Instead, they often treat MLLMs merely as tools for extracting features or generating captions, missing out on their potential for complex reasoning, following instructions, and understanding information across different modalities.

To address this gap, researchers have introduced MMReID-Bench, the first multi-task, multi-modal benchmark specifically designed for person ReID. This comprehensive benchmark includes 20,710 multi-modal queries and gallery images, covering 10 distinct person ReID tasks. These tasks encompass a wide variety of input types, including standard RGB images, sketches, synthetic images, images from unmanned aerial vehicles (UAVs), occluded person images, cloth-changing scenarios, group re-identification, image-text matching, visible-thermal image matching, and visible-infrared image matching.

The MMReID-Bench is designed to truly test the capabilities of MLLMs. Unlike traditional methods that rely on separate components for feature extraction and matching, MMReID-Bench challenges MLLMs to directly retrieve the target person from gallery images, regardless of the query’s modality. This is achieved through a unified chat template that incorporates task-specific prior information, guiding the MLLMs to analyze patterns and select the correct match.

Key Findings from the Evaluation

The research conducted extensive experiments, evaluating 15 state-of-the-art MLLMs, including both proprietary models like Gemini and GPT families, and open-source models such as Qwen2.5-VL and InternVL. The results demonstrate the remarkable potential of MLLMs in person ReID, but also highlight their current limitations.

On one hand, several MLLMs showed impressive performance on tasks like RGB image, sketch, synthetic, and occluded person ReID. For instance, GPT-4.1 achieved nearly perfect accuracy (99.65%) on the synthetic task and 99.50% on the occluded task, effectively identifying almost all target images. This indicates a strong capability in handling these common and challenging scenarios.

However, the benchmark also revealed significant struggles for most MLLMs when dealing with visible-thermal and visible-infrared person ReID tasks. Even the top-performing models achieved only around 60% accuracy in these areas. This performance drop is attributed to the inherent information loss and unique characteristics of thermal and infrared imaging modalities, which require a deeper level of cross-modal understanding.

The study also analyzed the correlation between different ReID tasks. It found that RGB image, group, synthetic, and UAV person ReID tasks are strongly correlated, suggesting similar underlying features. In contrast, tasks involving cloth-changing, sketches, image-text, and visible-thermal data showed weaker correlations with other tasks, indicating substantial modality gaps that require specialized attention in future MLLM development.

An error analysis of GPT-4.1 on the visible-thermal task revealed that while the model could extract attributes, it sometimes overemphasized minor details while overlooking more significant aspects, such as walking posture. This suggests a need for MLLMs to better prioritize and integrate different types of visual information.

Also Read:

Model Disparities and Real-World Applications

The research observed a disparity between proprietary and open-source models. While proprietary models generally performed well, open-source models are rapidly catching up in certain tasks. Interestingly, the study also found that larger models within the same series do not always outperform smaller ones, and performance across tasks is not consistently robust, challenging the common assumption of scaling laws.

The impact of gallery size (the number of images to search through) was also investigated. GPT-4o demonstrated more robustness to varying gallery sizes compared to Qwen2.5-VL-7B, maintaining consistently high performance, which is crucial for real-world scenarios involving large surveillance databases.

To showcase practical applicability, the researchers also collected a video-based person ReID dataset from existing video datasets. They simulated a forensic application scenario where MLLMs analyze video clips and textual descriptions (like witness testimonies) to identify suspects. Models from the Qwen2.5-VL family showed competitive performance in this demonstration, suggesting the potential for MLLMs to assist in real-world forensic investigations as evidence accumulates and descriptions become more detailed.

In conclusion, MMReID-Bench serves as a vital tool for evaluating and advancing MLLMs in person ReID. It highlights their impressive capabilities in many areas while clearly pointing out the challenges in cross-modal understanding, particularly with thermal and infrared data. The findings offer valuable insights for developing more robust and generalizable multi-modal foundation models for person ReID in the future.

Financial Sector Fortifies Against Surging AI-Powered Scams

Deloitte’s 2025 Outlook: Navigating Escalating AI Challenges in Human Capital

Salesforce Study Reveals Data Quality is Pivotal for Employee Trust in AI Adoption

Top Executives Sidestep Company AI Guidelines, Fueling Shadow AI Risks

Intel’s Evolving IP Strategy: A Calculated Shift Towards Core AI Innovation

Generative AI Prompts Increased Workforce Surveillance in Indian IT Sector

Financial Sector Fortifies Against Surging AI-Powered Scams

Deloitte’s 2025 Outlook: Navigating Escalating AI Challenges in Human Capital

Salesforce Study Reveals Data Quality is Pivotal for Employee Trust in AI Adoption

Top Executives Sidestep Company AI Guidelines, Fueling Shadow AI Risks

Intel’s Evolving IP Strategy: A Calculated Shift Towards Core AI Innovation

Generative AI Prompts Increased Workforce Surveillance in Indian IT Sector

MMReID-Bench: A New Standard for Multi-modal Person Re-identification

Key Findings from the Evaluation

Model Disparities and Real-World Applications

Gen AI News and Updates

Google DeepMind Unveils SIMA 2: An Advanced AI Agent for Virtual 3D Worlds

Boosting Business Efficiency: A New AI and Big Data Model for Process Optimization

New Graph Neural Networks Improve Reasoning in Assumption-Based Argumentation

Boosting Business Efficiency: A New AI and Big Data Model for Process Optimization

AlphaCast: A New Approach to Time Series Prediction Through Human-AI Collaboration

New Graph Neural Networks Improve Reasoning in Assumption-Based Argumentation

Enhancing AI Reasoning: How Recursive Refinement and Multi-Agent Systems Improve Language Model Performance

ARGUS: A Proactive Framework for Enhancing Autonomous Driving Safety

Generative AI Powers Next-Gen Autonomous Emergency Response

OR-R1: Advancing Automated Optimization with Smart, Data-Efficient AI

Enhancing GUI Agents with Memory: A New Framework for History-Aware Reasoning

ProBench: A Deeper Look into How We Evaluate AI Agents for Mobile Apps

Enhancing Large Language Model Reasoning with Concise Outputs

Ensuring Trust in Autonomous AI: A Two-Layered Monitoring Approach for Agentic Systems

MedFuse: A Multiplicative Approach to Understanding Irregular Clinical Time Series Data

HyperD: A New Framework for More Accurate and Robust Traffic Predictions

Beyond Training: Researchers Propose ‘Model Raising’ for AI with Intrinsic Values

Bridging the Divide: Why AI Needs a Qualitative Revolution

Language Models Enhance Safety Certificate Synthesis for Dynamic Systems

Subscribe to get the latest news and updates