TLDR: A new benchmark, MMReID-Bench, has been introduced to evaluate how well Multi-modal Large Language Models (MLLMs) can perform person re-identification across 10 diverse tasks, including various image types and text. The research shows that while MLLMs excel in many scenarios, they face significant challenges with thermal and infrared data, highlighting areas for future improvement in cross-modal understanding.
Person re-identification, often abbreviated as ReID, is a technology that retrieves images of a specific person from a collection of gallery images. It has wide-ranging applications, from medical rehabilitation and detecting unusual behavior to enhancing public security. Traditionally, ReID models have been limited to handling only one type of data at a time, such as standard RGB images. Because of this uni-modal design, they struggle to generalize to other data types such as thermal images, infrared images, sketches, or textual descriptions of a person.
The recent rise of multi-modal large language models (MLLMs) has opened up a promising new path to overcome this limitation. However, existing methods that use MLLMs for ReID have not fully leveraged their advanced capabilities. Instead, they often treat MLLMs merely as tools for extracting features or generating captions, missing out on their potential for complex reasoning, following instructions, and understanding information across different modalities.
To address this gap, researchers have introduced MMReID-Bench, the first multi-task, multi-modal benchmark specifically designed for person ReID. This comprehensive benchmark includes 20,710 multi-modal queries and gallery images, covering 10 distinct person ReID tasks. These tasks encompass a wide variety of input types, including standard RGB images, sketches, synthetic images, images from unmanned aerial vehicles (UAVs), occluded person images, cloth-changing scenarios, group re-identification, image-text matching, visible-thermal image matching, and visible-infrared image matching.
MMReID-Bench is designed to test the capabilities of MLLMs directly. Unlike traditional methods that rely on separate components for feature extraction and matching, MMReID-Bench challenges MLLMs to retrieve the target person from the gallery images themselves, regardless of the query’s modality. This is achieved through a unified chat template that incorporates task-specific prior information, guiding the MLLMs to analyze patterns and select the correct match.
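The idea of a unified chat template with a task-specific prior can be sketched in a few lines of Python. This is only an illustration: the article does not reproduce the benchmark's actual template, so the hint strings, the function names `build_prompt` and `parse_answer`, and the answer format are all assumptions.

```python
import re
from typing import Optional

# Hypothetical task hints. The benchmark's real template wording is not
# reproduced in this article, so every string below is an assumption.
TASK_PRIORS = {
    "visible-infrared": "The query is an infrared image, so color is unreliable; "
                        "compare body shape and clothing structure.",
    "sketch": "The query is a hand-drawn sketch; compare outlines and salient "
              "attributes rather than texture.",
    "image-text": "The query is a textual description; match the described "
                  "attributes against each gallery image.",
}


def build_prompt(task: str, gallery_size: int) -> str:
    """Compose one instruction shared by all tasks, varying only the hint."""
    prior = TASK_PRIORS.get(task, "Query and gallery are standard RGB images.")
    return (
        f"You are given one query and {gallery_size} gallery images, "
        f"numbered 1 to {gallery_size}.\n"
        f"Task: {task} person re-identification.\n"
        f"Hint: {prior}\n"
        "Answer with the number of the gallery image that shows the same "
        "person as the query."
    )


def parse_answer(reply: str, gallery_size: int) -> Optional[int]:
    """Return the first in-range gallery index mentioned in the reply, if any."""
    for token in re.findall(r"\d+", reply):
        index = int(token)
        if 1 <= index <= gallery_size:
            return index
    return None


if __name__ == "__main__":
    print(build_prompt("sketch", gallery_size=4))
    print(parse_answer("The answer is 3.", gallery_size=4))
```

Keeping the instruction fixed and swapping only the hint means one prompt format can serve every task, so differences in results are less likely to come from prompt wording.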
Key Findings from the Evaluation
The researchers ran extensive experiments, evaluating 15 state-of-the-art MLLMs, including proprietary models from the Gemini and GPT families and open-source models such as Qwen2.5-VL and InternVL. The results demonstrate the considerable potential of MLLMs in person ReID, but also highlight their current limitations.
On one hand, several MLLMs showed impressive performance on tasks like RGB image, sketch, synthetic, and occluded person ReID. For instance, GPT-4.1 achieved 99.65% accuracy on the synthetic task and 99.50% on the occluded task, identifying almost all target images. This indicates a strong capability in these scenarios, including the more challenging occluded case.
However, the benchmark also revealed significant struggles for most MLLMs when dealing with visible-thermal and visible-infrared person ReID tasks. Even the top-performing models achieved only around 60% accuracy in these areas. This performance drop is attributed to the inherent information loss and unique characteristics of thermal and infrared imaging modalities, which require a deeper level of cross-modal understanding.
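The accuracy figures quoted above are, in essence, the share of queries for which the model picked the correct gallery image. The Python sketch below computes that per task. The record layout and the function name `accuracy_by_task` are assumptions for illustration, and the sample records are invented placeholders, not benchmark results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Assumed record layout: (task name, predicted gallery index, true gallery index).
Record = Tuple[str, int, int]


def accuracy_by_task(records: Iterable[Record]) -> Dict[str, float]:
    """Return accuracy in percent for each task found in the records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, truth in records:
        total[task] += 1
        correct[task] += int(predicted == truth)
    return {task: 100.0 * correct[task] / total[task] for task in total}


if __name__ == "__main__":
    # Invented placeholder records, not taken from the benchmark.
    toy_records = [
        ("rgb", 1, 1),
        ("rgb", 2, 2),
        ("visible-thermal", 1, 3),
        ("visible-thermal", 2, 2),
    ]
    print(accuracy_by_task(toy_records))
```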
The study also analyzed the correlation between different ReID tasks. It found that RGB image, group, synthetic, and UAV person ReID tasks are strongly correlated, suggesting similar underlying features. In contrast, tasks involving cloth-changing, sketches, image-text, and visible-thermal data showed weaker correlations with other tasks, indicating substantial modality gaps that require specialized attention in future MLLM development.
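One plausible way to measure how strongly two tasks are related is to correlate the accuracies that the same set of models achieves on each of them. The article does not say which correlation measure the study used, so the Pearson coefficient below is an assumption, and the scores in the usage example are invented placeholders.

```python
import math
from typing import Dict, List, Tuple


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return covariance / (spread_x * spread_y)


def task_correlations(scores: Dict[str, List[float]]) -> Dict[Tuple[str, str], float]:
    """Correlate every pair of tasks; each list holds one score per model."""
    tasks = list(scores)
    return {(a, b): pearson(scores[a], scores[b]) for a in tasks for b in tasks}


if __name__ == "__main__":
    # Invented placeholder accuracies for three hypothetical models.
    toy_scores = {
        "rgb": [90.0, 80.0, 70.0],
        "uav": [85.0, 78.0, 66.0],
        "visible-thermal": [50.0, 58.0, 41.0],
    }
    for pair, value in task_correlations(toy_scores).items():
        print(pair, round(value, 3))
```

A coefficient near 1 would mean that models which do well on one task also tend to do well on the other, while values closer to 0 would point to the kind of modality gap described above.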
An error analysis of GPT-4.1 on the visible-thermal task revealed that although the model could extract attributes, it sometimes overemphasized minor details and overlooked more significant cues, such as walking posture. This suggests a need for MLLMs to better prioritize and integrate different types of visual information.
Model Disparities and Real-World Applications
The researchers observed a disparity between proprietary and open-source models. While proprietary models generally performed well, open-source models are rapidly catching up in certain tasks. Interestingly, the study also found that larger models within the same series do not always outperform smaller ones and that performance is not consistently robust across tasks, which challenges the common assumption that performance reliably improves with model scale.
The impact of gallery size (the number of images to search through) was also investigated. GPT-4o was more robust to varying gallery sizes than Qwen2.5-VL-7B, maintaining consistently high performance. This robustness is crucial for real-world scenarios involving large surveillance databases.
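A gallery-size experiment of this kind can be sketched by building galleries that contain one true match plus a varying number of distractors. The sampling scheme and the helper name `build_gallery` below are assumptions for illustration. The article does not describe how the study constructed its galleries.

```python
import random
from typing import List, Sequence, Tuple


def build_gallery(
    target: str,
    distractors: Sequence[str],
    gallery_size: int,
    rng: random.Random,
) -> Tuple[List[str], int]:
    """Return a shuffled gallery and the 1-based position of the true match."""
    if gallery_size < 1 or gallery_size - 1 > len(distractors):
        raise ValueError("gallery_size must fit one target plus available distractors")
    # Draw distractors without replacement, then hide the target among them.
    gallery = rng.sample(list(distractors), gallery_size - 1) + [target]
    rng.shuffle(gallery)
    return gallery, gallery.index(target) + 1


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the galleries are reproducible
    # Hypothetical file names, used only for illustration.
    pool = [f"distractor_{i}.jpg" for i in range(20)]
    for size in (2, 4, 8, 16):
        gallery, answer = build_gallery("target.jpg", pool, size, rng)
        print(size, answer, len(gallery))
```

Running the same model on galleries of increasing size and recording its accuracy at each size gives a rough robustness curve of the kind the comparison above relies on.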
To showcase practical applicability, the researchers also assembled a video-based person ReID dataset from existing video datasets. They simulated a forensic scenario in which MLLMs analyze video clips and textual descriptions (such as witness testimonies) to identify suspects. Models from the Qwen2.5-VL family showed competitive performance in this demonstration, suggesting that MLLMs could assist real-world forensic investigations as evidence accumulates and descriptions become more detailed.
In conclusion, MMReID-Bench serves as a vital tool for evaluating and advancing MLLMs in person ReID. It highlights their impressive capabilities in many areas while clearly pointing out the challenges in cross-modal understanding, particularly with thermal and infrared data. The findings offer valuable insights for developing more robust and generalizable multi-modal foundation models for person ReID in the future.


