
Multi-Modal LLMs Outperform CNNs in Object Detection with Minimal Data

TLDR: A new study demonstrates that fine-tuning multi-modal Large Language Models (LLMs) like Phi-3.5 Vision on fewer than 1,000 images can significantly outperform traditional CNNs and zero-shot LLM approaches on complex object detection tasks, such as artificial text overlay detection. The fine-tuned LLM achieved accuracy up to 36 percentage points higher than the CNN baseline, along with notably higher precision, underscoring the efficiency and adaptability of language-guided models in low-resource settings.

The world of object detection, a cornerstone of computer vision, is undergoing a significant transformation. Traditionally, tasks like identifying objects in images have relied heavily on Convolutional Neural Networks (CNNs) such as ResNet and YOLO. While these models have been highly effective, recent advancements in multi-modal large language models (LLMs) are introducing new capabilities, including dynamic context reasoning and language-guided understanding.

However, simply using LLMs out-of-the-box often doesn’t unlock their full potential for specialized visual tasks. A new study by Nirmal Elamon and Rouzbeh Davoudi from Artificial Creative Intelligence (ACI) at Expedia Group explores how these powerful LLMs can be efficiently fine-tuned, even with very limited data, to achieve superior performance in object detection. Their work, titled “Beyond CNNs: Efficient Fine-Tuning of Multi-Modal LLMs for Object Detection on Low-Data Regimes,” was accepted to the 2025 IEEE International Conference on Content-Based Multimedia Indexing (CBMI).

The researchers focused on a particularly challenging task: detecting artificial text overlays in images. This is crucial for applications like digital media verification and content moderation, where distinguishing between naturally embedded text (like a street sign) and text artificially added to an image (like a promotional banner) is vital. Traditional CNNs often struggle with this nuance, frequently misclassifying natural text as artificial because they lack the deeper contextual understanding that LLMs can provide.

Comparing Different Approaches

The study conducted a comprehensive comparison across four different modeling strategies:

  • Traditional Fine-tuned CNNs: A conventional CNN architecture (OCR + ResNet) was fine-tuned on a large dataset of 10,000 images. This model integrates visual and textual cues, leveraging OCR outputs and positional metadata.
  • Zero-shot Pre-trained Multi-modal LLMs: A pre-trained LLM (Phi-3.5 Vision) was used without any specific fine-tuning, relying on its existing knowledge and a single descriptive prompt.
  • Zero-shot Pre-trained LLMs with Sequential Prompting: This approach extended the basic zero-shot LLM by using a sequence of prompts. First, it identified all text and objects, along with their spatial relationships, and then used this enriched context in a second prompt to refine the detection of artificial text overlays (a sketch of this prompt chain follows the list).
  • Fine-tuned Multi-modal LLMs: The Phi-3.5 Vision model was fine-tuned on a much smaller, domain-specific dataset of just 1,000 annotated images, training for only two epochs with binary cross-entropy loss (a training sketch also follows the list).
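To make the sequential-prompting idea concrete, here is a minimal sketch of the two-stage prompt chain, assuming the publicly available microsoft/Phi-3.5-vision-instruct checkpoint and the Hugging Face transformers API. The prompt wording and generation settings are illustrative guesses, not the authors' exact pipeline.

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

def ask(image, prompt):
    # Phi-3.5 Vision expects an <|image_1|> placeholder in its chat template.
    messages = [{"role": "user", "content": f"<|image_1|>\n{prompt}"}]
    text = processor.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(text, [image], return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs, max_new_tokens=256, do_sample=False,
        eos_token_id=processor.tokenizer.eos_token_id,
    )
    new_tokens = out[:, inputs["input_ids"].shape[1]:]  # strip the prompt
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]

image = Image.open("example.jpg")

# Stage 1: enumerate text, objects, and their spatial relationships.
context = ask(image, "List all text and objects in this image and describe "
                     "their spatial relationships.")

# Stage 2: feed that enriched context back in to refine the final decision.
verdict = ask(image, "Given this analysis of the image:\n" + context +
                     "\nIs any of the text an artificial overlay (added after "
                     "capture) rather than naturally part of the scene? "
                     "Answer yes or no.")
print(verdict)
```

As the results below show, chaining prompts this way can help precision somewhat, but it is no substitute for actually updating the model's weights.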
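The fine-tuning setup in the last bullet can likewise be sketched in a few lines. Everything beyond what the article states is an illustrative assumption: the single-logit classification head, last-token pooling, the AdamW settings, and the `train_set` variable, which is assumed to be a torch Dataset yielding preprocessed model inputs plus a 0/1 label for each of the ~1,000 annotated images.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM

# Pre-trained backbone; its weights are updated during fine-tuning.
backbone = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-vision-instruct",
    trust_remote_code=True, torch_dtype=torch.bfloat16,
)

class OverlayClassifier(nn.Module):
    """Phi-3.5 Vision with a single logit: artificial overlay vs. not."""
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(backbone.config.hidden_size, 1)

    def forward(self, **inputs):
        out = self.backbone(**inputs, output_hidden_states=True)
        pooled = out.hidden_states[-1][:, -1, :]  # last-token pooling (one simple choice)
        return self.head(pooled.float()).squeeze(-1)

model = OverlayClassifier(backbone).cuda()
criterion = nn.BCEWithLogitsLoss()  # binary cross-entropy on the raw logit
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# `train_set` is assumed, as described above; batch_size=1 keeps the
# sketch free of a custom collate_fn for variable-length inputs.
loader = DataLoader(train_set, batch_size=1, shuffle=True)

model.train()
for epoch in range(2):  # two epochs, as in the paper
    for batch in loader:
        labels = batch.pop("labels").float().cuda()
        logits = model(**{k: v.cuda() for k, v in batch.items()})
        loss = criterion(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Training a single binary logit with BCEWithLogitsLoss matches the binary cross-entropy objective the study mentions; the paper's exact head and pooling may differ.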

Key Findings and Performance

The results were striking. The fine-tuned LLM demonstrated significantly superior performance across all evaluation metrics: precision, recall, and accuracy. It achieved a precision of 0.98, recall of 0.84, and an accuracy of 0.83. This highlights that even with a relatively small dataset, task-specific adaptation of LLMs is incredibly effective.

In contrast, the pre-trained LLM without fine-tuning performed notably worse, with an accuracy of 0.60. While it had a high recall (0.80), indicating it could detect many instances of overlaid text, its poor precision (0.66) meant it frequently misclassified natural text as artificial. The sequential prompting variant showed some improvement in precision (0.75) but still had low recall (0.51) and overall accuracy (0.54), indicating the limitations of prompt chaining without actual model weight updates.
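To see how these three metrics interact, the toy computation below reproduces the zero-shot numbers (to rounding): when natural text is frequently flagged as artificial, false positives pile up, so recall stays high while precision and accuracy fall. The labels here are fabricated for illustration and are not data from the study.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 1 = artificial overlay, 0 = naturally embedded text (15 toy images)
y_true = [1] * 10 + [0] * 5
# 8 true positives, 2 false negatives, 4 false positives, 1 true negative
y_pred = [1] * 8 + [0] * 2 + [1] * 4 + [0] * 1

print(precision_score(y_true, y_pred))  # 8 / (8 + 4) ≈ 0.67
print(recall_score(y_true, y_pred))     # 8 / (8 + 2) = 0.80
print(accuracy_score(y_true, y_pred))   # 9 / 15 = 0.60
```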

Perhaps the most compelling finding was the performance of the traditional CNN model. Despite being trained on a dataset ten times larger (10,000 images), it performed the worst, with an accuracy of 0.47 and precision of 0.54. This underscores its limited capacity to reason over complex semantic relationships and contextual cues, which are essential for distinguishing between natural and artificial text.

The study also compared the fine-tuned Phi-3.5 Vision with another powerful off-the-shelf LLM, Qwen2.5-VL-7B-Instruct, which achieved an accuracy of 0.78. While this comfortably exceeded the zero-shot baselines, it still fell short of the fine-tuned Phi-3.5 Vision, further emphasizing the value of lightweight fine-tuning for domain-specific tasks.


Implications for the Future

This research demonstrates that fine-tuning multi-modal LLMs on fewer than 1,000 images can improve accuracy by up to 36 percentage points over a CNN baseline (0.83 versus 0.47), along with significantly higher precision. This not only reduces false positives but also ensures more reliable detection. The findings highlight the adaptability and data efficiency of language-guided models, especially in environments with limited data resources.

The proposed fine-tuning strategy is broadly applicable beyond artificial text overlay detection and can be extended to other complex vision-language tasks requiring nuanced contextual understanding. This work offers a practical and scalable approach to bridging vision and language with minimal supervision. You can find more details about their work and access the code used to fine-tune the models at their GitHub repository. Read the full paper here.

Ananya Rao
https://blogs.edgentiq.com
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
