
Multi-Modal LLMs Outperform CNNs in Object Detection with Minimal Data

TLDR: A new study demonstrates that fine-tuning multi-modal Large Language Models (LLMs) like Phi-3.5 Vision on fewer than 1,000 images can significantly outperform traditional CNNs and zero-shot LLM approaches on complex object detection tasks, such as artificial text overlay detection. The fine-tuned LLM achieved accuracy up to 36 percentage points higher than the CNN baseline, along with notably higher precision, underscoring the efficiency and adaptability of language-guided models in low-resource settings.

The world of object detection, a cornerstone of computer vision, is undergoing a significant transformation. Traditionally, tasks like identifying objects in images have relied heavily on Convolutional Neural Networks (CNNs) such as ResNet and YOLO. While these models have been highly effective, recent advancements in multi-modal large language models (LLMs) are introducing new capabilities, including dynamic context reasoning and language-guided understanding.

However, simply using LLMs out-of-the-box often doesn’t unlock their full potential for specialized visual tasks. A new study by Nirmal Elamon and Rouzbeh Davoudi from Artificial Creative Intelligence (ACI) at Expedia Group explores how these powerful LLMs can be efficiently fine-tuned, even with very limited data, to achieve superior performance in object detection. Their work, titled “Beyond CNNs: Efficient Fine-Tuning of Multi-Modal LLMs for Object Detection on Low-Data Regimes,” was accepted to the 2025 IEEE International Conference on Content-Based Multimedia Indexing (CBMI).

The researchers focused on a particularly challenging task: detecting artificial text overlays in images. This is crucial for applications like digital media verification and content moderation, where distinguishing between naturally embedded text (like a street sign) and text artificially added to an image (like a promotional banner) is vital. Traditional CNNs often struggle with this nuance, frequently misclassifying natural text as artificial because they lack the deeper contextual understanding that LLMs can provide.

Comparing Different Approaches

The study conducted a comprehensive comparison across four different modeling strategies:

  • Traditional Fine-tuned CNNs: A conventional CNN architecture (OCR + ResNet) was fine-tuned on a large dataset of 10,000 images. This model integrates visual and textual cues, leveraging OCR outputs and positional metadata.
  • Zero-shot Pre-trained Multi-modal LLMs: A pre-trained LLM (Phi-3.5 Vision) was used without any specific fine-tuning, relying on its existing knowledge and a single descriptive prompt.
  • Zero-shot Pre-trained LLMs with Sequential Prompting: This approach extended the basic zero-shot LLM by using a sequence of prompts. First, it identified all text and objects, along with their spatial relationships, and then used this enriched context in a second prompt to refine the detection of artificial text overlays (a sketch of this prompt chain follows the list).
  • Fine-tuned Multi-modal LLMs: The Phi-3.5 Vision model was fine-tuned on a much smaller, domain-specific dataset of just 1,000 annotated images, training for only two epochs with binary cross-entropy loss (a training sketch also follows the list).
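To make the sequential-prompting idea concrete, here is a minimal sketch of the two-stage prompt chain, assuming the publicly available microsoft/Phi-3.5-vision-instruct checkpoint and the Hugging Face transformers API. The prompt wording and generation settings are illustrative guesses, not the authors' exact pipeline.

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

def ask(image, prompt):
    # Phi-3.5 Vision expects an <|image_1|> placeholder in its chat template.
    messages = [{"role": "user", "content": f"<|image_1|>\n{prompt}"}]
    text = processor.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(text, [image], return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs, max_new_tokens=256, do_sample=False,
        eos_token_id=processor.tokenizer.eos_token_id,
    )
    new_tokens = out[:, inputs["input_ids"].shape[1]:]  # strip the prompt
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]

image = Image.open("example.jpg")

# Stage 1: enumerate text, objects, and their spatial relationships.
context = ask(image, "List all text and objects in this image and describe "
                     "their spatial relationships.")

# Stage 2: feed that enriched context back in to refine the final decision.
verdict = ask(image, "Given this analysis of the image:\n" + context +
                     "\nIs any of the text an artificial overlay (added after "
                     "capture) rather than naturally part of the scene? "
                     "Answer yes or no.")
print(verdict)
```

As the results below show, chaining prompts this way can help precision somewhat, but it is no substitute for actually updating the model's weights.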
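The fine-tuning setup in the last bullet can likewise be sketched in a few lines. Everything beyond what the article states is an illustrative assumption: the single-logit classification head, last-token pooling, the AdamW settings, and the `train_set` variable, which is assumed to be a torch Dataset yielding preprocessed model inputs plus a 0/1 label for each of the ~1,000 annotated images.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM

# Pre-trained backbone; its weights are updated during fine-tuning.
backbone = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-vision-instruct",
    trust_remote_code=True, torch_dtype=torch.bfloat16,
)

class OverlayClassifier(nn.Module):
    """Phi-3.5 Vision with a single logit: artificial overlay vs. not."""
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(backbone.config.hidden_size, 1)

    def forward(self, **inputs):
        out = self.backbone(**inputs, output_hidden_states=True)
        pooled = out.hidden_states[-1][:, -1, :]  # last-token pooling (one simple choice)
        return self.head(pooled.float()).squeeze(-1)

model = OverlayClassifier(backbone).cuda()
criterion = nn.BCEWithLogitsLoss()  # binary cross-entropy on the raw logit
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# `train_set` is assumed, as described above; batch_size=1 keeps the
# sketch free of a custom collate_fn for variable-length inputs.
loader = DataLoader(train_set, batch_size=1, shuffle=True)

model.train()
for epoch in range(2):  # two epochs, as in the paper
    for batch in loader:
        labels = batch.pop("labels").float().cuda()
        logits = model(**{k: v.cuda() for k, v in batch.items()})
        loss = criterion(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Training a single binary logit with BCEWithLogitsLoss matches the binary cross-entropy objective the study mentions; the paper's exact head and pooling may differ.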

Key Findings and Performance

The results were striking. The fine-tuned LLM demonstrated significantly superior performance across all evaluation metrics: precision, recall, and accuracy. It achieved a precision of 0.98, recall of 0.84, and an accuracy of 0.83. This highlights that even with a relatively small dataset, task-specific adaptation of LLMs is incredibly effective.

In contrast, the pre-trained LLM without fine-tuning performed notably worse, with an accuracy of 0.60. While it had a high recall (0.80), indicating it could detect many instances of overlaid text, its poor precision (0.66) meant it frequently misclassified natural text as artificial. The sequential prompting variant showed some improvement in precision (0.75) but still had low recall (0.51) and overall accuracy (0.54), indicating the limitations of prompt chaining without actual model weight updates.
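To see how these three metrics interact, the toy computation below reproduces the zero-shot numbers (to rounding): when natural text is frequently flagged as artificial, false positives pile up, so recall stays high while precision and accuracy fall. The labels here are fabricated for illustration and are not data from the study.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 1 = artificial overlay, 0 = naturally embedded text (15 toy images)
y_true = [1] * 10 + [0] * 5
# 8 true positives, 2 false negatives, 4 false positives, 1 true negative
y_pred = [1] * 8 + [0] * 2 + [1] * 4 + [0] * 1

print(precision_score(y_true, y_pred))  # 8 / (8 + 4) ≈ 0.67
print(recall_score(y_true, y_pred))     # 8 / (8 + 2) = 0.80
print(accuracy_score(y_true, y_pred))   # 9 / 15 = 0.60
```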

Perhaps the most compelling finding was the performance of the traditional CNN model. Despite being trained on a dataset ten times larger (10,000 images), it performed the worst, with an accuracy of 0.47 and precision of 0.54. This underscores its limited capacity to reason over complex semantic relationships and contextual cues, which are essential for distinguishing between natural and artificial text.

The study also compared the fine-tuned Phi-3.5 Vision with another powerful off-the-shelf LLM, Qwen2.5-VL-7B-Instruct, which achieved an accuracy of 0.78. While this comfortably exceeded the zero-shot baselines, it still fell short of the fine-tuned Phi-3.5 Vision, further emphasizing the value of lightweight fine-tuning for domain-specific tasks.


Implications for the Future

This research demonstrates that fine-tuning multi-modal LLMs on fewer than 1,000 images can improve accuracy by up to 36 percentage points over a CNN baseline (0.83 versus 0.47), along with significantly higher precision. This not only reduces false positives but also ensures more reliable detection. The findings highlight the adaptability and data efficiency of language-guided models, especially in environments with limited data resources.

The proposed fine-tuning strategy is broadly applicable beyond artificial text overlay detection and can be extended to other complex vision-language tasks requiring nuanced contextual understanding. This work offers a practical and scalable approach to bridging vision and language with minimal supervision. You can find more details about their work and access the code used to fine-tune the models at their GitHub repository. Read the full paper here.

Ananya Rao
https://blogs.edgentiq.com
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
