TLDR: A study compared Convolutional Neural Networks (CNNs) and Large Language Models (LLMs) for brain tumor classification and segmentation on MRI scans. CNNs consistently outperformed LLMs in both tasks, demonstrating better accuracy, robustness, and spatial understanding. LLMs, even after fine-tuning, showed limitations in differentiating tumor types and accurately localizing them, suggesting they are not yet well-suited for image-based medical tasks in their current form.
A recent study examines the effectiveness of Large Language Models (LLMs) for medical imaging tasks, specifically the classification and segmentation of brain tumors from MRI scans. The research, titled "A Comparison and Evaluation of Fine-tuned Convolutional Neural Networks to Large Language Models for Image Classification and Segmentation of Brain Tumors on MRI," was conducted by Felicia Liu, Jay J. Yoo, and Farzad Khalvati. It compares the performance of these LLMs against traditional Convolutional Neural Networks (CNNs), which have long been a staple of image-based healthcare applications.
While LLMs have shown remarkable capabilities in text-based healthcare tasks, their utility in analyzing medical images has remained largely unexplored. This paper addresses that gap by investigating how a general-purpose vision-language LLM (LLaMA 3.2 Instruct) performs on glioma classification (distinguishing between Low-Grade and High-Grade tumors) and segmentation (identifying the tumor’s exact location and shape) using the BraTS 2020 dataset of multi-modal brain MRIs.
Methodology Overview
The researchers used the BraTS 2020 dataset, which includes T1-weighted, contrast-enhanced T1-weighted, T2-weighted, and Fluid-Attenuated Inversion Recovery (FLAIR) MRI sequences, along with expert-annotated tumor masks. The CNN models processed full 3D scans across all four imaging modalities, allowing them to capture comprehensive spatial information. In contrast, the LLM approach was limited to 2D axial slices, primarily from the FLAIR modality, due to its architectural constraints. Each slice was classified individually, and the patient-level prediction was determined by a majority vote across slices.
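As a concrete illustration of this aggregation step, here is a minimal sketch in Python; `classify_slice` is a hypothetical stand-in for whichever slice-level model (CNN or LLM) produces the per-slice label, not a function from the paper:

```python
from collections import Counter

def predict_patient(slices, classify_slice):
    """Aggregate per-slice labels into one patient-level label by majority vote.

    `classify_slice` is a hypothetical stand-in for the slice-level model;
    it returns "LGG" or "HGG" for a single 2D axial slice.
    """
    labels = [classify_slice(s) for s in slices]
    # Counter.most_common(1) returns [(label, count)] for the most frequent label
    return Counter(labels).most_common(1)[0][0]
```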
For classification, the goal was to label each patient as either Low-Grade Glioma (LGG) or High-Grade Glioma (HGG). For segmentation, three methods were explored for the LLM: predicting the tumor’s center point, drawing a bounding box around it, and tracing a detailed bounding polygon.
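To score those three output formats against pixel-wise ground truth, each prediction has to be rasterized into a binary mask. A minimal sketch of how such a conversion might look (the helper functions, the (row, col) coordinate convention, and the disk radius for the center-point method are illustrative assumptions, not details taken from the paper):

```python
import numpy as np
from skimage.draw import disk, polygon

def point_to_mask(center, shape, radius=5):
    """Rasterize a predicted tumor center point into a small disk-shaped mask.

    The radius is an arbitrary choice for illustration; the paper does not
    prescribe one here.
    """
    mask = np.zeros(shape, dtype=bool)
    rr, cc = disk(center, radius, shape=shape)
    mask[rr, cc] = True
    return mask

def box_to_mask(r_min, c_min, r_max, c_max, shape):
    """Rasterize a predicted bounding box into a binary mask."""
    mask = np.zeros(shape, dtype=bool)
    mask[r_min:r_max + 1, c_min:c_max + 1] = True
    return mask

def polygon_to_mask(vertices, shape):
    """Rasterize a bounding polygon, given as a list of (row, col) vertices."""
    rows, cols = zip(*vertices)
    rr, cc = polygon(rows, cols, shape=shape)
    mask = np.zeros(shape, dtype=bool)
    mask[rr, cc] = True
    return mask
```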
Key Findings in Image Classification
The CNN baseline model demonstrated strong and balanced performance in classifying gliomas, achieving 80% accuracy in testing and effectively differentiating between LGG and HGG. It showed good precision and recall, indicating its reliability in identifying both tumor types.
The general LLM, LLaMA 3.2 Instruct, initially showed high accuracy (up to 80%) and recall. However, these metrics were misleading due to a significant class imbalance in the dataset, where HGG cases were far more prevalent. The LLM predominantly predicted HGG for almost all samples, resulting in a very low specificity (as low as 0%), meaning it struggled to correctly identify LGG tumors. It essentially defaulted to the majority class.
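This failure mode is easy to reproduce numerically. The toy labels below are illustrative only, not the study's data: with an 80/20 HGG/LGG split, a model that always predicts HGG scores 80% accuracy and perfect sensitivity while its specificity is exactly zero:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative toy labels only (not the study's data): an imbalanced set
# where the model predicts the majority class (HGG = 1) for every sample.
y_true = np.array([1] * 8 + [0] * 2)   # 8 HGG, 2 LGG
y_pred = np.ones(10, dtype=int)        # always predict HGG

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
accuracy    = (tp + tn) / (tp + tn + fp + fn)   # 0.8 despite learning nothing
sensitivity = tp / (tp + fn)                    # 1.0 (every HGG case caught)
specificity = tn / (tn + fp)                    # 0.0 (no LGG case identified)
print(accuracy, sensitivity, specificity)
```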
Fine-tuning the LLM for classification yielded only marginal improvements. While specificity improved slightly (to 50% in one fine-tuned model), overall performance often declined, and the models still struggled to clearly distinguish between HGG and LGG. Consistency tests further revealed instability in the LLM’s predictions, even with identical inputs, suggesting it hadn’t learned a robust decision boundary.
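A consistency check of this kind can be as simple as re-running the model on an identical input and measuring agreement. A hedged sketch, again using a hypothetical `classify_slice` function rather than the study's actual evaluation code:

```python
from collections import Counter

def consistency_rate(classify_slice, slice_2d, n_runs=10):
    """Fraction of repeated runs that agree with the modal prediction.

    A deterministic model scores 1.0; anything lower means the prediction
    varies across identical inputs (e.g., due to sampling in generation).
    """
    labels = [classify_slice(slice_2d) for _ in range(n_runs)]
    modal_count = Counter(labels).most_common(1)[0][1]
    return modal_count / n_runs
```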
Key Findings in Image Segmentation
The CNN baseline model showed solid performance in segmenting gliomas, achieving a Dice coefficient of 0.5942 (a measure of overlap between the predicted and actual tumor areas). It accurately segmented larger gliomas, capturing their boundaries and general shapes well. However, it struggled with smaller or irregularly shaped tumors, sometimes over- or underestimating their extent, or segmenting broader brain regions instead.
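For reference, the Dice coefficient between a predicted mask A and a ground-truth mask B is 2|A ∩ B| / (|A| + |B|), ranging from 0 (no overlap) to 1 (perfect overlap). A minimal NumPy implementation for binary masks:

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-8):
    """Dice overlap between two binary masks: 2|A ∩ B| / (|A| + |B|).

    Returns 1.0 for perfect overlap and 0.0 for none; `eps` guards
    against division by zero when both masks are empty.
    """
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return 2.0 * intersection / (pred.sum() + target.sum() + eps)
```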
The general LLM’s performance in all three segmentation tasks (center point, bounding box, and bounding polygon) was notably poor. Predicted points and boxes were often scattered or clustered near the image center, showing little to no sensitivity to the actual tumor’s size, location, or shape. Dice coefficients were very low (e.g., 0.1219 for bounding box, 0.0412 for bounding polygon), indicating minimal overlap with true tumor areas. Hausdorff distances were high, reflecting significant localization errors.
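The Hausdorff distance complements Dice by measuring the worst-case distance between the two masks' pixel sets, so large values indicate gross localization errors. A minimal sketch built on SciPy's `directed_hausdorff`; the wrapper function itself is illustrative, not from the paper:

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff_distance(pred_mask, true_mask):
    """Symmetric Hausdorff distance between two binary masks, in pixels."""
    u = np.argwhere(pred_mask)  # (row, col) coordinates of predicted pixels
    v = np.argwhere(true_mask)  # (row, col) coordinates of ground-truth pixels
    # directed_hausdorff returns (distance, index_u, index_v); take the max
    # of both directions for the symmetric distance.
    return max(directed_hausdorff(u, v)[0], directed_hausdorff(v, u)[0])
```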
Fine-tuning the LLM for segmentation also resulted in minimal improvement. Predictions remained largely random and centered, failing to align with actual tumor locations or sizes. While the models sometimes produced more consistent output formats after fine-tuning, the accuracy of the segmentations did not meaningfully improve.
Conclusion and Future Outlook
The study concludes that, in their current form, traditional Convolutional Neural Networks significantly outperformed Large Language Models in both brain tumor classification and segmentation tasks. CNNs demonstrated superior accuracy, robustness, and the ability to handle class imbalances and complex spatial relationships in medical image data. LLMs, despite their versatility, showed limited spatial understanding and minimal benefits from the fine-tuning strategies employed in this research.
The findings underscore the importance of domain-specific models for specialized medical imaging tasks. While LLMs hold potential for healthcare AI, more rigorous fine-tuning, larger datasets, and alternative training strategies that better integrate spatial and multi-modal information will be necessary for them to achieve comparable performance and utility in image-based medical applications.


