TLDR: A study compared Convolutional Neural Networks (CNNs) and Large Language Models (LLMs) for brain tumor classification and segmentation on MRI scans. CNNs consistently outperformed LLMs in both tasks, demonstrating better accuracy, robustness, and spatial understanding. LLMs, even after fine-tuning, showed limitations in differentiating tumor types and accurately localizing them, suggesting they are not yet well-suited for image-based medical tasks in their current form.
A recent study examines the effectiveness of Large Language Models (LLMs) for medical imaging tasks, specifically the classification and segmentation of brain tumors from MRI scans. The research, titled "A Comparison and Evaluation of Fine-tuned Convolutional Neural Networks to Large Language Models for Image Classification and Segmentation of Brain Tumors on MRI," was conducted by Felicia Liu, Jay J. Yoo, and Farzad Khalvati. It compares the performance of these LLMs against traditional Convolutional Neural Networks (CNNs), which have long been a staple of image-based healthcare applications.
While LLMs have shown remarkable capabilities in text-based healthcare tasks, their utility in analyzing medical images has remained largely unexplored. This paper addresses that gap by investigating how a general-purpose vision-language LLM (LLaMA 3.2 Instruct) performs on glioma classification (distinguishing between Low-Grade and High-Grade tumors) and segmentation (identifying the tumor’s exact location and shape) using the BraTS 2020 dataset of multi-modal brain MRIs.
Methodology Overview
The researchers used the BraTS 2020 dataset, which includes T1-weighted, contrast-enhanced T1-weighted, T2-weighted, and Fluid-Attenuated Inversion Recovery (FLAIR) MRI sequences, along with expert-annotated tumor masks. The CNN models processed full 3D scans across all four imaging modalities, allowing them to capture comprehensive spatial information. In contrast, the LLM approach was limited to 2D axial slices, primarily from the FLAIR modality, due to its architectural constraints. Each slice was classified individually, and the patient-level prediction was determined by a majority vote across slices.
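As a concrete illustration of this aggregation step, here is a minimal sketch in Python; `classify_slice` is a hypothetical stand-in for whichever slice-level model (CNN or LLM) produces the per-slice label, not a function from the paper:

```python
from collections import Counter

def predict_patient(slices, classify_slice):
    """Aggregate per-slice labels into one patient-level label by majority vote.

    `classify_slice` is a hypothetical stand-in for the slice-level model;
    it returns "LGG" or "HGG" for a single 2D axial slice.
    """
    labels = [classify_slice(s) for s in slices]
    # Counter.most_common(1) returns [(label, count)] for the most frequent label
    return Counter(labels).most_common(1)[0][0]
```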
For classification, the goal was to label each patient as either Low-Grade Glioma (LGG) or High-Grade Glioma (HGG). For segmentation, three methods were explored for the LLM: predicting the tumor’s center point, drawing a bounding box around it, and tracing a detailed bounding polygon.
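To score those three output formats against pixel-wise ground truth, each prediction has to be rasterized into a binary mask. A minimal sketch of how such a conversion might look (the helper functions, the (row, col) coordinate convention, and the disk radius for the center-point method are illustrative assumptions, not details taken from the paper):

```python
import numpy as np
from skimage.draw import disk, polygon

def point_to_mask(center, shape, radius=5):
    """Rasterize a predicted tumor center point into a small disk-shaped mask.

    The radius is an arbitrary choice for illustration; the paper does not
    prescribe one here.
    """
    mask = np.zeros(shape, dtype=bool)
    rr, cc = disk(center, radius, shape=shape)
    mask[rr, cc] = True
    return mask

def box_to_mask(r_min, c_min, r_max, c_max, shape):
    """Rasterize a predicted bounding box into a binary mask."""
    mask = np.zeros(shape, dtype=bool)
    mask[r_min:r_max + 1, c_min:c_max + 1] = True
    return mask

def polygon_to_mask(vertices, shape):
    """Rasterize a bounding polygon, given as a list of (row, col) vertices."""
    rows, cols = zip(*vertices)
    rr, cc = polygon(rows, cols, shape=shape)
    mask = np.zeros(shape, dtype=bool)
    mask[rr, cc] = True
    return mask
```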
Key Findings in Image Classification
The CNN baseline model demonstrated strong and balanced performance in classifying gliomas, achieving 80% accuracy in testing and effectively differentiating between LGG and HGG. It showed good precision and recall, indicating its reliability in identifying both tumor types.
The general LLM, LLaMA 3.2 Instruct, initially showed high accuracy (up to 80%) and recall. However, these metrics were misleading due to a significant class imbalance in the dataset, where HGG cases were far more prevalent. The LLM predominantly predicted HGG for almost all samples, resulting in a very low specificity (as low as 0%), meaning it struggled to correctly identify LGG tumors. It essentially defaulted to the majority class.
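This failure mode is easy to reproduce numerically. The toy labels below are illustrative only, not the study's data: with an 80/20 HGG/LGG split, a model that always predicts HGG scores 80% accuracy and perfect sensitivity while its specificity is exactly zero:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative toy labels only (not the study's data): an imbalanced set
# where the model predicts the majority class (HGG = 1) for every sample.
y_true = np.array([1] * 8 + [0] * 2)   # 8 HGG, 2 LGG
y_pred = np.ones(10, dtype=int)        # always predict HGG

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
accuracy    = (tp + tn) / (tp + tn + fp + fn)   # 0.8 despite learning nothing
sensitivity = tp / (tp + fn)                    # 1.0 (every HGG case caught)
specificity = tn / (tn + fp)                    # 0.0 (no LGG case identified)
print(accuracy, sensitivity, specificity)
```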
Fine-tuning the LLM for classification yielded only marginal improvements. While specificity improved slightly (to 50% in one fine-tuned model), overall performance often declined, and the models still struggled to clearly distinguish between HGG and LGG. Consistency tests further revealed instability in the LLM’s predictions, even with identical inputs, suggesting it hadn’t learned a robust decision boundary.
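A consistency check of this kind can be as simple as re-running the model on an identical input and measuring agreement. A hedged sketch, again using a hypothetical `classify_slice` function rather than the study's actual evaluation code:

```python
from collections import Counter

def consistency_rate(classify_slice, slice_2d, n_runs=10):
    """Fraction of repeated runs that agree with the modal prediction.

    A deterministic model scores 1.0; anything lower means the prediction
    varies across identical inputs (e.g., due to sampling in generation).
    """
    labels = [classify_slice(slice_2d) for _ in range(n_runs)]
    modal_count = Counter(labels).most_common(1)[0][1]
    return modal_count / n_runs
```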
Key Findings in Image Segmentation
The CNN baseline model showed solid performance in segmenting gliomas, achieving a Dice coefficient of 0.5942 (a measure of overlap between the predicted and actual tumor areas). It accurately segmented larger gliomas, capturing their boundaries and general shapes well. However, it struggled with smaller or irregularly shaped tumors, sometimes over- or underestimating their extent, or segmenting broader brain regions instead.
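For reference, the Dice coefficient between a predicted mask A and a ground-truth mask B is 2|A ∩ B| / (|A| + |B|), ranging from 0 (no overlap) to 1 (perfect overlap). A minimal NumPy implementation for binary masks:

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-8):
    """Dice overlap between two binary masks: 2|A ∩ B| / (|A| + |B|).

    Returns 1.0 for perfect overlap and 0.0 for none; `eps` guards
    against division by zero when both masks are empty.
    """
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return 2.0 * intersection / (pred.sum() + target.sum() + eps)
```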
The general LLM’s performance in all three segmentation tasks (center point, bounding box, and bounding polygon) was notably poor. Predicted points and boxes were often scattered or clustered near the image center, showing little to no sensitivity to the actual tumor’s size, location, or shape. Dice coefficients were very low (e.g., 0.1219 for bounding box, 0.0412 for bounding polygon), indicating minimal overlap with true tumor areas. Hausdorff distances were high, reflecting significant localization errors.
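The Hausdorff distance complements Dice by measuring the worst-case distance between the two masks' pixel sets, so large values indicate gross localization errors. A minimal sketch built on SciPy's `directed_hausdorff`; the wrapper function itself is illustrative, not from the paper:

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff_distance(pred_mask, true_mask):
    """Symmetric Hausdorff distance between two binary masks, in pixels."""
    u = np.argwhere(pred_mask)  # (row, col) coordinates of predicted pixels
    v = np.argwhere(true_mask)  # (row, col) coordinates of ground-truth pixels
    # directed_hausdorff returns (distance, index_u, index_v); take the max
    # of both directions for the symmetric distance.
    return max(directed_hausdorff(u, v)[0], directed_hausdorff(v, u)[0])
```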
Fine-tuning the LLM for segmentation also resulted in minimal improvement. Predictions remained largely random and centered, failing to align with actual tumor locations or sizes. While the models sometimes produced more consistent output formats after fine-tuning, the accuracy of the segmentations did not meaningfully improve.
Conclusion and Future Outlook
The study concludes that, in their current form, traditional Convolutional Neural Networks significantly outperformed Large Language Models in both brain tumor classification and segmentation tasks. CNNs demonstrated superior accuracy, robustness, and the ability to handle class imbalances and complex spatial relationships in medical image data. LLMs, despite their versatility, showed limited spatial understanding and minimal benefits from the fine-tuning strategies employed in this research.
The findings underscore the importance of domain-specific models for specialized medical imaging tasks. While LLMs hold potential for healthcare AI, more rigorous fine-tuning, larger datasets, and alternative training strategies that better integrate spatial and multi-modal information will be necessary for them to achieve comparable performance and utility in image-based medical applications.


