TLDR: A study evaluated three commercial large language model (LLM) tools for summarizing veterinary oncology records. Under an LLM-as-a-judge framework, Product 1 (Hachiko), a proprietary veterinary-specific model, outperformed two other commercial platforms by a wide margin (median weighted score of 4.61 versus 2.55 and 2.45) across factual accuracy, completeness, chronological order, clinical relevance, and organization, which points to the importance of domain-specific training for LLMs in veterinary medicine. The evaluation framework itself also proved highly reproducible.
Large Language Models (LLMs) are rapidly transforming various professional fields, including healthcare, where they promise faster information retrieval, decision support, and automated content generation. However, the application and effectiveness of these tools in veterinary medicine have remained largely unexplored, especially their ability to accurately understand and generate language within the unique context of veterinary terminology and clinical scenarios.
A recent study, titled “Context Matters: Comparison of commercial large language tools in veterinary medicine,” aimed to bridge this knowledge gap by evaluating and comparing three commercially available LLM-powered tools designed for summarization tasks in veterinary medicine. The research was conducted by Tyler J Poore, Christopher J Pinard, Aleena Shabbir, Andrew Lagree, Andre Telfer, and Kuan-Chuen Wu.
The Challenge of General-Purpose LLMs in Veterinary Contexts
The study highlights a critical issue: while many commercial LLM platforms are emerging for veterinary applications, a comprehensive framework for comparing their performance has been lacking. Existing models often rely on prompt engineering of publicly available general-purpose LLMs or integration with validated vector databases, rather than being specifically trained on veterinary data. Previous attempts to adapt human medical models to veterinary records have shown poor results, underscoring the need for domain-specific solutions.
Methodology: An LLM-as-a-Judge Approach
To evaluate the tools, researchers used 42 de-identified veterinary oncology records. These records were submitted to three commercial summarization platforms: Product 1 (Hachiko), a proprietary veterinary-specific language model pipeline developed with domain-specific training, and two other commercial platforms (Product 2 and Product 3) that offer PDF medical record summarization capabilities. For consistency, the “Standard” summary option was selected for each platform, and a uniform instruction was used where custom prompts were allowed: “Could you provide a detailed medical history of the patient, including the age, breed, all diagnoses, bloodwork, and test results?”
An automated grading system was developed using Google’s Gemini 2.5 Pro model as an impartial judge. This LLM judge evaluated each summary against five weighted criteria, developed in consultation with a board-certified veterinary clinician: Factual Accuracy (weight 2.5), Completeness (weight 1.2), Chronological Order (weight 1.0), Clinical Relevance (weight 1.5), and Organization (weight 0.8). Each criterion was scored on a scale of 1 (Poor) to 5 (Excellent). To ensure the reliability of the grading framework, the entire dataset was run in triplicate, and the standard deviations of the scoring outputs were evaluated.
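The weighting scheme above can be sketched in a few lines of Python. The weights are the ones reported in the study; the normalization (dividing by the sum of the weights, 7.0, so the result stays on the 1 to 5 scale) is an assumption that is consistent with the reported medians but is not spelled out in this summary. The function name, the dictionary keys, and the example scores are hypothetical.

```python
# Criterion weights as reported in the study.
WEIGHTS = {
    "factual_accuracy": 2.5,
    "completeness": 1.2,
    "chronological_order": 1.0,
    "clinical_relevance": 1.5,
    "organization": 0.8,
}


def weighted_score(scores):
    """Return the weighted mean of the five 1-5 criterion scores.

    Assumption: the overall score is sum(weight * score) / sum(weights),
    which keeps the result on the same 1-5 scale as the criteria.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    for name in WEIGHTS:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} must be between 1 and 5")
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return total / sum(WEIGHTS.values())


# Hypothetical judge output for one summary (not data from the study).
example = {
    "factual_accuracy": 5,
    "completeness": 4,
    "chronological_order": 5,
    "clinical_relevance": 4,
    "organization": 4,
}
print(round(weighted_score(example), 2))  # 4.5
```

Because Factual Accuracy carries more than a third of the total weight (2.5 of 7.0), a one-point drop in that criterion lowers the overall score more than a one-point drop in any other.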
Key Findings: Domain-Specific Training Makes a Difference
The results demonstrated a significant difference in performance among the evaluated platforms:
- Product 1 (Hachiko) achieved the highest overall performance with a median weighted score of 4.61 (Interquartile Range, IQR: 0.73). It also received perfect median scores in Factual Accuracy and Chronological Order, and consistently high scores across all other categories (Clinical Relevance, Completeness, and Organization).
- Product 2 had a median score of 2.55 (IQR: 0.78), showing moderate performance.
- Product 3 scored lowest, with a median of 2.45 (IQR: 0.92). It showed a wider spread of scores and inconsistent outputs, and struggled particularly with Chronological Order and Organization.
Product 1 also had the smallest IQR (0.73, versus 0.78 for Product 2 and 0.92 for Product 3), indicating the most consistent performance of the three. Furthermore, the LLM-based grading framework itself demonstrated high internal consistency and reproducibility, with low standard deviations across all platforms and categories, bolstering confidence in its utility for benchmarking veterinary LLM-generated summaries.
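For readers who want to reproduce this kind of summary on their own scores, here is a minimal sketch of the two statistics mentioned above: the median with IQR per platform, and the standard deviation across repeated judge runs. It uses Python's standard `statistics` module. The quartile method ("inclusive") and the choice of the sample standard deviation are assumptions, since this summary does not say which variants the authors used, and the function names are hypothetical.

```python
import statistics


def median_and_iqr(values):
    """Return (median, IQR) for a list of weighted scores.

    Assumption: quartiles use the 'inclusive' method; other methods
    give slightly different IQRs on small samples.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return statistics.median(values), q3 - q1


def run_to_run_sd(runs):
    """Return the per-record sample standard deviation across judge runs.

    `runs` is a list of score lists, one list per run, aligned so that
    index i refers to the same record in every run.
    """
    return [statistics.stdev(per_record) for per_record in zip(*runs)]


# Hypothetical scores, not data from the study.
scores = [1, 2, 3, 4, 5]
print(median_and_iqr(scores))

triplicate = [
    [4.0, 2.0],  # run 1: record A, record B
    [4.0, 3.0],  # run 2
    [4.0, 4.0],  # run 3
]
print(run_to_run_sd(triplicate))
```

A record whose three runs agree exactly has a standard deviation of zero, so low values across the dataset indicate that the judge assigns similar scores when it grades the same summary again.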
Implications for Veterinary AI Adoption
This study underscores the critical importance of developing LLM pipelines explicitly trained and designed for veterinary data. Product 1’s superior performance highlights the benefits of a domain-specific approach over relying on generalized LLM outputs. For platforms using general-purpose LLMs, careful prompt engineering and targeted post-processing are essential to ensure clinically useful content.
While LLM-based tools offer significant potential for efficiency and decision-making support in veterinary practice, practitioners must remain cautious about their limitations, especially regarding factual accuracy and reliability. The observed variability across platforms emphasizes the need to evaluate outputs critically and to let clinical context and performance data guide tool selection. Future research should expand beyond oncology, incorporate larger and more diverse datasets, and validate automated assessments against expert veterinary clinician evaluations to establish inter-rater reliability.
Conclusion
The comparative evaluation clearly shows that not all LLM platforms perform equally in veterinary medicine. Veterinary-trained models, like Product 1, offer clear benefits in producing more accurate, organized, and clinically relevant outputs. As LLMs become more integrated into veterinary practice, rigorous validation, transparency in model development, and critical appraisal of outputs will be crucial for their safe and effective adoption in animal healthcare.


