New Framework Assesses Large Language Models for Tourism Without Labeled Data

TLDR: LETToT is a new framework that evaluates large language models (LLMs) for tourism-specific questions without needing expensive labeled data. It uses expert-derived reasoning structures to assess LLM performance, showing that expert knowledge significantly improves evaluation and that smaller LLMs with reasoning capabilities can compete with larger ones in specialized domains, unlike what generic benchmarks suggest.

Evaluating large language models (LLMs) for specific applications, such as tourism question-answering (QA), faces two significant hurdles. First, creating annotated datasets for benchmarking is prohibitively costly and labor-intensive. Second, LLMs often generate “hallucinations” (plausible but incorrect information), which undermines their reliability, especially in domains like tourism where accuracy and real-time data are crucial for practical travel planning.

To address these issues, researchers Ruiyan Qi, Congding Wen, Weibo Zhou, Shangsong Liang, and Lingbo Li have introduced a framework called Label-Free Evaluation of LLM on Tourism using Expert Tree-of-Thought (LETToT). It assesses LLMs in the tourism domain by leveraging expert-derived reasoning structures instead of relying on expensive labeled data. LETToT is specifically designed for tourism QA, where queries often demand structured reasoning, integration of user preferences, and the ability to produce coherent travel plans.

How LETToT Works: A Two-Stage Approach

The LETToT framework operates in two main stages to provide a robust, label-free evaluation of LLMs.

The first stage focuses on the iterative validation and refinement of hierarchical Tree-of-Thought (ToT) components. This involves aligning these components with generic quality dimensions and incorporating expert feedback. Essentially, LLMs are prompted with expert-derived ToT structures, and their responses are cross-validated using an LLM-judge. This systematic process helps discover and validate optimal prompts for tourism QA, leading to an interpretable grading system optimized through Analytic Hierarchy Process (AHP)-weighted scoring. This stage demonstrated significant improvements in response quality, with gains ranging from 4.99% to 14.15% over baseline prompts across various dimensions like thematic relevance and context appropriateness.
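To make the AHP-weighted scoring idea more concrete, here is a minimal sketch of how pairwise importance judgments could be turned into dimension weights and then into a single grade. It is an illustration only, not the authors' implementation: the pairwise matrix, the judge scores, and the third dimension name (`logical_coherence`) are hypothetical, and the weights are approximated with the common row geometric mean method rather than any procedure confirmed by the paper.

```python
import math

def ahp_weights(pairwise):
    """Approximate AHP priority weights via normalized row geometric means."""
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]

def weighted_score(scores, weights):
    """Combine per-dimension scores into one grade using the AHP weights."""
    return sum(s * w for s, w in zip(scores, weights))

# Hypothetical quality dimensions; the first two are named in the article,
# the third is an assumption added for illustration.
dimensions = ["thematic_relevance", "context_appropriateness", "logical_coherence"]

# Hypothetical reciprocal pairwise comparison matrix:
# pairwise[i][j] says how much more important dimension i is than dimension j.
pairwise = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]

weights = ahp_weights(pairwise)

# Hypothetical LLM-judge scores for one response, one per dimension.
judge_scores = [8.0, 6.0, 7.0]
overall = weighted_score(judge_scores, weights)

for name, w in zip(dimensions, weights):
    print(f"{name}: weight {w:.3f}")
print(f"overall score: {overall:.3f}")
```

Because the weights come from explicit pairwise comparisons, each dimension's contribution to the final grade can be traced, which is what makes this kind of grading interpretable.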

In the second stage, the optimized ToT components from the first stage serve as guidelines for label-free LLM evaluation. Here, LLMs are given simple queries without any prior expert knowledge to assess their inherent suitability for the tourism domain. The evaluation then systematically identifies and scores expert ToT elements within the model’s response using a rule-based verifiable reward formula. This formula considers both the coverage of general tourism elements (like budget management, risk assessment, and route design) and specific elements relevant to particular tourism themes (such as cultural heritage or ecological protection). An efficiency factor is also included to quantify the information density of the text, favoring concise and informative outputs.
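The article does not give the reward formula itself, so the following is only a rough sketch of what a rule-based score combining general coverage, theme-specific coverage, and an efficiency factor might look like. Everything here is an assumption made for illustration: the cue phrases, the `alpha` mixing weight, the `target_density` threshold, and the simple substring matching are hypothetical and are not taken from the paper.

```python
def matched(text, elements):
    """Return the names of elements with at least one cue phrase in the text."""
    lowered = text.lower()
    return [name for name, cues in elements.items()
            if any(cue in lowered for cue in cues)]

def reward(text, general, specific, alpha=0.5, target_density=5.0):
    """Hypothetical reward: weighted coverage scaled by an efficiency factor.

    alpha          -- assumed mix between general and theme-specific coverage
    target_density -- assumed matched elements per 100 words that earns full credit
    """
    n_words = len(text.split())
    if n_words == 0:
        return 0.0
    g_hits = matched(text, general)
    s_hits = matched(text, specific)
    g_cov = len(g_hits) / len(general)
    s_cov = len(s_hits) / len(specific)
    # Information density: matched expert elements per 100 words, capped at 1.0.
    density = 100.0 * (len(g_hits) + len(s_hits)) / n_words
    efficiency = min(1.0, density / target_density)
    return (alpha * g_cov + (1.0 - alpha) * s_cov) * efficiency

# Element names follow the article; the cue phrases are illustrative assumptions.
GENERAL = {
    "budget_management": ["budget", "cost"],
    "risk_assessment": ["risk", "safety"],
    "route_design": ["route", "itinerary"],
}
SPECIFIC = {
    "cultural_heritage": ["heritage", "temple", "museum"],
    "ecological_protection": ["ecological", "conservation"],
}

response = (
    "Plan a two-day itinerary around the old town museum. "
    "Keep the budget under 200 dollars and check safety notices before hiking."
)
padded = response + " filler" * 200  # same elements, far lower density

print(reward(response, GENERAL, SPECIFIC))
print(reward(padded, GENERAL, SPECIFIC))
```

In this sketch the padded response covers exactly the same elements but scores lower, which mirrors the stated intent of favoring concise and informative outputs. A real system would likely need more robust element detection than substring matching.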

Key Insights from LETToT’s Evaluation

The study evaluated five open-source LLMs, ranging from 32 billion to 671 billion parameters, revealing several important findings:

**Enhanced Performance with Expert Knowledge:** Incorporating domain-specific expert knowledge through optimized prompts significantly improved LLM performance in tourism QA, showing gains between 4.99% and 14.15% across various quality metrics. Reasoning-enhanced models, such as DeepSeek-R1-Distill-32B and DeepSeek-R1-Distill-Llama-70B, showed the most substantial improvements, demonstrating their sensitivity to expert-guided prompting.

**Scaling Laws and Reasoning Capabilities:** While larger models like DeepSeek-V3 generally lead in overall performance, the research found that smaller models with enhanced reasoning capabilities, such as DeepSeek-R1-Distill-Llama-70B, can effectively narrow this performance gap. For models under 72 billion parameters, those with explicit reasoning architectures consistently outperformed their non-reasoning counterparts in terms of accuracy and conciseness.

**Domain-Specific vs. Generic Evaluation:** A crucial insight from LETToT is the discrepancy between its domain-specific rankings and those from popular generic LLM leaderboards, like HuggingFace’s Open LLM Leaderboard. For instance, while DeepSeek reasoning models clearly outperformed Qwen models in LETToT’s tourism-specific evaluation, their positions were reversed on the generic leaderboard. This highlights LETToT’s ability to provide a more accurate and relevant assessment for specialized domains, emphasizing the importance of reasoning abilities in fields requiring deep expertise.

The LETToT framework represents a significant step forward in evaluating LLMs for specialized applications. By replacing costly manual annotations with expert-guided reasoning hierarchies, it offers a scalable and domain-oriented paradigm for assessing tourism QA systems. This framework can potentially be extended to other fields that demand deep domain expertise, paving the way for more accurate and relevant LLM evaluations beyond generic benchmarks.

For more detailed information, you can refer to the full research paper: LETToT: Label-Free Evaluation of Large Language Models On Tourism Using Expert Tree-of-Thought.
