New Framework Assesses Large Language Models for Tourism Without Labeled Data

TLDR: LETToT is a new framework that evaluates large language models (LLMs) for tourism-specific questions without needing expensive labeled data. It uses expert-derived reasoning structures to assess LLM performance, showing that expert knowledge significantly improves evaluation and that smaller LLMs with reasoning capabilities can compete with larger ones in specialized domains, unlike what generic benchmarks suggest.

Evaluating large language models (LLMs) for specific applications, such as tourism question-answering, faces two main hurdles. The first is the prohibitive cost and effort of creating annotated datasets for benchmarking. The second is that LLMs often generate “hallucinations” (plausible but incorrect information), which undermines their reliability, especially in domains like tourism where accuracy and real-time data are crucial for practical travel planning.

To address these issues, researchers Ruiyan Qi, Congding Wen, Weibo Zhou, Shangsong Liang, and Lingbo Li have introduced a framework called Label-Free Evaluation of LLM on Tourism using Expert Tree-of-Thought (LETToT). It assesses LLMs in the tourism domain using expert-derived reasoning structures instead of expensive labeled data. LETToT is designed specifically for tourism QA, where queries often demand structured reasoning, integration of user preferences, and the ability to produce coherent travel plans.

How LETToT Works: A Two-Stage Approach

The LETToT framework operates in two main stages to provide a robust, label-free evaluation of LLMs.

The first stage focuses on the iterative validation and refinement of hierarchical Tree-of-Thought (ToT) components. This involves aligning these components with generic quality dimensions and incorporating expert feedback. Essentially, LLMs are prompted with expert-derived ToT structures, and their responses are cross-validated using an LLM-judge. This systematic process helps discover and validate optimal prompts for tourism QA, leading to an interpretable grading system optimized through Analytic Hierarchy Process (AHP)-weighted scoring. This stage demonstrated significant improvements in response quality, with gains ranging from 4.99% to 14.15% over baseline prompts across various dimensions like thematic relevance and context appropriateness.
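The AHP-weighted scoring mentioned above can be illustrated with a minimal sketch. The article does not give the paper's comparison matrix or its full list of dimensions, so the two dimensions shown, the pairwise judgment values, and the function names `ahp_weights` and `weighted_score` are all assumptions for illustration. The sketch uses the row geometric-mean approximation of AHP priorities, which may differ from the method the authors used.

```python
import math

def ahp_weights(pairwise):
    """Approximate AHP priority weights with the row geometric-mean method.

    pairwise[i][j] states how much more important dimension i is than j;
    pairwise[j][i] is expected to hold the reciprocal.
    """
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]

def weighted_score(scores, weights):
    """Combine per-dimension judge scores into one weighted grade."""
    return sum(s * w for s, w in zip(scores, weights))

# Hypothetical judgment: thematic relevance is twice as important
# as context appropriateness (values are illustrative, not from the paper).
pairwise = [
    [1.0, 2.0],   # thematic relevance vs. (itself, context appropriateness)
    [0.5, 1.0],   # context appropriateness vs. (thematic relevance, itself)
]
weights = ahp_weights(pairwise)

# Hypothetical LLM-judge scores for one response on the two dimensions.
grade = weighted_score([4.0, 3.0], weights)
print(weights, grade)
```

The weights always sum to 1, so the final grade stays on the same scale as the per-dimension scores.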

In the second stage, the optimized ToT components from the first stage serve as guidelines for label-free LLM evaluation. Here, LLMs are given simple queries without any prior expert knowledge to assess their inherent suitability for the tourism domain. The evaluation then systematically identifies and scores expert ToT elements within the model’s response using a rule-based verifiable reward formula. This formula considers both the coverage of general tourism elements (like budget management, risk assessment, and route design) and specific elements relevant to particular tourism themes (such as cultural heritage or ecological protection). An efficiency factor is also included to quantify the information density of the text, favoring concise and informative outputs.
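A rule-based reward of this kind might look roughly like the sketch below. The article does not state the actual formula, so the keyword lists, the mixing weight `alpha`, the `target_words` length budget, and the multiplicative efficiency factor are all assumptions made for illustration. Only the element names (budget management, risk assessment, route design, cultural heritage, ecological protection) come from the text above.

```python
def coverage(text, elements):
    """Fraction of elements with at least one keyword found in the text."""
    lowered = text.lower()
    hits = sum(1 for kws in elements.values() if any(k in lowered for k in kws))
    return hits / len(elements) if elements else 0.0

def label_free_reward(text, general, specific, alpha=0.5, target_words=150):
    """Hypothetical reward: mixed element coverage scaled by an efficiency factor.

    alpha balances general vs. theme-specific coverage; responses longer than
    target_words are penalized in proportion to their extra length.
    """
    cov = alpha * coverage(text, general) + (1 - alpha) * coverage(text, specific)
    n_words = len(text.split())
    efficiency = min(1.0, target_words / n_words) if n_words else 0.0
    return cov * efficiency

# Illustrative keyword lists (assumed, not taken from the paper).
general = {
    "budget management": ["budget", "cost"],
    "risk assessment": ["risk", "safety", "insurance"],
    "route design": ["route", "itinerary"],
}
specific = {
    "cultural heritage": ["heritage", "museum", "temple"],
    "ecological protection": ["ecological", "conservation"],
}

response = ("Plan a budget of 200 per day, follow the route through the old "
            "town, and visit the heritage temple.")
reward = label_free_reward(response, general, specific)

# The same content padded to 20x the length scores lower on efficiency.
padded_reward = label_free_reward(" ".join([response] * 20), general, specific)
print(reward, padded_reward)
```

Keyword matching is a deliberately crude stand-in for identifying expert ToT elements; the point is only that the score can be computed from the response text alone, with no reference answer.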

Key Insights from LETToT’s Evaluation

The study evaluated five open-source LLMs, ranging from 32 billion to 671 billion parameters, revealing several important findings:

  • **Enhanced Performance with Expert Knowledge:** Incorporating domain-specific expert knowledge through optimized prompts significantly improved LLM performance in tourism QA, showing gains between 4.99% and 14.15% across various quality metrics. Reasoning-enhanced models, such as DeepSeek-R1-Distill-32B and DeepSeek-R1-Distill-Llama-70B, showed the most substantial improvements, demonstrating their sensitivity to expert-guided prompting.
  • **Scaling Laws and Reasoning Capabilities:** While larger models like DeepSeek-V3 generally lead in overall performance, the research found that smaller models with enhanced reasoning capabilities, such as DeepSeek-R1-Distill-Llama-70B, can effectively narrow this performance gap. For models under 72 billion parameters, those with explicit reasoning architectures consistently outperformed their non-reasoning counterparts in terms of accuracy and conciseness.
  • **Domain-Specific vs. Generic Evaluation:** A crucial insight from LETToT is the discrepancy between its domain-specific rankings and those from popular generic LLM leaderboards, like HuggingFace’s Open LLM Leaderboard. For instance, while DeepSeek reasoning models clearly outperformed Qwen models in LETToT’s tourism-specific evaluation, their positions were reversed on the generic leaderboard. This highlights LETToT’s ability to provide a more accurate and relevant assessment for specialized domains, emphasizing the importance of reasoning abilities in fields requiring deep expertise.

The LETToT framework replaces costly manual annotations with expert-guided reasoning hierarchies, offering a scalable, domain-oriented way to assess tourism QA systems. It could potentially be extended to other fields that demand deep domain expertise, where generic benchmarks may not reflect how models actually perform.

For more detailed information, you can refer to the full research paper: LETToT: Label-Free Evaluation of Large Language Models On Tourism Using Expert Tree-of-Thought.

Ananya Rao
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
