Empowering Customer Support: How Airbnb's Agent-in-the-Loop System Drives Continuous AI Improvement

TLDR: Airbnb’s Agent-in-the-Loop (AITL) framework uses real-time human feedback from customer support agents to continuously improve LLM-based systems. By integrating agent preferences, adoption decisions, knowledge relevance checks, and missing knowledge identification directly into live operations, AITL reduces model update cycles from months to weeks. A production pilot showed gains in retrieval accuracy (+11.7% recall@75, +14.8% precision@8), generation helpfulness (+8.4%), and agent adoption (+4.5%), suggesting that embedding human feedback in live workflows is an effective way to keep AI systems adaptive in dynamic environments.

In the rapidly evolving landscape of customer support, large language models (LLMs) are becoming indispensable tools. However, these models often struggle to keep pace with ever-changing product features, customer preferences, and company policies. Traditional methods of updating LLMs, which rely on infrequent, batch-processed annotations, can take months, leading to outdated information and reduced effectiveness.

To address this challenge, researchers from Airbnb have introduced a framework called Agent-in-the-Loop (AITL). The system establishes a continuous “data flywheel” that integrates human feedback directly into live customer support operations, so that LLM-based systems can be updated in weeks rather than months.

The Agent-in-the-Loop Framework

AITL moves beyond standard offline annotation processes by embedding four crucial types of feedback directly into the daily workflow of customer support agents:

Pairwise response preferences: Agents indicate which of two suggested LLM responses is better.
Agent adoption decisions and rationales: Agents record whether they adopted an LLM-generated response and explain why.
Knowledge relevance checks: Agents verify whether the knowledge retrieved by the system is actually relevant and accurate for the customer’s query.
Identification of missing knowledge: Agents flag when essential information, like new policies or best practices, is not available in the system’s knowledge base.

These real-time feedback signals are fed directly into the model update process. This cuts retraining cycles from several months to a few weeks, keeping the LLM system current.
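One way to picture the four feedback types is as fields of a single per-case record. The sketch below is only an illustration: the class name `AgentFeedback`, its field names, and the `signals_present` helper are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Preference(Enum):
    """Which of the two suggested responses the agent preferred."""
    RESPONSE_A = "a"
    RESPONSE_B = "b"


@dataclass
class AgentFeedback:
    """Hypothetical record of one agent's annotations for one support case."""
    case_id: str
    preferred: Optional[Preference] = None   # pairwise response preference
    adopted: Optional[bool] = None           # adoption decision
    adoption_rationale: str = ""             # free-text reason for the decision
    relevant_doc_ids: List[str] = field(default_factory=list)    # knowledge judged relevant
    irrelevant_doc_ids: List[str] = field(default_factory=list)  # knowledge judged not relevant
    missing_knowledge_note: str = ""         # what the knowledge base lacked

    def signals_present(self) -> List[str]:
        """Return the names of the feedback types this record actually carries."""
        signals = []
        if self.preferred is not None:
            signals.append("preference")
        if self.adopted is not None:
            signals.append("adoption")
        if self.relevant_doc_ids or self.irrelevant_doc_ids:
            signals.append("knowledge_relevance")
        if self.missing_knowledge_note:
            signals.append("missing_knowledge")
        return signals


# Example: the agent preferred response B, adopted it, and flagged a gap.
fb = AgentFeedback(
    case_id="case-001",
    preferred=Preference.RESPONSE_B,
    adopted=True,
    adoption_rationale="Accurate, needed only a greeting change",
    missing_knowledge_note="No article covers the new policy",
)
print(fb.signals_present())
```

Keeping every field optional reflects the idea that an agent may supply only some of the signals on a given case.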

How AITL Works

The AITL architecture involves several key steps. When a customer sends a query, the LLM-based system retrieves relevant knowledge and generates response candidates. Support agents then evaluate these suggestions, providing the four types of annotations mentioned above. These annotations are reviewed by both human experts and an LLM-based verifier to ensure quality. Finally, this collected feedback is integrated into a continuous learning pipeline, where retrieval, ranking, and generation models are periodically retrained and evaluated, completing the flywheel.
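The steps described above can be sketched as a single function that takes each stage as a callable. This is a minimal illustration under stated assumptions, not Airbnb's implementation: `flywheel_step`, the `demo_*` stubs, the tiny `KB` dictionary, and the record layout are all hypothetical, and `verify` merely stands in for the combined human-expert and LLM-based review.

```python
from typing import Callable, Dict, List


def flywheel_step(
    query: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], List[str]],
    annotate: Callable[[str, List[str], List[str]], Dict],
    verify: Callable[[Dict], bool],
    training_buffer: List[Dict],
) -> bool:
    """Run one query through the loop; return True if its feedback was kept."""
    docs = retrieve(query)                          # 1. retrieve knowledge
    candidates = generate(query, docs)              # 2. generate response candidates
    annotation = annotate(query, docs, candidates)  # 3. agent annotates
    if not verify(annotation):                      # 4. quality review
        return False
    training_buffer.append(                         # 5. store for periodic retraining
        {"query": query, "docs": docs,
         "candidates": candidates, "annotation": annotation}
    )
    return True


# Hypothetical stand-ins for the real components.
KB = {
    "kb-1": "refund policy for cancelled stays",
    "kb-2": "how to update a payment method",
}


def demo_retrieve(query: str) -> List[str]:
    # Naive keyword overlap; a real system would use a trained retriever.
    words = set(query.lower().split())
    return [doc_id for doc_id, text in KB.items() if words & set(text.split())]


def demo_generate(query: str, docs: List[str]) -> List[str]:
    return [f"Draft A citing {docs}", f"Draft B citing {docs}"]


def demo_annotate(query: str, docs: List[str], candidates: List[str]) -> Dict:
    return {"preferred": 0, "adopted": True,
            "relevant_docs": docs, "missing_knowledge": ""}


def demo_verify(annotation: Dict) -> bool:
    return annotation["adopted"] is not None and annotation["preferred"] in (0, 1)


buffer: List[Dict] = []
accepted = flywheel_step("What is the refund policy?", demo_retrieve,
                         demo_generate, demo_annotate, demo_verify, buffer)
print(accepted, len(buffer))
```

Passing the stages in as callables keeps the loop itself independent of any particular retriever, generator, or reviewer, so each can be retrained or swapped without touching the control flow.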

A crucial component of AITL is its Unified Knowledge Base, which consolidates diverse resources like customer guides, FAQs, internal policies, and historical cases. This rich, metadata-enhanced knowledge base facilitates real-time annotation and retrieval for agents.

Significant Improvements in a Production Pilot

A production pilot of the AITL framework was conducted with 40 US-based customer support agents. The results were compelling, demonstrating significant improvements across several key metrics:

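Retrieval Accuracy: An increase of 11.7% in recall@75 (the share of all relevant items that appear among the top 75 retrieved) and 14.8% in precision@8 (the share of the top 8 retrieved items that are relevant), meaning the system became better at finding relevant information.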
Generation Quality: An 8.4% improvement in helpfulness, indicating that the LLM-generated responses were more useful to customers.
Agent Adoption Rates: A 4.5% increase in agents choosing to use the LLM’s suggestions, highlighting greater trust and utility.
Citation Correctness: A 38.1% improvement, indicating that responses were more often grounded in accurate sources.
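For readers unfamiliar with the retrieval metrics named above, the following sketch computes recall@k and precision@k using their common textbook definitions. The document IDs are made up for illustration, and the article does not spell out the exact evaluation protocol used in the pilot, so this is not a reproduction of the reported numbers.

```python
from typing import List, Set


def precision_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved items that are relevant (divides by k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top-k retrieved items."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


# Made-up ranking and relevance labels for one query.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}

print(precision_at_k(ranked, relevant, k=2))  # 1 of the top 2 is relevant
print(recall_at_k(ranked, relevant, k=4))     # 2 of the 3 relevant items are in the top 4
```

A large cutoff such as 75 suits recall, which asks whether relevant knowledge was found at all, while a small cutoff such as 8 suits precision, which asks how clean the top of the list is.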

These outcomes underscore the value of embedding human feedback directly into operational workflows for continuous refinement of LLM-based customer support systems. The paper is titled “Agent-in-the-Loop: A Data Flywheel for Continuous Improvement in LLM-based Customer Support.”

Optimizing Annotation and Future Directions

The research also explored ways to optimize the annotation process. An ablation study on annotation timing revealed that while identifying missing knowledge benefits significantly from immediate annotation, other feedback types (preference, adoption, knowledge relevance) can be delayed without much loss in quality. This suggests a hybrid approach to balance efficiency with strict service level agreements (SLAs).
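The hybrid approach suggested by the ablation could look roughly like the scheduler below, which prompts the agent immediately for missing-knowledge flags and queues the other feedback types for later. The class `AnnotationScheduler`, the type names, and the return values are hypothetical; the paper's actual scheduling mechanism is not described in this article.

```python
from collections import deque
from typing import Deque, List, Tuple

# Per the ablation summary: only missing-knowledge flags need immediate capture.
IMMEDIATE_TYPES = {"missing_knowledge"}
DEFERRABLE_TYPES = {"preference", "adoption", "knowledge_relevance"}


class AnnotationScheduler:
    """Hypothetical router that defers annotation requests when it is safe to."""

    def __init__(self) -> None:
        self.deferred: Deque[Tuple[str, str]] = deque()

    def submit(self, case_id: str, feedback_type: str) -> str:
        """Return 'prompt_now' or 'deferred' for one annotation request."""
        if feedback_type in IMMEDIATE_TYPES:
            return "prompt_now"
        if feedback_type in DEFERRABLE_TYPES:
            self.deferred.append((case_id, feedback_type))
            return "deferred"
        raise ValueError(f"unknown feedback type: {feedback_type}")

    def drain(self, max_items: int) -> List[Tuple[str, str]]:
        """Pop up to max_items deferred requests, oldest first."""
        batch = []
        while self.deferred and len(batch) < max_items:
            batch.append(self.deferred.popleft())
        return batch


scheduler = AnnotationScheduler()
print(scheduler.submit("case-001", "missing_knowledge"))
print(scheduler.submit("case-001", "preference"))
print(scheduler.submit("case-002", "adoption"))
print(scheduler.drain(max_items=1))
```

Draining in small batches lets deferred annotations be requested during quieter periods, so live handling times are less affected.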

Furthermore, the study confirmed the value of an LLM-based filter in the data aggregation stage, which acts as a quality gate to minimize inconsistencies and hallucinations, particularly improving retrieval recall and citation accuracy.
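A quality gate of this kind can be sketched as a filter that scores each aggregated record and keeps only those above a threshold. Everything here is an assumption made for illustration: `quality_gate`, the record fields, and especially `stub_judge`, which is a deterministic stand-in for what would in practice be an LLM call.

```python
from typing import Callable, Dict, List, Tuple


def quality_gate(
    records: List[Dict],
    judge: Callable[[Dict], float],
    threshold: float = 0.5,
) -> Tuple[List[Dict], List[Dict]]:
    """Split records into (kept, rejected) by comparing judge scores to a threshold."""
    kept: List[Dict] = []
    rejected: List[Dict] = []
    for record in records:
        if judge(record) >= threshold:
            kept.append(record)
        else:
            rejected.append(record)
    return kept, rejected


def stub_judge(record: Dict) -> float:
    """Stand-in for an LLM judge: 1.0 if every cited document was retrieved, else 0.0."""
    cited = set(record["cited_doc_ids"])
    retrieved = set(record["retrieved_doc_ids"])
    return 1.0 if cited <= retrieved else 0.0


records = [
    {"case_id": "case-001", "cited_doc_ids": ["kb-1"], "retrieved_doc_ids": ["kb-1", "kb-2"]},
    {"case_id": "case-002", "cited_doc_ids": ["kb-9"], "retrieved_doc_ids": ["kb-1"]},
]
kept, rejected = quality_gate(records, stub_judge)
print([r["case_id"] for r in kept], [r["case_id"] for r in rejected])
```

Returning the rejected records rather than discarding them leaves room for human review of what the filter removed.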

Looking ahead, the authors propose three directions: scaling optional agent feedback through lightweight micro-annotations and active sampling; integrating AITL more deeply into agent-facing tools to evaluate productivity; and moving toward fuller automation by leveraging simulations and AI judges, while retaining human oversight for critical aspects like safety and policy adherence.

While AITL presents clear advantages, the authors acknowledge limitations, including potential agent fatigue from prolonged real-time annotations, the study’s focus on English-language support, and the relatively short duration of the experiment, which limits understanding of long-term scalability and evolution of annotation practices.

Overall, the AITL framework represents a significant step forward in making LLM-based customer support systems more adaptive, accurate, and continuously improving by effectively harnessing the invaluable insights of human agents.
