
Accelerating LLM Function Calling: A Data-Driven Approach to Reduce Latency

TLDR: ODIA is a novel approach that uses online user interaction data to accelerate LLM-based Function Calling by automatically identifying ‘simple queries’ and distilling knowledge from larger models into smaller ones. This dual-model system significantly reduces response latency, cutting expected latency by 45% and median latency by 78%, and handles roughly 60% of traffic with a smaller model while maintaining accuracy. The method requires minimal human intervention and continuously improves through automated data collection and model updating, making it a practical solution for production environments.

Large Language Models (LLMs) have become incredibly powerful, enabling applications to interact with external systems through a crucial technique called Function Calling. This allows LLMs to select appropriate functions and generate necessary parameters based on what a user asks, effectively bridging the gap between natural language and actionable operations.

However, a significant challenge with LLM-based Function Calling is its latency. Users often experience delays of 1-2 seconds, sometimes peaking at 2-3 seconds during busy periods. Unlike streaming text generation, Function Calling often feels like a ‘black box’ process, making these wait times particularly noticeable and impacting user experience.

Traditional methods for accelerating LLM inference, such as quantization, offer limited improvements for Function Calling specifically. Manually creating specialized models for simpler queries is also resource-intensive and hard to scale.

Introducing ODIA: A Smart Approach to Speed Up Function Calling

Researchers from ByteDance Inc. have introduced a novel approach called Oriented Distillation for Inline Acceleration (ODIA) to tackle this latency problem. ODIA leverages real-world user interaction data to automatically accelerate Function Calling. The core idea is to identify ‘simple queries’ from live user traffic and then transfer knowledge from larger, more complex LLMs to smaller, faster ones.

This method has shown impressive results, cutting expected response latency by 45% and median latency by 78%, all while maintaining high accuracy. In a real-world deployment within a music application, the smaller model successfully handled 60% of the user traffic with almost no loss in accuracy. A key advantage of ODIA is its minimal need for human intervention; it continuously improves through automated data collection and model updates, making it highly practical for production environments.

How ODIA Works: The Dual-Model System

ODIA operates with two main components: an offline pipeline and an online serving system. The offline pipeline is responsible for automatically identifying ‘simple queries’ from production data and training specialized models. The online system then efficiently directs incoming user queries to either the small or the large model based on their complexity.

The process involves several steps:

  • Data Accumulation: Collecting Function Calling data from live traffic, including user queries, conversation history, and tool definitions.
  • Automatic Filtering: Using algorithms to identify ‘simple’ queries that smaller models can reliably handle.
  • Small Model Training: Training two specialized models: an intent classification model to quickly decide if a query is ‘simple,’ and a parameter generation model to produce Function Calling outputs for these simple queries.
  • Online Acceleration: In live production, the classification model routes queries, with complex ones falling back to the larger, more powerful LLM.
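
Put concretely, the online routing step can be sketched roughly as below. The function names, the confidence threshold, and the stubbed model callables are illustrative assumptions, not the paper’s actual interfaces.

```python
def route_query(query, classify_intent, small_model, large_model, threshold=0.9):
    """Send 'simple' queries to the small model; fall back to the large LLM otherwise."""
    label, confidence = classify_intent(query)        # fast gate, targeted at <50 ms
    if label == "simple" and confidence >= threshold:
        return small_model(query)                      # low-latency Function Calling path
    return large_model(query)                          # full-capability fallback

# Stub usage: in production these callables would be real model endpoints.
result = route_query(
    "play Jay Chou's songs",
    classify_intent=lambda q: ("simple", 0.97),
    small_model=lambda q: {"name": "audio_search", "args": {"artist": "Jay Chou"}},
    large_model=lambda q: {"name": "full_llm_function_call", "args": {"query": q}},
)
print(result)
```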

Defining ‘Simple’ Queries

A crucial aspect of ODIA is accurately defining what constitutes a ‘simple’ query. Initially, it was thought that queries consistently calling the same function would be simple. However, even semantically similar queries might call different functions due to LLM instability. The refined definition focuses on clusters of semantically similar queries that consistently result in calls to the same function. For example, phrases like “play Jay Chou’s songs” and “play some music by Jay Chou” consistently trigger the same audio search function, making them simple. Conversely, ambiguous queries like “switch” or context-dependent ones like “more of these” are considered complex.
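
In code, this refined criterion could be checked along the following lines; the cluster-size and consistency thresholds, and the input format (a list of logged function names for one cluster), are assumptions for illustration rather than values from the paper.

```python
from collections import Counter

def is_simple_cluster(logged_function_names, min_size=20, min_consistency=0.95):
    """Treat a cluster of semantically similar queries as 'simple' only if (nearly)
    all of its logged Function Calling results invoke the same function."""
    if len(logged_function_names) < min_size:
        return False                               # too little evidence to trust the cluster
    _, top_count = Counter(logged_function_names).most_common(1)[0]
    return top_count / len(logged_function_names) >= min_consistency

# Example: a stable cluster vs. an unstable one.
print(is_simple_cluster(["audio_search"] * 50))                          # True
print(is_simple_cluster(["audio_search"] * 30 + ["video_search"] * 20))  # False
```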

Identifying Similar Queries

To group similar queries, ODIA uses two complementary approaches:

  • Semantic Similarity-Based Clustering: User queries are converted into vector representations using embedding models, then grouped with hierarchical clustering, which brings together queries with similar meanings despite different phrasing (see the sketch after this list).
  • Named Entity Recognition (NER) Based Clustering: This approach extracts entities (such as song titles and artists) and converts queries into templates (e.g., “play [artist]’s [song]”). This lets the system learn general patterns rather than just specific phrases.
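
As a rough illustration of both strategies, the sketch below embeds and hierarchically clusters a handful of queries, then builds an entity template; the embedding model, distance threshold, and hand-written entity mapping are assumptions, not details taken from the paper.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sentence_transformers import SentenceTransformer

queries = ["play Jay Chou's songs", "play some music by Jay Chou", "switch"]

# (1) Semantic similarity: embed queries, then hierarchically cluster on cosine distance.
embedder = SentenceTransformer("all-MiniLM-L6-v2")           # assumed embedding model
vectors = embedder.encode(queries, normalize_embeddings=True)
tree = linkage(pdist(vectors, metric="cosine"), method="average")
cluster_ids = fcluster(tree, t=0.3, criterion="distance")    # threshold is illustrative
print(list(zip(queries, cluster_ids)))                        # similar phrasings share an id

# (2) NER-based templating: replace recognized entities with slot names so the
# system learns general patterns. A real NER model would supply the entity spans.
def to_template(query: str, entities: dict) -> str:
    for surface, slot in entities.items():
        query = query.replace(surface, f"[{slot}]")
    return query

print(to_template("play Jay Chou's songs", {"Jay Chou": "artist"}))  # play [artist]'s songs
```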

The Specialized Models

The ODIA system employs two key models:

  • Intent Routing Model: This model acts as a gatekeeper, deciding whether a query can be handled by the small model or needs to be routed to the large one. It must be highly accurate (over 95%) and extremely fast (under 50ms).
  • Parameter Generation Model: This is the smaller, faster model that actually performs the Function Calling for simple queries. It balances context understanding with a low latency requirement (under 300ms). The researchers selected deepseek-coder-1.3B for this role due to its optimal balance of performance and accuracy.
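
The contract between the router and the parameter generation model might look roughly like this; the compact prompt format, tool schema, and the stubbed generate() call are assumptions standing in for the fine-tuned deepseek-coder-1.3B model.

```python
import json

TOOL = {
    "name": "audio_search",
    "description": "Search and play audio content",
    "parameters": {"artist": "string", "type": "string"},   # short, single-token names
}

def build_prompt(query: str, tool: dict) -> str:
    # Streamlined prompt: only what the small model needs to handle a simple query.
    return f"Tool: {json.dumps(tool)}\nQuery: {query}\nCall:"

def generate(prompt: str) -> str:
    # Stub standing in for the fine-tuned 1.3B model's completion.
    return '{"name": "audio_search", "args": {"artist": "Jay Chou", "type": "song"}}'

call = json.loads(generate(build_prompt("play Jay Chou's songs", TOOL)))
print(call["name"], call["args"])
```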

Continuous Improvement and Optimization

ODIA is designed for continuous improvement through incremental updates, processing data in batches to unify initial training with daily updates. This allows the system to adapt to new query patterns and maintain high coverage over time.

Beyond model substitution, the team also optimized the small model’s performance by streamlining system prompts, simplifying output formats, and optimizing token usage by replacing multi-token parameter names with single-token alternatives. For instance, changing “media_type” to “type” reduced token count without losing meaning.
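
A minimal sketch of that token optimization, assuming a simple rename map applied to the tool schema (only the “media_type” to “type” example comes from the article; the rest is illustrative):

```python
RENAME = {"media_type": "type", "artist_name": "artist"}   # second entry is hypothetical

original_schema = {"name": "audio_search",
                   "parameters": {"media_type": "string", "artist_name": "string"}}

# Shorter parameter names in the schema the small model sees, to cut prompt tokens.
compact_schema = {
    "name": original_schema["name"],
    "parameters": {RENAME.get(k, k): v for k, v in original_schema["parameters"].items()},
}

def restore(args: dict) -> dict:
    """Map the compact names emitted by the small model back to the original schema."""
    reverse = {v: k for k, v in RENAME.items()}
    return {reverse.get(k, k): v for k, v in args.items()}

print(compact_schema)
print(restore({"type": "song", "artist": "Jay Chou"}))
```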

Real-World Impact

The deployment of ODIA in a production music application yielded significant results. The small model, handling about 60% of traffic, led to an expected latency reduction of 45% (from 1600ms to 870ms) and a median latency reduction of 78% (from 1600ms to 350ms). Crucially, this was achieved with negligible degradation in accuracy for function selection and parameter extraction. Furthermore, using the smaller 1.3B parameter model significantly reduced computational resource requirements, leading to cost savings.

The success of ODIA demonstrates a practical and effective way to accelerate LLM Function Calling in real-world applications. By intelligently leveraging online user data and knowledge distillation, this approach offers a promising direction for balancing performance, efficiency, and accuracy as LLMs become more integrated into our daily lives. You can read the full research paper here.

