
Decomposing User Intent for Efficient On-Device AI

TLDR: A new two-stage method called “Decomposed-FT” significantly improves how small, on-device AI models understand user intentions from app interactions. By first summarizing individual actions and then combining these summaries, this approach allows smaller models to achieve better accuracy, even outperforming larger, more complex AI systems, while maintaining privacy and low latency.

Understanding what users intend to do while interacting with their devices is a critical challenge for developing intelligent agents. While large language models (LLMs) excel at this task, they typically require significant computational resources, are costly to run, and raise privacy concerns because user data is processed in data centers. This makes them less suitable for on-device applications where privacy, low cost, and minimal latency are paramount.

A recent research paper, titled “Small Models, Big Results: Achieving Superior Intent Extraction through Decomposition,” introduces a groundbreaking approach to tackle this problem. Authored by Danielle Cohen, Yoni Halpern, Noam Kahlon, Joel Oren, Omri Berkovitch, Sapir Caduri, Ido Dagan, and Anatoly Efros from Google and Bar-Ilan University, the paper details a novel two-stage method that enables smaller, resource-constrained models to accurately infer user intent, often outperforming even larger models.

The core of their innovation lies in a decomposed strategy. Instead of feeding an entire sequence of user interactions directly to a single model, which can overwhelm smaller systems, they break the task into two manageable stages:

Stage 1: Structured Interaction Summarization

In the first stage, the model processes each individual user interaction—comprising a screenshot of the device interface and the user’s action—to create a concise summary. This summary captures key information about the screen context and the specific action taken. To enhance accuracy, the model also considers the preceding and succeeding interactions, providing crucial context to resolve ambiguities. The summaries are structured to focus on relevant details and avoid speculative interpretations of user intent.
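To make the flow concrete, here is a minimal Python sketch of what this first stage could look like. It is not the paper’s implementation: the Interaction and InteractionSummary structures, the prompt wording, and the call_on_device_mllm placeholder are assumptions standing in for whatever small multimodal model and summary schema the authors actually use.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Interaction:
    screenshot: bytes   # raw screen capture at the moment of the action
    action: str         # e.g. 'tap the "Add to cart" button'

@dataclass
class InteractionSummary:
    screen_context: str        # what the screen shows (app, page, salient elements)
    action_description: str    # what the user did on that screen

def call_on_device_mllm(prompt: str, image: bytes) -> str:
    """Placeholder for the small on-device multimodal model."""
    raise NotImplementedError("Plug in the actual model call here.")

def summarize_interaction(session: List[Interaction], i: int) -> InteractionSummary:
    """Stage 1: summarize one interaction, using its neighbors to resolve ambiguity."""
    prev_action = session[i - 1].action if i > 0 else "none"
    next_action = session[i + 1].action if i + 1 < len(session) else "none"
    prompt = (
        "Summarize this single user interaction in a structured way.\n"
        "Describe only what is visible and what was done; do not speculate about the user's goal.\n"
        f"Previous action: {prev_action}\n"
        f"Current action: {session[i].action}\n"
        f"Next action: {next_action}\n"
        "Answer with two lines, 'screen:' and 'action:'."
    )
    raw = call_on_device_mllm(prompt, session[i].screenshot)
    screen_line, action_line = raw.split("\n", 1)  # assumes the model follows the two-line format
    return InteractionSummary(
        screen_context=screen_line.removeprefix("screen:").strip(),
        action_description=action_line.removeprefix("action:").strip(),
    )
```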


Stage 2: Session-Level Intent Extraction

The summaries from all individual interactions are then aggregated and fed into a second, fine-tuned model. This model’s task is to synthesize these summaries into a single, overarching description of the user’s intent for the entire session. A crucial aspect of this stage is a technique called “label refinement” during training. This process ensures that the model learns to infer intents based solely on the information present in the interaction summaries, preventing it from generating details not supported by the input data.
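Continuing the same hypothetical sketch, the second stage could aggregate those summaries into a single prompt for the fine-tuned model. Again, the prompt text and the call_finetuned_intent_model placeholder are illustrative assumptions rather than the authors’ actual interface; label refinement happens at training time and is only noted in a comment here.

```python
from typing import List

def call_finetuned_intent_model(prompt: str) -> str:
    """Placeholder for the second-stage model, fine-tuned for session-level intent."""
    raise NotImplementedError("Plug in the fine-tuned text model here.")

def extract_session_intent(summaries: List["InteractionSummary"]) -> str:
    """Stage 2: fuse the ordered per-interaction summaries into one overall intent."""
    lines = [
        f"{idx}. screen: {s.screen_context} | action: {s.action_description}"
        for idx, s in enumerate(summaries, start=1)
    ]
    prompt = (
        "Below are ordered summaries of a user's interactions in one session.\n"
        "State the user's overall intent in one sentence, using only information "
        "present in the summaries.\n" + "\n".join(lines)
    )
    # During training, target intent labels are refined so they never reference
    # details that are absent from the summaries, which keeps the model grounded.
    return call_finetuned_intent_model(prompt)
```

A full pipeline would simply map summarize_interaction over every interaction in the session and pass the resulting list to extract_session_intent.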

The researchers evaluated their “Decomposed-FT” (Decomposed Fine-Tuned) approach using small models such as Gemini 1.5 Flash 8B and Qwen2 VL 7B, comparing them against traditional methods such as Chain-of-Thought (CoT) prompting and end-to-end fine-tuning, as well as a large-model baseline, Gemini 1.5 Pro. The results were compelling: the decomposed approach significantly improved intent extraction performance for small models. On the Mind2Web dataset, for instance, the fine-tuned decomposed approach allowed Gemini 1.5 Flash 8B to surpass the larger Gemini 1.5 Pro model using CoT prompting.

An ablation study further highlighted the importance of each design choice, demonstrating that incorporating context from neighboring interactions, using structured summaries, fine-tuning the second stage, and refining training labels all contribute significantly to the method’s success. While the decomposed approach introduces a 2-3x increase in computational cost compared to simple small-model baselines, it remains substantially more efficient and faster than relying on large MLLMs, making it practical for on-device deployment. A latency-optimized variant was also shown to address potential real-time application concerns.

This research marks a significant step towards developing more capable and privacy-preserving AI agents that can run directly on user devices. By enabling small models to achieve superior intent understanding, this method paves the way for enhanced personalization, improved work efficiency, and better recall of past activities, all while keeping sensitive user data private. For more details, see the full paper.

Karthik Mehta (https://blogs.edgentiq.com)
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
