TLDR: This research introduces a novel “projection merge” technique for enabling compositional multi-tasking (like summarizing and translating simultaneously) in Large Language Models (LLMs) directly on mobile devices. By adding a small, learnable layer on top of existing task-specific adapters, the method achieves efficient integration and strong performance with minimal computational overhead. The team developed an Android app to demonstrate its practical viability, highlighting benefits like enhanced privacy and speed for real-world applications, such as cross-lingual conversation summarization.
Large Language Models (LLMs) have transformed how we interact with AI, generating content across text, images, and videos. While many powerful AI applications rely on remote servers, there’s a growing interest in bringing these capabilities directly to our devices, like smartphones. This shift offers significant advantages, especially enhanced privacy, as sensitive data remains securely on your device without being sent over networks.
One of the exciting frontiers in on-device AI is “compositional multi-tasking.” Imagine needing to summarize a long conversation and then translate that summary into another language, all at once. Standard approaches often struggle with such complex, simultaneous tasks. They might require extensive retraining or processing tasks one after another, which can be slow and resource-intensive.
A Novel Approach for On-Device Multi-tasking
Researchers have introduced a new method designed specifically for these compositional multi-tasking scenarios, focusing on summarization and translation. Adapters, such as those produced by Low-Rank Adaptation (LoRA), are an efficient way to fine-tune large language models for specific tasks without modifying the entire model. The new technique, called “projection merge,” adds a small, learnable projection layer on top of existing summarization and translation adapters. This projection layer acts as a bridge, allowing the combined adapters to work together effectively.
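To make the idea concrete, here is a minimal PyTorch-style sketch of one way such a projection merge could be wired up. The class names, the rank-space mixing, and all dimensions are illustrative assumptions, not the authors’ implementation:

```python
import torch
import torch.nn as nn

class LoRAAdapter(nn.Module):
    """A standard low-rank adapter: a down-projection A and an up-projection B."""
    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.A = nn.Linear(dim, rank, bias=False)  # d -> r
        self.B = nn.Linear(rank, dim, bias=False)  # r -> d

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.B(self.A(x))

class ProjectionMerge(nn.Module):
    """Hypothetical sketch: keep two pre-trained task adapters frozen and learn
    only a small projection that mixes their low-rank codes for the combined task."""
    def __init__(self, summ: LoRAAdapter, trans: LoRAAdapter, rank: int = 8):
        super().__init__()
        self.summ, self.trans = summ, trans
        for p in list(summ.parameters()) + list(trans.parameters()):
            p.requires_grad = False  # the task adapters stay fixed
        self.proj = nn.Linear(2 * rank, 2 * rank, bias=False)  # the only new weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Down-project with both frozen adapters, mix the codes, then up-project.
        codes = torch.cat([self.summ.A(x), self.trans.A(x)], dim=-1)
        c_summ, c_trans = self.proj(codes).chunk(2, dim=-1)
        return x + self.summ.B(c_summ) + self.trans.B(c_trans)

# Usage: a hidden state passes through the merged adapters as a residual update.
merge = ProjectionMerge(LoRAAdapter(4096), LoRAAdapter(4096))
out = merge(torch.randn(1, 4096))
```

Because only `proj` receives gradients, training for the combined task touches a very small number of weights.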
The key benefit of this design is its efficiency. Compared to alternative strategies that might demand extensive retraining or sequential processing, the projection merge significantly reduces computational overhead. This means your device can handle complex tasks like generating a translated summary from a long conversation much faster and with fewer resources.
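As a rough back-of-the-envelope illustration of that overhead (the dimensions here are assumptions for illustration, not figures from the paper): with hidden size d = 4096 and LoRA rank r = 8, a fresh adapter for the combined task needs 2·d·r weights per adapted layer, while a small mixing projection of the kind sketched above needs only (2r)²:

```python
d, r = 4096, 8                   # assumed hidden size and LoRA rank
new_adapter = 2 * d * r          # a fresh A (d x r) + B (r x d) per adapted layer
projection = (2 * r) ** 2        # one (2r x 2r) mixing matrix per adapted layer
print(projection / new_adapter)  # 0.0039... -> well under 1% of a new adapter
```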
Building an On-Device System
To demonstrate the practical viability of their method, the team developed an Android application capable of executing these compositional tasks seamlessly on a smartphone. This fully on-device system ensures that all computations run locally, further enhancing user privacy and reducing operational costs for service providers.
The application’s architecture includes a user interface, an LLM communication endpoint, an inference API, and components for LLM setup and adapter handling. Developing such a system for mobile devices presented unique challenges. For instance, integrating adapters and loading models efficiently required modifications to existing libraries. Memory management was another hurdle, addressed by moving heavy processing to a separate thread so the user interface stays responsive.
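The app itself is an Android application, but the underlying pattern, keeping the UI thread free while model loading and generation run on a worker, is language-agnostic. Here is a minimal sketch in Python, where `load_model` and `generate` are hypothetical stand-ins for the app’s real setup and inference APIs:

```python
import queue
import threading
import time

# Hypothetical stand-ins for the app's model-setup and inference components.
def load_model(name: str) -> str:
    time.sleep(1.0)  # simulate an expensive model load
    return name

def generate(model: str, prompt: str) -> str:
    time.sleep(0.5)  # simulate on-device generation
    return f"[{model}] translated summary of: {prompt}"

requests_q = queue.Queue()
results_q = queue.Queue()

def inference_worker() -> None:
    # All heavy lifting happens here, off the UI thread.
    model = load_model("base-llm+adapters")
    while True:
        prompt = requests_q.get()  # block until the UI submits a request
        results_q.put(generate(model, prompt))

threading.Thread(target=inference_worker, daemon=True).start()
requests_q.put("long group-chat transcript ...")
print(results_q.get())  # the UI thread stays free until a result is ready
```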
Performance and Practical Benefits
Experimental results show that the solution is both accurate and fast, in cloud-based as well as on-device implementations. The projection merge approach achieved comparable, and in some cases better, performance than stronger but less efficient baselines. Crucially, it adds only a tiny fraction of the parameters and storage that training a completely new adapter for the combined task would require.
For example, in tests on a Samsung Galaxy S23 Ultra, the projection merge method produced translated summaries in about 24 seconds, outperforming the other methods. While this may still feel long for some interactive use cases, it represents a significant step forward for fully on-device AI. The modular design also allows easy extension to additional languages and to other compositional tasks, such as generating reply suggestions combined with translation or tone adjustment.
This research highlights the benefits of the framework for real-world applications that demand fast operation under tight resource constraints. It is particularly valuable for users engaging with foreign-language content, such as travelers participating in local chat groups, who can easily see summaries of conversations in their own language. You can read the full research paper here.
Also Read:
- K-Merge: Smarter Adapter Management for On-Device Language Models
- Reversible Model Merging: Preserving Performance in Low-Rank Compressed Models
Future Outlook
While the current implementation successfully demonstrates the feasibility of on-device compositional multi-tasking, the researchers acknowledge areas for further optimization. These include exploring more aggressive quantization techniques (to reduce model size and speed up inference) and integrating LLMs directly into mobile operating systems for even greater efficiency. Despite these ongoing challenges, this work paves the way for more private, efficient, and powerful AI experiences directly on our personal devices.


