TLDR: Apple has introduced two new multilingual, multimodal foundation language models that power Apple Intelligence features: a compact on-device model optimized for Apple silicon and a scalable server model built on a novel Parallel-Track Mixture-of-Experts (PT-MoE) transformer. These models, trained on diverse and responsibly sourced data, support multiple languages, understand images, and execute tool calls. They are designed with architectural innovations for efficiency and quality, including KV-cache sharing and advanced quantization techniques. A new Swift-centric Foundation Models framework allows developers to integrate these capabilities, while Apple’s Responsible AI principles ensure user privacy and safety.
Apple has unveiled the foundational language models powering its new Apple Intelligence features, marking a significant step in integrating generative AI across its devices and services. This initiative, introduced at the 2025 Worldwide Developers Conference, aims to enhance user experience while prioritizing privacy. The core of this advancement lies in two distinct yet complementary models: a compact on-device model and a powerful server-based model.
Two Models for Diverse Needs
The first model is a roughly 3-billion-parameter on-device model, meticulously optimized for Apple silicon. Its design incorporates architectural innovations like KV-cache sharing, which significantly reduces memory usage and shortens time-to-first-token, the delay before the first token of a response appears. It also uses 2-bit quantization-aware training, a technique that compresses the model while preserving quality, making it highly efficient for local processing.
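The idea behind KV-cache sharing can be pictured with a toy sketch: some layers act as cache "owners," and the remaining layers read from an owner's cache instead of allocating their own, halving cache memory in this hypothetical layout. The layer mapping below is purely illustrative, not Apple's actual configuration.

```swift
import Foundation

// Toy illustration of cross-layer KV-cache sharing (hypothetical layout,
// not Apple's actual implementation). A layer either owns a key/value
// cache or reuses the cache of an earlier "anchor" layer, so only the
// anchor layers allocate storage.
struct KVCache {
    var keys: [[Double]] = []   // one entry per generated token
    var values: [[Double]] = []
}

let numLayers = 8
// Hypothetical mapping: each odd layer shares the cache of the even layer below it.
let cacheOwner: [Int] = (0..<numLayers).map { $0 - ($0 % 2) }

// Allocate caches only for layers that own one.
var caches: [Int: KVCache] = [:]
for layer in 0..<numLayers where cacheOwner[layer] == layer {
    caches[layer] = KVCache()
}

// Any layer resolves to its owner's cache when attending over past tokens.
func cache(for layer: Int) -> KVCache { caches[cacheOwner[layer]]! }

print("Layers: \(numLayers), caches allocated: \(caches.count)")  // half the cache memory
```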
The second is a scalable server model, built upon a novel Parallel-Track Mixture-of-Experts (PT-MoE) transformer. The architecture combines track parallelism, sparse computation, and interleaved global-local attention, allowing the server model to deliver high-quality results at competitive cost on Apple's Private Cloud Compute platform and ensuring robust performance for more complex tasks.
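The sparse-computation side of a Mixture-of-Experts layer can be sketched with a generic top-k router; this shows standard MoE routing, not the PT-MoE track design itself. Each token activates only its k highest-scoring experts, so per-token compute stays flat as the expert count grows.

```swift
import Foundation

// Minimal sketch of sparse MoE routing: pick the top-k experts per token
// and blend their outputs with softmax weights over the selected logits.
func topKExperts(routerLogits: [Double], k: Int) -> [(expert: Int, weight: Double)] {
    let top = routerLogits.enumerated()
        .sorted { $0.element > $1.element }
        .prefix(k)
    // Softmax over only the selected experts' logits.
    let maxLogit = top.map(\.element).max() ?? 0
    let exps = top.map { exp($0.element - maxLogit) }
    let sum = exps.reduce(0, +)
    return zip(top, exps).map { ($0.offset, $1 / sum) }
}

let logits = [0.2, 1.5, -0.3, 0.9]       // router scores for 4 experts
print(topKExperts(routerLogits: logits, k: 2))
// Only experts 1 and 3 run for this token; their outputs are mixed by weight.
```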
Understanding the World Through Data
Both models are trained on vast, diverse datasets drawn from responsible web crawling, licensed corpora, and high-quality synthetic data. Apple emphasizes that no private user data or interactions are used in training these foundation models, reinforcing its commitment to privacy. The web crawling strategy, powered by Applebot, focuses on high-quality, diverse content across numerous languages and locales, with careful attention to ethical practices like respecting robots.txt protocols.
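As a rough illustration of what respecting robots.txt means in practice, here is a heavily simplified disallow-rule check. Applebot's real parser is not public, and production crawlers also honor Allow rules, wildcards, and longest-match precedence; this sketch covers only the basic case.

```swift
import Foundation

// Simplified robots.txt courtesy check (illustrative, not Applebot's logic).
func isPathAllowed(robotsTxt: String, userAgent: String, path: String) -> Bool {
    var applies = false
    for line in robotsTxt.split(separator: "\n") {
        let trimmed = line.trimmingCharacters(in: .whitespaces)
        if trimmed.lowercased().hasPrefix("user-agent:") {
            let agent = trimmed.dropFirst("user-agent:".count)
                .trimmingCharacters(in: .whitespaces)
            applies = (agent == "*" || agent == userAgent)
        } else if applies, trimmed.lowercased().hasPrefix("disallow:") {
            let rule = trimmed.dropFirst("disallow:".count)
                .trimmingCharacters(in: .whitespaces)
            if !rule.isEmpty, path.hasPrefix(rule) { return false }
        }
    }
    return true
}

let robots = """
User-agent: Applebot
Disallow: /private/
"""
print(isPathAllowed(robotsTxt: robots, userAgent: "Applebot", path: "/private/page"))  // false
```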
To enable visual understanding, the models also incorporate extensive image data. This includes billions of image-text pairs sourced from web crawls, along with over 5 billion synthetically generated image-caption pairs that provide richer, more detailed descriptions. Specialized text-rich image data, such as PDFs, infographics, and charts, is also used to help the models read text embedded within images, which is crucial for features like adding events from a flyer to a calendar.
Training and Optimization for Peak Performance
The training process for these models is multi-staged and highly refined. The text tokenizer was expanded to support more languages, increasing its vocabulary size. The vision encoder undergoes a two-stage training process, first with contrastive pre-training and then joint training with an LLM decoder to align image features with the language model’s representation space.
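The contrastive pre-training stage can be sketched with a toy CLIP-style objective: matched image-text embedding pairs are pulled together and mismatched pairs pushed apart via cross-entropy over similarity logits (one direction shown here; CLIP-style training sums both directions). The embeddings and temperature below are illustrative; Apple's actual encoder and loss details are in the tech report.

```swift
import Foundation

// Toy CLIP-style contrastive objective over matched image/text pairs.
func dot(_ a: [Double], _ b: [Double]) -> Double { zip(a, b).map(*).reduce(0, +) }
func norm(_ a: [Double]) -> [Double] {
    let m = sqrt(dot(a, a)); return a.map { $0 / m }
}

// Row i of each array is a matched image/text pair that should align.
let imageEmb = [[1.0, 0.1], [0.1, 1.0]].map(norm)
let textEmb  = [[0.9, 0.2], [0.0, 1.1]].map(norm)
let temperature = 0.07

// Image-to-text cross-entropy over cosine-similarity logits.
var loss = 0.0
for i in imageEmb.indices {
    let logits = textEmb.map { dot(imageEmb[i], $0) / temperature }
    let maxL = logits.max()!
    let logSumExp = maxL + log(logits.map { exp($0 - maxL) }.reduce(0, +))
    loss += logSumExp - logits[i]   // -log p(correct caption | image i)
}
print("contrastive loss:", loss / Double(imageEmb.count))
```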
Post-training involves Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF). SFT combines human-written demonstrations and synthetic data, focusing on areas like general knowledge, reasoning, text-rich image understanding, multilingual Optical Character Recognition (OCR), and visual grounding. RLHF, using a distributed asynchronous infrastructure, further refines the models based on diverse reward signals, leading to significant improvements in human preference evaluations and reasoning capabilities.
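One way to picture "diverse reward signals" is a scalar reward assembled from several judges. The sketch below is purely hypothetical; the signal names, weights, and gating rule are made up for illustration and are not taken from Apple's report.

```swift
import Foundation

// Hypothetical blend of several reward signals into one scalar for RL updates.
struct RewardSignals {
    var helpfulness: Double          // e.g., from a preference reward model
    var instructionFollowing: Double // e.g., from a rule-based checker
    var safety: Double               // e.g., from a safety classifier
}

func combinedReward(_ r: RewardSignals) -> Double {
    // Safety acts as a gate: unsafe responses are penalized regardless
    // of how helpful they are. Weights here are arbitrary.
    let base = 0.6 * r.helpfulness + 0.4 * r.instructionFollowing
    return r.safety < 0.5 ? base - 1.0 : base
}

let sample = RewardSignals(helpfulness: 0.8, instructionFollowing: 0.9, safety: 0.95)
print(combinedReward(sample))  // 0.84
```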
To ensure efficiency without compromising quality, Apple has implemented advanced optimization techniques. The on-device model uses Quantization-Aware Training (QAT) to compress its weights to 2 bits, while the server model employs Adaptive Scalable Texture Compression (ASTC) for 3.56 bits-per-weight compression. To recover any quality loss from this compression, Low-Rank Adaptation (LoRA) adapters are applied and fine-tuned, allowing the models to maintain high performance.
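What mapping weights to 2 bits looks like can be sketched with simple symmetric round-to-nearest using a per-group scale. This is a minimal sketch of the quantization step only; Apple's QAT scheme additionally learns through the quantizer during training and is more involved.

```swift
import Foundation

// 2-bit quantization sketch: each weight becomes an integer code in
// [-2, 1] (4 levels) plus a shared per-group scale.
func quantize2Bit(_ weights: [Double]) -> (codes: [Int], scale: Double) {
    let maxAbs = weights.map(abs).max() ?? 1
    let scale = maxAbs / 2.0
    let codes = weights.map { w in
        min(1, max(-2, Int((w / scale).rounded())))
    }
    return (codes, scale)
}

func dequantize(_ codes: [Int], scale: Double) -> [Double] {
    codes.map { Double($0) * scale }
}

let w = [0.31, -0.12, 0.05, -0.44]
let (codes, scale) = quantize2Bit(w)
print(codes, dequantize(codes, scale: scale))
// The gap between w and its dequantized version is the error that QAT
// (during training) and LoRA adapters (after compression) help absorb.
```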
Empowering Developers with a New Framework
A new Swift-centric Foundation Models framework provides developers with direct access to the on-device language foundation model. This framework simplifies the integration of generative AI capabilities through features like guided generation, which allows developers to directly generate rich Swift data structures, and constrained tool calling, which ensures the structural correctness of tool invocations. The framework also offers a stateful session type, LanguageModelSession, designed to optimize performance and context management.
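Based on Apple's published API for the framework, a minimal usage sketch looks like the following; availability and exact signatures depend on the OS release, and the TripSuggestion type is a made-up example.

```swift
import FoundationModels

// Guided generation: the framework fills in a typed Swift value directly.
// TripSuggestion is an example type; @Generable and @Guide come from the
// Foundation Models framework.
@Generable
struct TripSuggestion {
    @Guide(description: "A city suited to a weekend visit")
    var city: String
    var activities: [String]
}

// LanguageModelSession is stateful: it carries instructions and prior
// turns across requests. This must run in an async context on a
// supported OS release.
let session = LanguageModelSession(instructions: "You are a travel assistant.")
let response = try await session.respond(
    to: "Suggest a weekend trip from Cupertino",
    generating: TripSuggestion.self
)
print(response.content.city, response.content.activities)
```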
Rigorous Evaluation and Responsible AI
Apple conducted extensive evaluations, testing its models on public benchmarks such as MMLU, MMMLU, and MGSM and running human evaluations across various language and reasoning capabilities. The on-device model performs favorably against comparably sized models, while the server model shows strong performance, though it lags behind much larger proprietary models.
Central to Apple’s approach is its commitment to Responsible AI. This is guided by principles such as empowering users, representing global users, designing with care, and protecting privacy. Safeguards like content filtering, locale-specific evaluation, and a comprehensive safety taxonomy are integrated throughout the development process. User feedback mechanisms are also in place to continuously improve the models and features.
These advancements in Apple’s foundation models are set to unlock a wide range of helpful features across Apple’s software platforms, making powerful AI capabilities accessible to users globally in many languages. For more technical details, you can refer to the full research paper: Apple Intelligence Foundation Language Models Tech Report 2025.