Introducing BlueLM-2.5-3B: A Powerful Multimodal Language Model for On-Device Use

TLDR: BlueLM-2.5-3B is a compact, 2.9B-parameter multimodal AI model from vivo AI Lab, designed for efficient edge device deployment. It uniquely supports “thinking” and “non-thinking” modes with controllable token budgets. Developed with advanced data curation, hybrid reinforcement learning, and optimized infrastructure, it achieves strong multimodal and text performance comparable to larger models, demonstrating high data efficiency.

The vivo AI Lab has unveiled BlueLM-2.5-3B, a new compact and unified dense Multimodal Large Language Model (MLLM) specifically designed for efficient deployment on edge devices like mobile phones, vehicles, and robots. This model stands out as the first 3-billion-parameter scale MLLM to support both “thinking” and “non-thinking” modes, offering explicit control over the thinking token budget, which is crucial for managing latency on resource-constrained devices.

BlueLM-2.5-3B was developed using a sophisticated approach that includes diversified data curation, strategic data resampling, a hybrid heterogeneous reinforcement learning framework, and a high-performance training infrastructure. Despite its compact size of only 2.9 billion parameters, the model achieves impressive multimodal capabilities while maintaining strong performance in pure-text tasks.

In its “thinking mode,” BlueLM-2.5-3B performs comparably to the larger Qwen3-4B on text-only benchmarks. On multimodal evaluations, it trails the much larger Kimi-VL-A3B-16B by only about 5% on average. In “non-thinking mode,” it surpasses Qwen2.5-VL-3B on most multimodal benchmarks. The model is also notably data-efficient: it reaches this performance with significantly less total training data than Qwen2.5-VL-3B and Qwen3-4B.

Model Architecture and Training Innovations

The architecture of BlueLM-2.5-3B consists of three main components: a Vision Transformer (ViT) for processing visual data, an adapter module to align visual and language representations, and a dense Large Language Model (LLM) backbone. The model is designed for compactness, featuring 22% fewer parameters than comparable models like Qwen2.5-VL-3B.
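To put the size comparison in concrete terms, the short calculation below works backwards from the two figures quoted in this article (2.9 billion parameters, 22% fewer than the comparison model) to the baseline size they imply. The function name is hypothetical, and the result is simple arithmetic on the article's numbers, not an official parameter count for any model.

```python
def implied_baseline_params(model_params_b: float, fraction_fewer: float) -> float:
    """Return the baseline size (in billions) implied by a 'fraction fewer' claim.

    If a model has P parameters and that is `fraction_fewer` below a baseline B,
    then P = B * (1 - fraction_fewer), so B = P / (1 - fraction_fewer).
    """
    if not 0 <= fraction_fewer < 1:
        raise ValueError("fraction_fewer must be in [0, 1)")
    return model_params_b / (1.0 - fraction_fewer)


# Figures quoted in the article: 2.9B parameters, 22% fewer than the baseline.
baseline_b = implied_baseline_params(2.9, 0.22)
print(f"Implied baseline size: {baseline_b:.2f}B parameters")
```

Under these figures the implied baseline is roughly 3.7 billion parameters, which is consistent with a "3B-class" model whose vision encoder and adapter add to the language backbone.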

Training proceeds in several phases. It starts with pure-text pre-training to initialize the LLM, followed by multimodal pre-training. The multimodal phase comprises a joint general pre-training stage, a reasoning-enhanced stage that incorporates a substantial amount of synthesized image-text reasoning data, and a joint fast-decay and long-context activation stage that improves long-context capabilities. Post-training includes supervised fine-tuning with a “thinking mode fusion” mechanism, which lets the model adapt its reasoning to the user's query. A special token, [|BlueThink|], acts as a control switch for activating the long-thinking mode; the switch has proven highly stable, with a failure rate of less than 1 part per million (PPM).
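As a rough illustration of how a control token and a thinking budget might be used by an application, here is a minimal sketch. Everything in it is an assumption made for illustration: the token is spelled without internal spaces, it is placed at the end of the user query, and the helper names `build_prompt` and `cap_thinking` are hypothetical. The released model's chat template and tokenizer may work differently.

```python
from typing import List

# Assumed spelling and placement of the control token; check the released
# tokenizer and chat template before relying on either.
THINK_TOKEN = "[|BlueThink|]"


def build_prompt(user_query: str, thinking: bool) -> str:
    """Append the control token when long-thinking mode is requested."""
    return f"{user_query}{THINK_TOKEN}" if thinking else user_query


def cap_thinking(reasoning_tokens: List[str], budget: int) -> List[str]:
    """Keep at most `budget` reasoning tokens (a simple hard cut-off)."""
    if budget < 0:
        raise ValueError("budget must be non-negative")
    return reasoning_tokens[:budget]


print(build_prompt("What is 17 * 23?", thinking=True))
print(cap_thinking(["step1", "step2", "step3", "step4"], budget=2))
```

A hard cut-off like this is the simplest way to bound latency on a phone or in a vehicle; a production system would more likely stop generation at the budget and then prompt the model to produce its final answer.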

Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards were also integrated to refine the model’s responses, particularly for open-ended questions and reasoning problems. An innovative length penalty mechanism, termed “Group Overlong,” helps reduce redundant reasoning steps, making the model more efficient for edge deployment by curbing “overthinking” behavior.
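The article does not spell out how “Group Overlong” is computed, so the following is only a sketch of one plausible group-relative length penalty, not the authors' formula. It assumes that several responses are sampled for the same prompt, that a length threshold is derived from the shortest correct response in that group, and that responses beyond the threshold receive a penalty growing linearly up to a cap. The function name and the `slack` and `max_penalty` parameters are hypothetical.

```python
from typing import List


def group_overlong_penalties(
    lengths: List[int],
    correct: List[bool],
    slack: float = 1.5,        # assumed tolerance over the shortest correct response
    max_penalty: float = 0.5,  # assumed cap on the penalty
) -> List[float]:
    """Return one length penalty per response in a sampled group.

    Responses no longer than `slack` times the shortest correct response
    are not penalized. Longer ones are penalized linearly, reaching
    `max_penalty` at the longest response in the group.
    """
    correct_lengths = [n for n, ok in zip(lengths, correct) if ok]
    if not correct_lengths:
        # No correct reference in the group: apply no length penalty.
        return [0.0] * len(lengths)

    threshold = min(correct_lengths) * slack
    longest = max(lengths)
    penalties = []
    for n in lengths:
        if n <= threshold:
            penalties.append(0.0)
        else:
            # n > threshold implies longest > threshold, so no division by zero.
            frac = (n - threshold) / (longest - threshold)
            penalties.append(max_penalty * frac)
    return penalties


print(group_overlong_penalties([100, 120, 400, 250], [True, True, True, False]))
```

In an RL loop, a penalty of this kind would be subtracted from each response's reward, so that among equally correct answers the shorter reasoning chains score higher.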

Data and Infrastructure

The model’s robust performance is underpinned by a meticulously constructed data pipeline. The pre-training data has been significantly expanded, with pure text accounting for 40% of the multimodal pre-training data. The multimodal portion spans diverse categories such as image captioning, OCR, Visual Question Answering (VQA), and Graphical User Interface (GUI) data. A large-scale corpus built specifically for reasoning tasks, including synthetic data, further enhances the model’s capabilities.

The training infrastructure boasts an on-premises GPU cluster with thousands of high-performance GPUs, interconnected by a high-efficiency InfiniBand network. An in-house training framework based on Megatron-LM was developed, focusing on efficiency, scalability, stability, and observability. Innovations like multi-sample concatenation and context parallelism for long-sequence training significantly boost throughput. The RL infrastructure also saw customized development and performance optimization, including one-step asynchronous RL and inference engine load balancing.
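Multi-sample concatenation is commonly implemented as sequence packing: several short training samples are placed into one fixed-length sequence so that less compute is spent on padding. The sketch below shows a generic first-fit-decreasing packer. It is not taken from vivo's framework; the function name `pack_samples` and the example lengths are hypothetical.

```python
from typing import List


def pack_samples(lengths: List[int], max_len: int) -> List[List[int]]:
    """Group sample indices into bins whose total length is at most max_len.

    Uses first-fit decreasing: visit samples from longest to shortest and
    place each one in the first bin that still has room for it.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins: List[List[int]] = []  # sample indices per packed sequence
    free: List[int] = []        # remaining capacity per packed sequence
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"sample {i} (length {n}) exceeds max_len {max_len}")
        for b, room in enumerate(free):
            if n <= room:
                bins[b].append(i)
                free[b] -= n
                break
        else:
            # No existing bin has room: open a new packed sequence.
            bins.append([i])
            free.append(max_len - n)
    return bins


# Five samples packed into sequences of at most 8 tokens.
print(pack_samples([6, 3, 5, 2, 4], max_len=8))  # [[0, 3], [2, 1], [4]]
```

Here five samples fit into three sequences instead of five padded ones. In real training, packed samples also need attention masks and position indices that keep each sample from attending to its neighbors.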

The comprehensive evaluation across over 20 benchmark datasets confirms BlueLM-2.5-3B’s strong performance in both multimodal and pure-text reasoning, visual perception, and GUI agent grounding. This work by vivo AI Lab represents a significant step towards high-performance, on-device MLLMs. For more technical details, you can refer to the full research paper: BlueLM-2.5-3B Technical Report.
