TL;DR: This paper introduces highly optimized 1-bit and 2-bit microkernels for modern CPUs, integrated into the PyTorch-TPP framework. This new runtime significantly accelerates ultra-low-bit LLM inference on AI PCs and edge devices, outperforming existing solutions like bitnet.cpp by up to 2.2x and achieving up to 7x speedup over 16-bit models. The research demonstrates CPU capabilities approaching GPU performance for these models, paving the way for more efficient LLM deployment on resource-constrained hardware.
The world of Artificial Intelligence is rapidly evolving, with Large Language Models (LLMs) becoming increasingly powerful. However, running these complex models efficiently, especially on personal computers and edge devices, has been a significant challenge. Traditional LLMs often require substantial memory and computational power, limiting their widespread deployment.
Recent breakthroughs in quantization have led to the development of “ultra-low-bit” LLMs, which can operate at 1-bit, 1.58-bit, or 2-bit precision. These models promise to match the performance of their larger, full-precision counterparts while being far more efficient in terms of latency, memory usage, throughput, and energy consumption. This advancement is particularly exciting for AI PCs and other resource-constrained environments.
Despite these promising developments, the software runtimes used to deploy these ultra-low-bit models have not kept pace. Existing state-of-the-art runtimes, such as bitnet.cpp, designed for ternary LLMs, have shown limitations. Preliminary analysis revealed that 2-bit inference with bitnet.cpp could even be slower than 4-bit inference on CPUs, indicating a significant gap in optimization.
Intel researchers Evangelos Georganas, Dhiraj Kalamkar, and Alexander Heinecke address this challenge in their paper "Pushing the Envelope of LLM Inference on AI-PC". Their work takes a "bottom-up" strategy: first optimize the lowest-level compute kernels for modern CPUs, then build the inference runtime around them.
Optimized Microkernels for CPUs
The core of their innovation lies in designing and implementing highly optimized 1-bit and 2-bit microkernels specifically for modern CPUs. These microkernels are the fundamental building blocks for performing matrix multiplications, a critical operation in LLM inference. They focused on achieving peak computational efficiency across various CPU platforms.
A key aspect of their design for 2-bit weights is a new tensor layout called VNNI4-interleaved. This clever arrangement of data, combined with an “up-convert and compute” technique, allows for efficient processing using hardware-accelerated instructions available on modern CPUs. For 1-bit weights, they also developed a specialized microkernel that efficiently converts 1-bit values to 8-bit for computation, ensuring optimal performance.
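To make the "up-convert and compute" idea concrete, here is a minimal sketch in Python: four 2-bit weight codes are packed into each byte, and at compute time each code is widened to a signed int8 value before multiplying against int8 activations. The packing order, the `zero_point` offset, and the function names are illustrative assumptions for clarity; they are not the paper's exact VNNI4-interleaved layout, which is arranged to feed hardware dot-product instructions directly.

```python
def pack_2bit(codes):
    """Pack 2-bit weight codes (0..3) into bytes, four codes per byte.

    Illustrative packing only: real layouts (e.g. VNNI4-interleaved)
    reorder data to match the CPU's dot-product instruction lanes.
    """
    assert len(codes) % 4 == 0
    packed = bytearray()
    for i in range(0, len(codes), 4):
        b = 0
        for j in range(4):
            b |= (codes[i + j] & 0b11) << (2 * j)
        packed.append(b)
    return bytes(packed)


def upconvert_and_dot(packed, activations, zero_point=2):
    """Up-convert each 2-bit code to int8 and accumulate an int8 dot product."""
    acc = 0
    for i, byte in enumerate(packed):
        for j in range(4):
            # "up-convert": widen the 2-bit code to a signed int8 weight
            w = ((byte >> (2 * j)) & 0b11) - zero_point
            acc += w * activations[4 * i + j]
    return acc
```

A real microkernel performs the widening and multiply-accumulate on whole vector registers at once rather than one value at a time; the point here is only the data flow: packed low-bit weights in memory, int8 arithmetic in registers.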
Integration and Performance Gains
These newly developed microkernels were integrated into PyTorch-TPP, a state-of-the-art LLM inference framework. The results are impressive. Their optimized runtime significantly outperforms the current state-of-the-art runtime, bitnet.cpp, achieving speedups of up to 2.2 times. Furthermore, compared to traditional 16-bit model inference, their solution delivers up to 7 times faster performance.
The researchers benchmarked their solution on various Intel Core Ultra CPUs, including ARL, ARLH, and LNL, demonstrating consistent performance improvements across different configurations. They also analyzed the performance using a “roofline model,” which helps understand the theoretical limits of performance and confirms that their microkernels operate very close to the maximum possible efficiency, especially for 2-bit inference.
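The roofline model itself is simple enough to state in a few lines: a kernel's attainable throughput is capped by either the machine's peak compute rate or its memory bandwidth multiplied by the kernel's arithmetic intensity (FLOPs per byte moved). The sketch below uses placeholder numbers, not measured values from the paper, but it shows why low-bit weights help: shrinking weights from 16 bits to 2 bits moves 8x fewer bytes for the same FLOPs, raising arithmetic intensity and therefore the bandwidth-bound ceiling.

```python
def roofline_bound(peak_gflops, bandwidth_gbs, flops, bytes_moved):
    """Attainable GFLOP/s under the roofline model: the kernel is limited by
    whichever is lower, peak compute or bandwidth * arithmetic intensity."""
    intensity = flops / bytes_moved  # FLOPs per byte of memory traffic
    return min(peak_gflops, bandwidth_gbs * intensity)


# Hypothetical machine: 1000 GFLOP/s peak, 100 GB/s memory bandwidth.
# A matrix-vector product does ~2 FLOPs per weight; bytes per weight differ:
bound_16bit = roofline_bound(1000, 100, flops=2, bytes_moved=2.0)    # 2 B/weight
bound_2bit = roofline_bound(1000, 100, flops=2, bytes_moved=0.25)    # 0.25 B/weight
```

With these illustrative numbers the 16-bit kernel is bandwidth-bound at 100 GFLOP/s while the 2-bit kernel's ceiling rises to 800 GFLOP/s, which is consistent with the up-to-7x measured speedup over 16-bit inference reported above.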
Approaching GPU-Level Performance on CPUs
Perhaps the most striking finding is the comparison with GPU performance. Benchmarked against 2-bit inference on an NVIDIA A100 GPU, the optimized CPU runtime came within a factor of 2.3 to 3 of the A100, even though the A100 has 17 to 20 times more memory bandwidth. This demonstrates that with proper microkernel design and runtime support, ultra-low-bit inference on CPUs can indeed approach GPU-level performance, making AI PCs a powerful platform for deploying LLMs.
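A back-of-the-envelope calculation from the numbers in the text makes the point sharper: if the GPU has 17 to 20 times the bandwidth but is only 2.3 to 3 times faster, the CPU kernels are extracting several times more performance per unit of memory bandwidth. The helper below is a hypothetical name introduced just for this arithmetic.

```python
def bandwidth_efficiency_ratio(gpu_speedup, gpu_bw_ratio):
    """How much more work the CPU does per unit of memory bandwidth than the GPU,
    given the GPU's speedup and its bandwidth advantage."""
    return gpu_bw_ratio / gpu_speedup


# Using the figures quoted above: even in the GPU-favorable case (3x faster,
# 17x the bandwidth) the CPU is still over 5x more bandwidth-efficient.
worst_case = bandwidth_efficiency_ratio(3.0, 17.0)
best_case = bandwidth_efficiency_ratio(2.3, 20.0)
```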
This work marks a significant step forward for LLM inference on AI PCs and edge devices, paving the way for more efficient and widespread deployment of ultra-low-bit LLM models. The authors also plan to extend their work to ARM platforms in the future, further broadening the impact of their optimizations.