TL;DR: A new research paper introduces “Dynamic Grouping,” a novel method for binary quantization of Large Language Models (LLMs). It uses adaptive grouping strategies to compress model weights to an average of 1.007 bits per weight while maintaining high model quality, outperforming previous 1-bit methods and competing with 4-bit quantization. The process is highly efficient, enabling faster and more memory-friendly LLM deployment.
Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of Natural Language Processing (NLP) tasks. However, their immense size and complexity demand substantial memory and computational resources, posing significant challenges for deployment, especially on resource-constrained devices like mobile phones and laptops.
To address this, researchers are continuously developing model compression methods. Among these, quantization stands out as a particularly promising approach: it reduces the numerical precision of a model’s weights, thereby decreasing memory requirements and accelerating inference. While 4-bit quantization has achieved considerable success in compressing LLMs with minimal performance degradation, the ever-increasing scale of these models calls for even more aggressive compression techniques, such as binary quantization.
Binary quantization is an extreme form of compression that reduces model weights from 16-bit Brain Float (BF16) to a 1-bit representation (typically -1 or 1, multiplied by a shared scale). Historically, achieving satisfactory performance at such aggressive 1-bit precision has been a significant hurdle, often leading to a notable decline in model quality compared to more conservative methods.
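To make the quantization step concrete, here is a minimal NumPy sketch (an illustration, not code from the paper) of standard scaled binarization: each group of weights is replaced by alpha * sign(w), where the scale alpha = mean(|w|) is the least-squares-optimal choice, and the leftover squared error is the per-group quantization loss that grouping methods aim to minimize.

```python
import numpy as np

def binarize(w: np.ndarray) -> tuple[np.ndarray, float]:
    # The scale minimizing ||w - alpha * sign(w)||^2 is mean(|w|).
    alpha = float(np.abs(w).mean())
    return alpha * np.sign(w), alpha

def quant_loss(w: np.ndarray) -> float:
    # Closed-form error of the optimal binarization above:
    # sum(w^2) - (sum(|w|))^2 / n.
    return float((w ** 2).sum() - np.abs(w).sum() ** 2 / w.size)
```

For example, the group [0.4, -0.6] binarizes to [0.5, -0.5] with a loss of 0.02; how the weights are partitioned into groups determines how small these losses can be made.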
A new research paper, titled “Binary Quantization For LLMs Through Dynamic Grouping,” introduces a novel optimization objective and three algorithms designed to overcome these limitations. The authors, Xinzhe Zheng, Zhen-Qun Yang, Haoran Xie, S. Joe Qin, Arlene Chen, and Fangzhen Lin, propose a method that enhances blocked quantization by dynamically identifying optimal unstructured sub-matrices through adaptive grouping strategies. This approach moves beyond the uniform blocking techniques and computationally intensive salient-weight identification methods (such as Hessian calculations) relied on in previous research.
The Core Innovation: Dynamic Grouping
The central idea behind this research is to minimize the total quantization loss across all unstructured sub-matrices by finding the grouping that is optimal under a predefined quantization-loss measure. The paper introduces three distinct algorithms to realize this objective:
- Dynamic Grouping: This algorithm employs classic dynamic programming to systematically explore all possible groupings, guaranteeing an optimal solution. While theoretically sound, its computational complexity makes it impractically slow for the large matrices found in modern LLMs.
- Greedy Grouping: An approximation of Dynamic Grouping, this algorithm uses a heuristic strategy to iteratively merge groups. It trades the guaranteed optimum for much better computational efficiency while maintaining reasonable solution quality, making it more feasible for practical applications.
- Windowed Greedy Merging (WGM): An even more efficient approximation, designed to strike a strong balance between quantization performance and speed. Instead of starting with individual elements, it begins with initial groups of a fixed window size, further accelerating the merging process. The authors found WGM to be the most practical solution for contemporary LLM architectures; a minimal sketch of the merging loop follows this list.
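Reusing the hypothetical `quant_loss` helper from the earlier snippet, the merging loop at the heart of WGM might look like the sketch below. Merging two groups can never decrease the binarization loss (by the Cauchy-Schwarz inequality), but each merge removes one stored per-group scale and thus lowers the average bits per weight; the loop therefore takes the cheapest adjacent merge until a loss budget is spent. The `window` and `loss_budget` parameters, the merge criterion, and the stopping rule are all assumptions for illustration, and a serious implementation would track candidate merges in a heap rather than rescanning each pass.

```python
def wgm_group(w: np.ndarray, window: int = 16,
              loss_budget: float = 1e-2) -> list[np.ndarray]:
    # Start from fixed-size windows instead of single elements.
    groups = [w[i:i + window] for i in range(0, len(w), window)]
    losses = [quant_loss(g) for g in groups]
    total = sum(losses)
    while len(groups) > 1:
        # Loss increase for merging each adjacent pair (always >= 0).
        deltas = [
            quant_loss(np.concatenate((groups[i], groups[i + 1])))
            - losses[i] - losses[i + 1]
            for i in range(len(groups) - 1)
        ]
        i = int(np.argmin(deltas))
        if total + deltas[i] > loss_budget:
            break  # the cheapest merge would blow the loss budget
        merged = np.concatenate((groups[i], groups[i + 1]))
        groups[i:i + 2] = [merged]
        losses[i:i + 2] = [quant_loss(merged)]
        total += deltas[i]
    return groups
```

Fewer groups means fewer scales to store, which is how the average bit length can approach 1 bit per weight.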
Impressive Experimental Results
The experimental results presented in the paper are highly compelling. The Windowed Greedy Merging-LLM (WGM-LLM) approach achieved an average bit length of just 1.007 bits, demonstrating an exceptional level of compression. Despite this aggressive quantization, the method maintained high model quality. For example, their quantized LLaMA 3.2 3B model attained a perplexity of 8.23, remarkably close to the original full-precision model’s 7.81. This significantly surpasses previous state-of-the-art binary LLM methods, which often resulted in much higher perplexity values (e.g., BiLLM with a perplexity of 123.90 for a similar model).
Furthermore, WGM-LLM proved competitive with leading 4-bit quantization approaches, such as GPTQ, in both performance and efficiency. On several commonsense QA tasks, WGM-LLM even outperformed GPTQ for some models, showcasing its ability to balance extreme compression with high accuracy.
The efficiency of the compression process itself is another highlight: quantizing the full LLaMA 3.2 3B weights took only 14 seconds on a single CPU core, with the entire process completing in under 100 minutes. The method is also embarrassingly parallel, suggesting even faster quantization times with adequate CPU resources.
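That parallel structure is easy to see: every weight matrix, and indeed every row, can be grouped and binarized with no knowledge of the others. Here is a sketch using Python's standard process pool, built on the hypothetical `wgm_group` and `binarize` helpers above:

```python
from concurrent.futures import ProcessPoolExecutor

def quantize_row(row):
    # Group one row, then binarize each group independently.
    return [binarize(g) for g in wgm_group(row)]

def quantize_matrix(W, workers: int = 8):
    # Rows share no state, so throughput scales with core count.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(quantize_row, list(W)))
```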
Future Outlook
While the current study relies primarily on simulations, owing to the absence of specialized hardware and kernels for 1-bit operations and arbitrary partitioning, this research marks a significant advance in binary quantization for LLMs. The authors acknowledge these limitations and call for tailored kernels and hardware to fully unlock the potential of binary LLMs, both in actual stored bit width and in inference acceleration.
This work pushes the boundaries of binary quantization, demonstrating its potential to make powerful LLMs more efficient and accessible for deployment on a wider range of constrained devices. The full research paper can be accessed here: Binary Quantization For LLMs Through Dynamic Grouping.