TLDR: any4 is a novel 4-bit weight quantization method for large language models (LLMs) that learns arbitrary numeric representations. It achieves superior accuracy compared to existing 4-bit techniques (int4, fp4, nf4) across various LLM families and sizes, without requiring pre-processing of weights or activations. The research also introduces tinygemm, a GPU matrix multiplication library optimized for low-latency LLM inference, which efficiently implements any4. A key innovation is the ability to calibrate any4 using a single, diverse data sample, significantly simplifying the process.
Large Language Models (LLMs) are powerful, but their size often makes them hard to run efficiently, especially on devices with limited memory or when fast inference is needed. A key technique to address this is quantization, which stores the model’s weights at lower precision so that the model takes less memory and runs faster. However, traditional 4-bit quantization methods often compromise accuracy or require complex pre-processing steps.
A new research paper introduces a novel solution called any4, a learned 4-bit weight quantization method designed specifically for LLMs. Unlike previous approaches, any4 can create arbitrary numeric representations without needing to pre-process the model’s weights or activations. This flexibility allows it to adapt more effectively to the unique characteristics of LLM weights.
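To make "learned arbitrary numeric representations" concrete, here is a minimal sketch in Python. It assumes that the 16 representable values for one row of weights are learned with plain, unweighted 1-D k-means (Lloyd's algorithm), starting from an evenly spaced, int4-like grid. The function names and the synthetic Gaussian weights are illustrative only, and the paper's actual training procedure may differ.

```python
import random

def learn_codebook(weights, n_levels=16, iters=25):
    """Learn n_levels representable values with 1-D k-means (an assumption, not the paper's exact recipe)."""
    lo, hi = min(weights), max(weights)
    # Start from an evenly spaced grid, which is what a uniform int4-like format would use.
    levels = [lo + (hi - lo) * i / (n_levels - 1) for i in range(n_levels)]
    for _ in range(iters):
        sums = [0.0] * n_levels
        counts = [0] * n_levels
        for w in weights:
            # Assign each weight to its nearest level.
            k = min(range(n_levels), key=lambda j: abs(w - levels[j]))
            sums[k] += w
            counts[k] += 1
        # Move each level to the mean of its assigned weights; keep empty levels in place.
        levels = [sums[k] / counts[k] if counts[k] else levels[k] for k in range(n_levels)]
    return sorted(levels)

def quantize(weights, levels):
    """Map each weight to the 4-bit index of its nearest level."""
    return [min(range(len(levels)), key=lambda j: abs(w - levels[j])) for w in weights]

def dequantize(codes, levels):
    """Recover approximate weights by looking up each index."""
    return [levels[c] for c in codes]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
row = [random.gauss(0.0, 1.0) for _ in range(2048)]  # synthetic stand-in for one weight row

uniform = [min(row) + (max(row) - min(row)) * i / 15 for i in range(16)]
learned = learn_codebook(row)

err_uniform = mse(row, dequantize(quantize(row, uniform), uniform))
err_learned = mse(row, dequantize(quantize(row, learned), learned))
print(f"uniform grid MSE: {err_uniform:.5f}, learned codebook MSE: {err_learned:.5f}")
```

Because the learned levels start from the uniform grid and each k-means step can only lower the reconstruction error, the learned codebook never does worse than the uniform one on the row it was fitted to.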
Superior Accuracy and Efficiency
The researchers evaluated any4 across a variety of LLMs, including Llama 2, Llama 3, Mistral, and Mixtral, and found that it consistently delivers higher accuracy compared to other common 4-bit numeric formats like int4, fp4, and nf4. What’s more, any4 achieves this without the need for additional pre-processing techniques, making it simpler to implement. It even competes favorably with more complex methods that do require such pre-processing, like AWQ and GPTQ.
The paper also explores lower bit widths, showing that the 3-bit (any3) and 2-bit (any2) variants of the method deliver competitive performance. A significant practical advantage of any4 is its calibration process: a single, carefully chosen, diverse data sample is enough, rather than the hundreds of samples that other quantization approaches typically require. This simplifies and speeds up the calibration step.
Introducing tinygemm for Faster Inference
To ensure efficient execution of any4 and other quantization methods, the researchers have open-sourced tinygemm, a GPU matrix multiplication library. This library is specifically optimized for low-latency LLM inference, particularly for small batch sizes (1 to 16) on Nvidia Ampere generation GPUs and newer. tinygemm implements any4 using a GPU-efficient lookup table (LUT) strategy, which helps maintain speed despite the custom numeric representations.
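The lookup-table idea can be sketched in a few lines of Python. This is not tinygemm's CUDA code: the packing order (low nibble first), the single per-row scale, and all names and values below are assumptions chosen for illustration. Each weight is stored as a 4-bit index, two per byte, and the real value is recovered by indexing a 16-entry table while accumulating the dot product.

```python
def pack_nibbles(codes):
    """Pack 4-bit codes (0..15) two per byte, low nibble first (hypothetical layout)."""
    padded = list(codes) + [0] * (len(codes) % 2)
    return bytes(padded[i] | (padded[i + 1] << 4) for i in range(0, len(padded), 2))

def unpack_nibble(packed, i):
    """Return the i-th 4-bit code from the packed byte string."""
    byte = packed[i // 2]
    return (byte >> 4) if i % 2 else (byte & 0x0F)

def lut_dot(packed, lut, scale, x):
    """Dot product of one quantized weight row with activations x, dequantizing via table lookup."""
    acc = 0.0
    for i, xi in enumerate(x):
        acc += lut[unpack_nibble(packed, i)] * xi
    return acc * scale

# An arbitrary, non-uniform 16-entry table (illustrative values, not taken from the paper).
lut = [-1.0, -0.72, -0.55, -0.41, -0.29, -0.18, -0.08, 0.0,
       0.07, 0.16, 0.26, 0.38, 0.52, 0.69, 0.85, 1.0]
codes = [0, 15, 7, 3, 12]
x = [1.0, 2.0, 3.0, 4.0, 5.0]
scale = 0.5

packed = pack_nibbles(codes)
result = lut_dot(packed, lut, scale, x)
print(len(packed), result)
```

The table is small enough to stay in fast on-chip storage, so the extra lookup costs little compared with reading the weights from memory.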
The library’s design focuses on minimizing memory latency by arranging matrix data in a format that tensor cores can directly use, avoiding the need for on-the-fly transpositions in shared memory for small batch sizes. While int4 kernels in tinygemm show the highest speedup (nearly 3x), any4 and nf4 still achieve significant speedups of up to 2x compared to standard bfloat16 implementations, demonstrating their practical benefits for real-world LLM deployment.
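A rough back-of-the-envelope calculation helps put these speedups in context. The sketch below assumes that small-batch inference is limited mainly by how many bytes of weights must be read, and it uses hypothetical storage parameters that the article does not specify: one 16-bit scale per group of 128 weights, and, for the lookup-table format, one 16-entry table of 16-bit values per row of 4096 weights.

```python
def bits_per_weight(code_bits, group_size, scale_bits=16,
                    lut_entries=0, lut_bits=16, row_len=4096):
    """Average storage per weight, including per-group scales and an optional per-row LUT.

    All default sizes are illustrative assumptions, not figures from the paper.
    """
    scale_overhead = scale_bits / group_size
    lut_overhead = lut_entries * lut_bits / row_len
    return code_bits + scale_overhead + lut_overhead

bf16 = 16.0
int4_like = bits_per_weight(4, group_size=128)                   # codes + scales
any4_like = bits_per_weight(4, group_size=128, lut_entries=16)   # codes + scales + LUT

print(f"int4-like: {int4_like} bits/weight, ideal ratio vs bf16: {bf16 / int4_like:.2f}x")
print(f"any4-like: {any4_like} bits/weight, ideal ratio vs bf16: {bf16 / any4_like:.2f}x")
```

Under these assumptions both 4-bit formats move a little under a quarter of the bytes that bfloat16 does, so the data-volume ceiling is just below 4x. The reported figures of nearly 3x for int4 and up to 2x for any4 and nf4 sit under that ceiling, which is consistent with the extra dequantization work the kernels have to do.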
A Step Forward for LLM Deployment
Together, any4 and tinygemm make LLMs more practical to deploy. The pair combines an accurate, flexible 4-bit quantization format with kernels optimized for fast inference, reducing the memory and compute that large language models demand. The code is open source at https://github.com/facebookresearch/any4, so the broader research community can integrate and build on this work.


