any4: A New Approach to 4-bit Quantization for Large Language Models

TLDR: any4 is a learned 4-bit weight quantization method for large language models (LLMs) that can represent weights with arbitrary numeric values. It achieves higher accuracy than existing 4-bit formats (int4, fp4, nf4) across various LLM families and sizes, without requiring pre-processing of weights or activations. The research also introduces tinygemm, a GPU matrix multiplication library optimized for low-latency LLM inference, which efficiently implements any4. A key innovation is that any4 can be calibrated using a single, diverse data sample, which significantly simplifies calibration.

Large Language Models (LLMs) are powerful, but their size often makes them challenging to run efficiently, especially on devices with limited memory or for fast inference. A key technique to address this is quantization, which reduces the precision of the model’s weights, making them smaller and faster to process. However, traditional 4-bit quantization methods often compromise accuracy or require complex pre-processing steps.
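To make the idea of reducing weight precision concrete, here is a minimal sketch of a conventional uniform 4-bit scheme, the kind of baseline that int4 represents. It assumes asymmetric min-max scaling over a single weight vector; the function names are hypothetical and are not taken from the paper or any library.

```python
import numpy as np

def quantize_int4(w: np.ndarray):
    """Asymmetric uniform 4-bit quantization of a 1-D weight vector (illustrative only)."""
    w_min, w_max = float(w.min()), float(w.max())
    # 16 levels span 15 steps; fall back to 1.0 if all weights are identical.
    scale = (w_max - w_min) / 15.0 or 1.0
    codes = np.clip(np.round((w - w_min) / scale), 0, 15).astype(np.uint8)
    return codes, scale, w_min

def dequantize_int4(codes: np.ndarray, scale: float, w_min: float) -> np.ndarray:
    """Map 4-bit codes back to approximate floating-point weights."""
    return codes.astype(np.float32) * scale + w_min

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
codes, scale, w_min = quantize_int4(w)
w_hat = dequantize_int4(codes, scale, w_min)
max_err = float(np.abs(w - w_hat).max())  # bounded by roughly half a step
```

Each weight is stored as one of 16 evenly spaced values, so the rounding error is at most about half a step regardless of where the weights actually cluster.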

A new research paper introduces a novel solution called any4, a learned 4-bit weight quantization method designed specifically for LLMs. Unlike previous approaches, any4 can create arbitrary numeric representations without needing to pre-process the model’s weights or activations. This flexibility allows it to adapt more effectively to the unique characteristics of LLM weights.
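The article does not describe how any4 learns its representations, so the following is only a sketch of the general idea of a learned 16-value codebook. It assumes plain k-means (Lloyd's algorithm) applied to one row of weights; `learn_codebook` and its parameters are hypothetical names, and the paper's actual procedure may differ.

```python
import numpy as np

def learn_codebook(row: np.ndarray, n_levels: int = 16, n_iters: int = 20):
    """Fit n_levels arbitrary values to one weight row with k-means (an assumption)."""
    row = row.astype(np.float64)
    # Start from evenly spaced quantiles so every level begins near real data.
    levels = np.quantile(row, np.linspace(0.0, 1.0, n_levels))
    for _ in range(n_iters):
        # Assignment step: each weight picks its nearest level.
        codes = np.abs(row[:, None] - levels[None, :]).argmin(axis=1)
        # Update step: each level moves to the mean of its assigned weights.
        for k in range(n_levels):
            members = row[codes == k]
            if members.size:
                levels[k] = members.mean()
    codes = np.abs(row[:, None] - levels[None, :]).argmin(axis=1).astype(np.uint8)
    return levels.astype(np.float32), codes

rng = np.random.default_rng(0)
row = rng.normal(size=2048).astype(np.float32)
levels, codes = learn_codebook(row)
row_hat = levels[codes]                      # reconstruct weights from 4-bit codes
mse = float(np.mean((row - row_hat) ** 2))
```

Unlike a fixed format, the 16 levels here are free to land wherever the weights are dense, which is the sense in which the representation is "arbitrary".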

Superior Accuracy and Efficiency

The researchers evaluated any4 across a variety of LLMs, including Llama 2, Llama 3, Mistral, and Mixtral, and found that it consistently delivers higher accuracy compared to other common 4-bit numeric formats like int4, fp4, and nf4. What’s more, any4 achieves this without the need for additional pre-processing techniques, making it simpler to implement. It even competes favorably with more complex methods that do require such pre-processing, like AWQ and GPTQ.

The paper also explores lower bitwidths, reporting competitive results for the 3-bit (any3) and 2-bit (any2) variants. A significant practical advantage of any4 is its calibration process: a single, carefully chosen diverse sample of data is enough, rather than the hundreds of samples typically required by other quantization approaches. This drastically simplifies and speeds up the calibration step.

Introducing tinygemm for Faster Inference

To ensure efficient execution of any4 and other quantization methods, the researchers have open-sourced tinygemm, a GPU matrix multiplication library. This library is specifically optimized for low-latency LLM inference, particularly for small batch sizes (1 to 16) on Nvidia Ampere generation GPUs and newer. tinygemm implements any4 using a GPU-efficient lookup table (LUT) strategy, which helps maintain speed despite the custom numeric representations.
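The lookup-table idea can be illustrated on the CPU with NumPy. This sketch assumes two 4-bit codes packed per byte and one 16-entry table per weight row; it is not tinygemm's API or its GPU memory layout, and all function names are hypothetical.

```python
import numpy as np

def pack_nibbles(codes: np.ndarray) -> np.ndarray:
    """Pack pairs of 4-bit codes (values 0..15) into single bytes, low nibble first."""
    codes = codes.astype(np.uint8)
    return (codes[..., 0::2] | (codes[..., 1::2] << 4)).astype(np.uint8)

def unpack_nibbles(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_nibbles: recover two 4-bit codes from every byte."""
    out = np.empty(packed.shape[:-1] + (packed.shape[-1] * 2,), dtype=np.uint8)
    out[..., 0::2] = packed & 0x0F
    out[..., 1::2] = packed >> 4
    return out

def lut_matvec(packed: np.ndarray, luts: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Dequantize through per-row lookup tables, then multiply by x."""
    codes = unpack_nibbles(packed).astype(np.intp)   # shape (rows, cols)
    w = np.take_along_axis(luts, codes, axis=1)      # row r reads from luts[r]
    return w @ x

rng = np.random.default_rng(0)
rows, cols = 4, 8
codes = rng.integers(0, 16, size=(rows, cols), dtype=np.uint8)
luts = rng.normal(size=(rows, 16)).astype(np.float32)
x = rng.normal(size=cols).astype(np.float32)
packed = pack_nibbles(codes)                          # half as many bytes as codes
y = lut_matvec(packed, luts, x)
```

The stored weights are only indices; the actual values live in a small table, so changing the numeric format means changing 16 numbers per row rather than the kernel's arithmetic.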

The library’s design focuses on minimizing memory latency by arranging matrix data in a format that tensor cores can directly use, avoiding the need for on-the-fly transpositions in shared memory for small batch sizes. While int4 kernels in tinygemm show the highest speedup (nearly 3x), any4 and nf4 still achieve significant speedups of up to 2x compared to standard bfloat16 implementations, demonstrating their practical benefits for real-world LLM deployment.

A Step Forward for LLM Deployment

The development of any4 and tinygemm represents a notable advancement in making LLMs more accessible and efficient. By providing a highly accurate and flexible 4-bit quantization solution that is also optimized for fast inference, this work contributes significantly to reducing the computational demands of large language models. The open-sourcing of the code at https://github.com/facebookresearch/any4 will allow the broader research community to integrate and build upon these innovations.
