New Compression Techniques Enhance Large Language Model Efficiency with 8:16 Sparsity

TLDR: A new research paper introduces an integrated methodology for compressing large language models (LLMs) more efficiently. It highlights the effectiveness of 8:16 semi-structured sparsity, which offers greater flexibility than the traditional 2:4 pattern and enables a sparse LLaMA-2-13B to match a dense LLaMA-2-7B. The paper also proposes storing critical ‘outlier’ weights in structured sparsity patterns, which improves both computational efficiency and model performance. Finally, it introduces a pre-processing step, SmoothQuant-inspired rebalancing, and a post-processing step, Variance Correction, to further reduce the performance degradation caused by sparsification.

As large language models (LLMs) continue to grow in size and complexity, finding efficient ways to compress them without losing performance is crucial. Traditional compression methods like quantization, which reduces the precision of weights, have been effective. However, structured sparsity techniques such as N:M sparsification, which keeps at most N non-zero weights in every group of M consecutive weights, often struggle because of their limited flexibility and their sensitivity to important ‘outlier’ weights.

A new research paper titled “From 2:4 to 8:16 sparsity patterns in LLMs for Outliers and Weights with Variance Correction” explores a more flexible approach: 8:16 semi-structured sparsity. This pattern keeps 8 non-zero elements in every block of 16, so it removes the same 50% of weights as the commonly used 2:4 pattern (2 non-zero elements in every 4) while allowing far more ways to arrange the surviving weights. The trade-off is slightly more storage to record which positions are kept (0.875 bits/element vs. 0.75 bits/element for 2:4), but the added flexibility is shown to be highly beneficial for maintaining model accuracy.
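
The flexibility and storage figures above can be checked with a few lines of arithmetic. The sketch below assumes that the mask cost is the number of whole bits needed to index one of the C(M, N) valid layouts of a block, divided by the block size M; that assumption is not stated in the article, but it reproduces the 0.75 and 0.875 figures. The function names are illustrative.

```python
from math import ceil, comb, log2


def mask_configurations(n: int, m: int, span: int) -> int:
    """Number of valid non-zero layouts for an n:m pattern over `span` elements."""
    assert span % m == 0, "span must be a whole number of blocks"
    return comb(m, n) ** (span // m)


def mask_bits_per_element(n: int, m: int) -> float:
    """Whole bits needed to index one block layout, averaged over the block."""
    return ceil(log2(comb(m, n))) / m


for n, m in [(2, 4), (8, 16)]:
    layouts = mask_configurations(n, m, span=16)
    bits = mask_bits_per_element(n, m)
    print(f"{n}:{m}  layouts per 16 elements = {layouts}  bits/element = {bits}")
# prints:
# 2:4  layouts per 16 elements = 1296  bits/element = 0.75
# 8:16  layouts per 16 elements = 12870  bits/element = 0.875
```

Over the same 16 weights, 8:16 admits roughly ten times as many layouts as four independent 2:4 groups, at a cost of 0.125 extra bits per element.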

The researchers, Egor Maximov, Yulia Kuzkina, Azamat Kanametov, Alexander Prutko, Aleksei Goncharov, Maxim Zhelnin, and Egor Shvetsov, also tackle the issue of outlier weights. These are a small percentage of highly important weights that, if removed or altered, can significantly degrade a model’s performance. Instead of using unstructured sparsity for these outliers, which can be inefficient due to irregular memory access, the paper proposes using structured sparsity patterns (like 4:256, 8:256, or 16:256) to store them. This approach not only improves computational efficiency but also leads to better overall model performance compared to unstructured methods.
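
As a rough illustration of what a structured outlier pattern looks like, the sketch below builds a 4:256 mask for salient weights and an 8:16 mask for the remaining weights. It uses plain weight magnitude as the saliency score, which is an assumption made for brevity; the article does not say which saliency metric the paper uses, and `nm_topk_mask` is a hypothetical helper, not the authors' code.

```python
import numpy as np


def nm_topk_mask(weights: np.ndarray, n: int, m: int) -> np.ndarray:
    """Boolean mask keeping the n largest-magnitude entries in each block of m columns."""
    rows, cols = weights.shape
    assert cols % m == 0, "row length must be a multiple of the block size"
    blocks = np.abs(weights).reshape(rows, cols // m, m)
    # argpartition places the n largest magnitudes in the last n slots of each block.
    top = np.argpartition(blocks, m - n, axis=-1)[..., m - n:]
    mask = np.zeros(blocks.shape, dtype=bool)
    np.put_along_axis(mask, top, True, axis=-1)
    return mask.reshape(rows, cols)


rng = np.random.default_rng(0)
W = rng.normal(size=(4, 512))

# Salient weights: at most 4 per block of 256, stored separately.
outlier_mask = nm_topk_mask(W, n=4, m=256)
outliers = np.where(outlier_mask, W, 0.0)

# Remaining weights: pruned to the 8:16 pattern.
remainder = np.where(outlier_mask, 0.0, W)
remainder_mask = nm_topk_mask(remainder, n=8, m=16)
remainder_sparse = np.where(remainder_mask, remainder, 0.0)
```

Because every block holds a fixed number of outliers, their positions can be indexed with a fixed-size record per block, which avoids the irregular memory access of a fully unstructured outlier list.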

Beyond sparsity patterns, the paper introduces two key pre- and post-processing techniques to further enhance sparse model performance:

SmoothQuant-Inspired Rebalancing

This technique adapts the SmoothQuant approach, originally developed for quantization, to sparsification. Weights and activations are rescaled before pruning so that their distributions are better balanced. The result is a clearer separation between important (salient) and less important weights, which makes the pruning step more effective.
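
For readers unfamiliar with SmoothQuant, the sketch below shows its original per-channel rescaling, which moves magnitude from activations into weights without changing the layer output. How the paper adapts this step for pruning is not described in the article, so this shows only the underlying transform. The `X @ W` layout, the `alpha=0.5` default, and the name `smooth_scales` are assumptions for illustration.

```python
import numpy as np


def smooth_scales(X: np.ndarray, W: np.ndarray, alpha: float = 0.5,
                  eps: float = 1e-8) -> np.ndarray:
    """Per-input-channel scales s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    act_max = np.maximum(np.abs(X).max(axis=0), eps)  # one value per input channel
    w_max = np.maximum(np.abs(W).max(axis=1), eps)    # one value per input channel
    return act_max ** alpha / w_max ** (1.0 - alpha)


rng = np.random.default_rng(0)
X = rng.normal(size=(32, 8))    # activations: tokens x input channels
X[:, 3] *= 50.0                 # one channel with large activation outliers
W = rng.normal(size=(8, 16))    # weights: input channels x output channels

s = smooth_scales(X, W)
X_bal = X / s                   # activations shrink on outlier channels
W_bal = W * s[:, None]          # weights absorb the scale on the same channels

# The layer output is unchanged: (X / s) @ (diag(s) @ W) == X @ W.
assert np.allclose(X @ W, X_bal @ W_bal)
```

With `alpha=0.5`, each input channel ends up with the same maximum magnitude on the activation side and the weight side, which is the sense in which the two distributions are balanced.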

Variance Correction (VC)

Variance Correction is a post-pruning adjustment that aims to mitigate the performance degradation caused by weight removal. It rescales the pruned weights so that they preserve the variance of the original weights, which keeps activation statistics within the model stable. The paper presents this simple method as a new contribution.
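
A minimal sketch of the idea follows. It assumes the correction is applied per output row and that the variance is taken over all entries of the row, zeros included; the article does not specify either detail, so the paper's exact formulation may differ. `variance_correct` is a hypothetical name.

```python
import numpy as np


def variance_correct(W_dense: np.ndarray, W_pruned: np.ndarray,
                     eps: float = 1e-12) -> np.ndarray:
    """Rescale each pruned row so its variance matches the corresponding dense row."""
    var_dense = W_dense.var(axis=1, keepdims=True)
    var_pruned = W_pruned.var(axis=1, keepdims=True)
    # Scaling by sqrt(var_dense / var_pruned) restores the variance; zeros stay zero.
    return W_pruned * np.sqrt(var_dense / (var_pruned + eps))


rng = np.random.default_rng(0)
W = rng.normal(size=(4, 64))

# Stand-in pruning step: drop the smaller-magnitude half of each row.
threshold = np.median(np.abs(W), axis=1, keepdims=True)
W_pruned = np.where(np.abs(W) >= threshold, W, 0.0)

W_vc = variance_correct(W, W_pruned)
print(W.var(axis=1))
print(W_vc.var(axis=1))  # matches the dense variances row by row
```

The sparsity pattern is untouched because the correction is a pure rescaling, so it can be applied after any pruning method.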

The research evaluated these methods on various large language models, including LLaMA2 (7B/13B), LLaMA3 (8B), and Mistral (7B), using standard text corpora like WikiText and C4, and zero-shot reasoning benchmarks. The results are promising:

The 8:16 sparsity pattern significantly improves model performance. For instance, a sparse LLaMA-2-13B model using this pattern matched the performance of a dense LLaMA-2-7B model, demonstrating its practical value for resource-constrained environments.
Storing salient weights using structured sparsity patterns consistently improved perplexity and accuracy across models, outperforming unstructured approaches.
The combination of SmoothQuant-inspired rebalancing, Variance Correction, and blockwise fine-tuning (EBFT) achieved the lowest perplexity scores, highlighting the effectiveness of their integrated methodology.
The study also revealed that different models exhibit varying robustness to pruning. Mistral, for example, showed greater inherent robustness compared to LLaMA3, and Variance Correction benefited LLaMA3 significantly but had a negative impact on Mistral, underscoring the importance of architectural dependencies.

In conclusion, this work demonstrates that structured sparsity, particularly the 8:16 pattern, combined with targeted techniques for salient weight preservation and optimized pre/post-processing, can enable efficient deployment of LLMs without compromising performance. While 8:16 sparsity is not yet natively supported on modern hardware, this research paves the way for future hardware development that could unlock even greater efficiency for large language models.
