TLDR: A new research paper introduces an integrated methodology for compressing large language models (LLMs) more efficiently. It shows that 8:16 semi-structured sparsity, which is more flexible than the traditional 2:4 pattern, lets a sparse LLaMa-2-13B match a dense LLaMa-2-7B. The paper also proposes storing critical ‘outlier’ weights in structured sparsity patterns, which improves both computational efficiency and model performance. Finally, it introduces a pre-processing step (SmoothQuant-inspired rebalancing) and a post-processing step (Variance Correction) that further reduce the performance degradation caused by sparsification.
As large language models (LLMs) continue to grow in size and complexity, finding efficient ways to compress them without losing performance is crucial. Traditional compression methods like quantization, which reduces the precision of weights, have been effective. However, structured sparsity techniques such as N:M sparsification, which keeps N non-zero weights in every group of M, often struggle because of their limited flexibility and their sensitivity to important ‘outlier’ weights.
A new research paper titled “From 2:4 to 8:16 sparsity patterns in LLMs for Outliers and Weights with Variance Correction” explores a more flexible approach: 8:16 semi-structured sparsity. This pattern keeps 8 non-zero elements in every group of 16, which allows far more configurations than the commonly used 2:4 pattern (2 non-zero elements in every group of 4): a single 8:16 group admits 12,870 distinct masks, whereas the same 16 weights split into four 2:4 groups admit only 6^4 = 1,296. While 8:16 requires slightly more storage per element (0.875 bits/element vs. 0.75 bits/element for 2:4), the added flexibility is shown to be highly beneficial for maintaining model accuracy.
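The flexibility and storage figures above can be reproduced with a few lines of arithmetic. The sketch below assumes that each group's sparsity mask is stored as a single index of ceil(log2 C(M, N)) bits; that encoding is an assumption made here because it matches the quoted 0.75 and 0.875 bits/element, not something this summary states about the paper. The function name `nm_pattern_stats` is hypothetical.

```python
import math

def nm_pattern_stats(n: int, m: int) -> tuple[int, float]:
    """Return (valid masks per group, index bits per element) for N:M sparsity.

    Assumption: one mask index per group, rounded up to whole bits.
    """
    configs = math.comb(m, n)              # ways to place n non-zeros in m slots
    bits = math.ceil(math.log2(configs))   # whole bits needed to name one mask
    return configs, bits / m

for n, m in [(2, 4), (8, 16)]:
    configs, bits_per_elem = nm_pattern_stats(n, m)
    print(f"{n}:{m} -> {configs} masks per group, {bits_per_elem} bits/element")
```

Under this encoding, 2:4 has 6 masks per group (3 bits over 4 elements) and 8:16 has 12,870 masks per group (14 bits over 16 elements).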
The researchers, Egor Maximov, Yulia Kuzkina, Azamat Kanametov, Alexander Prutko, Aleksei Goncharov, Maxim Zhelnin, and Egor Shvetsov, also tackle the issue of outlier weights. These are a small percentage of highly important weights that, if removed or altered, can significantly degrade a model’s performance. Instead of storing these outliers with unstructured sparsity, which can be inefficient due to irregular memory access, the paper proposes storing them in structured sparsity patterns such as 4:256, 8:256, or 16:256. This approach not only improves computational efficiency but also leads to better overall model performance than unstructured methods.
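To make the N:256 idea concrete, here is a minimal sketch that marks a fixed number of outliers in every group of 256 consecutive weights. It assumes that absolute magnitude is the saliency score and that groups run along each row; both are illustrative assumptions, and the paper's actual selection criterion may differ. The name `structured_outlier_mask` is hypothetical.

```python
import numpy as np

def structured_outlier_mask(weights: np.ndarray, n: int = 8, m: int = 256) -> np.ndarray:
    """Boolean mask marking the n largest-magnitude weights in each group of m.

    Assumes weights is 2-D and its second dimension is divisible by m.
    Magnitude is a stand-in saliency score chosen for this sketch.
    """
    rows, cols = weights.shape
    groups = np.abs(weights).reshape(rows, cols // m, m)
    # Indices of the n largest entries per group (their internal order is irrelevant).
    top = np.argpartition(groups, m - n, axis=-1)[..., m - n:]
    mask = np.zeros(groups.shape, dtype=bool)
    np.put_along_axis(mask, top, True, axis=-1)
    return mask.reshape(rows, cols)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 512))
mask = structured_outlier_mask(w, n=8, m=256)
print(mask.reshape(4, 2, 256).sum(axis=-1))  # outlier count per group
```

Because every group holds the same number of outliers, their positions can be stored in a fixed-size index table, which avoids the irregular memory access of a fully unstructured layout.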
Beyond sparsity patterns, the paper introduces two key pre- and post-processing techniques to further enhance sparse model performance:
SmoothQuant-Inspired Rebalancing
This technique adapts the SmoothQuant philosophy, originally developed for quantization, to sparsification. It preprocesses weights and activations to balance their distributions before sparsification. This separates important (salient) weights more clearly from less important ones, making the pruning process more effective.
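As a rough illustration, the sketch below applies the original SmoothQuant per-channel scaling, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), which divides activations by s and multiplies the matching weight rows by s so the layer output is unchanged. How the paper modifies this scheme for sparsification is not described here, so treat this as the standard SmoothQuant transform rather than the authors' exact procedure. The function name and the default `alpha=0.5` are assumptions.

```python
import numpy as np

def smooth_rebalance(x, w, alpha=0.5, eps=1e-8):
    """Migrate scale between activations and weights, channel by channel.

    x: activations, shape (tokens, in_features)
    w: weights, shape (in_features, out_features)
    Returns (x_smoothed, w_smoothed, s) with x_smoothed @ w_smoothed equal to x @ w.
    """
    act_max = np.maximum(np.abs(x).max(axis=0), eps)  # per input channel
    w_max = np.maximum(np.abs(w).max(axis=1), eps)    # per input channel
    s = act_max ** alpha / w_max ** (1.0 - alpha)
    return x / s, w * s[:, None], s

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 8))
x[:, 3] *= 50.0                       # one activation channel with outliers
w = rng.normal(size=(8, 4))

x_s, w_s, s = smooth_rebalance(x, w)
print(np.allclose(x @ w, x_s @ w_s))  # the layer output is preserved
```

After the transform, weights that multiply large-activation channels are scaled up, so a magnitude-based pruning criterion applied afterwards is more likely to keep them.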
Variance Correction (VC)
A novel post-pruning adjustment, Variance Correction, aims to mitigate the performance degradation caused by weight removal. It rescales the weights of the pruned layer so that they preserve the original variance, thereby keeping activation statistics within the model stable. This simple yet effective method is a new contribution to the field.
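One plausible reading of this step is sketched below: prune with a magnitude-based 8:16 mask, then multiply each row of the sparse matrix by a single factor so its variance matches that of the dense row. The per-row granularity, the inclusion of zeros in the variance, and the magnitude-based mask are all assumptions made for illustration; the paper's exact formula is not given in this summary. Both function names are hypothetical.

```python
import numpy as np

def nm_magnitude_mask(w, n=8, m=16):
    """Keep the n largest-magnitude weights in each group of m along every row."""
    rows, cols = w.shape
    groups = np.abs(w).reshape(rows, cols // m, m)
    top = np.argpartition(groups, m - n, axis=-1)[..., m - n:]
    mask = np.zeros(groups.shape, dtype=bool)
    np.put_along_axis(mask, top, True, axis=-1)
    return mask.reshape(rows, cols)

def variance_correct(w_dense, w_sparse, eps=1e-12):
    """Rescale each sparse row so its variance matches the dense row.

    Assumption: variance is taken over the whole row, zeros included.
    Zeroed weights stay zero because the correction is a pure rescaling.
    """
    ratio = w_dense.var(axis=1, keepdims=True) / (w_sparse.var(axis=1, keepdims=True) + eps)
    return w_sparse * np.sqrt(ratio)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 64))
mask = nm_magnitude_mask(w, n=8, m=16)
w_sparse = w * mask
w_vc = variance_correct(w, w_sparse)

print(np.allclose(w_vc.var(axis=1), w.var(axis=1)))  # row variances restored
```

Since scaling a row by c multiplies its variance by c squared, a factor of sqrt(var_dense / var_sparse) restores the original variance without changing which weights are zero.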
The researchers evaluated these methods on several large language models, including LLaMA2 (7B/13B), LLaMA3 (8B), and Mistral (7B), using standard text corpora like WikiText and C4, and zero-shot reasoning benchmarks. The results are promising:
- The 8:16 sparsity pattern significantly improves model performance relative to 2:4. For instance, a sparse LLaMa-2-13B model using this pattern achieved the same performance as a dense LLaMa-2-7B model, demonstrating its practical value for resource-constrained environments.
- Storing salient weights using structured sparsity patterns consistently improved perplexity and accuracy across models, outperforming unstructured approaches.
- The combination of SmoothQuant-inspired rebalancing, Variance Correction, and blockwise fine-tuning (EBFT) achieved the lowest perplexity scores, highlighting the effectiveness of the integrated methodology.
- The study also revealed that different models vary in their robustness to pruning. Mistral, for example, was inherently more robust than LLaMA3, and Variance Correction benefited LLaMA3 significantly but had a negative impact on Mistral, which shows that the benefit of each technique depends on the model architecture.
In conclusion, this work demonstrates that structured sparsity, particularly the 8:16 pattern, combined with targeted techniques for salient weight preservation and optimized pre/post-processing, can enable efficient deployment of LLMs without compromising performance. While 8:16 sparsity is not yet natively supported on modern hardware, this research paves the way for future hardware development that could unlock even greater efficiency for large language models.


