TLDR: COMPACT is a novel, training-free pruning method for Large Language Models (LLMs) that improves efficiency and deployment compatibility. It addresses limitations of prior pruning techniques by jointly performing rare-vocabulary pruning, which shrinks the embedding layers, and common-token-weighted FFN (Feed-Forward Network) channel pruning. This dual approach is scale-adaptive, maintains a standard transformer architecture, and delivers state-of-the-art performance across LLM families and sizes (0.5B–70B), with significant reductions in parameter count and GPU memory alongside improved inference speed.
Large Language Models (LLMs) have become incredibly powerful, but their massive size often makes them expensive and slow to deploy, especially on devices with limited resources or for applications needing quick responses. Imagine trying to run a huge AI model on your phone or in a real-time chat application – it needs to be efficient in terms of memory, speed, and cost. This is where a technique called ‘pruning’ comes in, aiming to shrink these models without losing too much of their performance.
However, existing pruning methods have drawbacks. Some, known as ‘width pruning,’ trim parts of each layer, but this can break the standard transformer structure, making the pruned model hard to run on common inference software. Others, called ‘depth pruning,’ remove entire layers, which can cause sudden, significant drops in accuracy. Furthermore, many pruning techniques don’t consider where the ‘fat’ in an LLM truly lies – in the vocabulary (how many words it knows) or in the processing layers (how it transforms information). They also often treat all words as equally important, even though some words are used far more frequently than others.
Introducing COMPACT: A Smarter Pruning Approach
Researchers Eugene Kwek and Wenpeng Yin from Penn State University have introduced a new method called COMPACT (COMMON-TOKEN–OPTIMIZED MODEL PRUNING ACROSS CHANNELS AND TOKENS) that addresses these limitations. COMPACT is designed to be both effective and practical, offering a dual approach to making LLMs more efficient.
The core idea behind COMPACT stems from two key observations:
- The way parameters are distributed within an LLM changes with its size. In smaller models, the vocabulary (the embedding and unembedding layers) holds a significant proportion of parameters. In contrast, in larger models, the Feed-Forward Networks (FFNs), which are the main processing blocks, dominate the parameter count.
- Not all words are created equal. Natural language follows a Zipf-like pattern: a few common words are used very frequently, while many rare words appear only occasionally and contribute much less to the model’s overall performance (see the sketch after this list).
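To make the second observation concrete, here is a tiny, self-contained Python sketch. It is purely illustrative and not from the paper: the toy corpus and whitespace tokenization stand in for counting a real tokenizer’s tokens over a large calibration corpus.

```python
# Illustrative only: how quickly the most common tokens cover a corpus.
# A real measurement would count model-tokenizer tokens over far more text.
from collections import Counter

corpus = (
    "the quick brown fox jumps over the lazy dog . "
    "the dog sleeps while the fox runs . the end ."
).split()

counts = Counter(corpus)
total = sum(counts.values())

# Rank tokens by frequency and report the cumulative coverage of the top-k.
covered = 0
for k, (token, c) in enumerate(counts.most_common(), start=1):
    covered += c
    print(f"top-{k:2d} tokens cover {covered / total:.0%} of the corpus")
```

On real corpora this curve rises steeply: a small fraction of the vocabulary typically accounts for the overwhelming majority of token occurrences, which is exactly what makes rare-vocabulary pruning cheap in accuracy terms.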
How COMPACT Works
COMPACT combines two complementary pruning modules:
1. Vocabulary Pruning: This module removes rare words from the model’s vocabulary. Since rare words are used infrequently and contribute little to performance, eliminating them directly shrinks the embedding and unembedding layers. This is particularly effective for small- to medium-sized LLMs, where the vocabulary layers make up a significant share of the model’s size. The step is also very efficient, requiring no complex data analysis or additional training (see the vocabulary-pruning sketch after this list).
2. Common-Token–Weighted FFN Pruning: While vocabulary pruning handles the word-related parts, FFN pruning tackles the processing layers, which dominate the parameter count of larger models. Instead of treating all activations (the internal signals within the network) equally, COMPACT weights them by how frequently each token occurs, prioritizing the FFN channels (pathways) that matter most for processing the tokens that remain after vocabulary pruning. This preserves the model’s ability to handle common language (a channel-pruning sketch follows the next paragraph).
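Here is a minimal PyTorch sketch of the vocabulary-pruning idea. It is a toy illustration under assumed dimensions and made-up frequency counts, not the authors’ implementation; the `old_to_new` remapping is a hypothetical helper for keeping the tokenizer usable after pruning.

```python
# Toy sketch of vocabulary pruning: keep the most frequent token IDs and
# slice the embedding and unembedding matrices down to those rows.
import torch

vocab_size, hidden = 8, 4                      # assumed toy dimensions
embedding = torch.randn(vocab_size, hidden)    # input embedding table
unembedding = torch.randn(vocab_size, hidden)  # output (lm_head) weights

token_freq = torch.tensor([900, 450, 300, 120, 40, 5, 2, 1])  # made-up corpus counts
keep = torch.argsort(token_freq, descending=True)[:5]         # keep the 5 most common IDs
keep, _ = torch.sort(keep)                                    # preserve original ID order

pruned_embedding = embedding[keep]      # shape: (5, hidden)
pruned_unembedding = unembedding[keep]  # shape: (5, hidden)

# Remap old token IDs to new positions so the tokenizer still works.
old_to_new = {int(old): new for new, old in enumerate(keep.tolist())}
print(pruned_embedding.shape, old_to_new)
```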
Crucially, COMPACT integrates these two modules, using the knowledge of which tokens are rare to guide the FFN pruning process. This joint approach ensures that the pruning is guided by the model’s structure and the linguistic nature of the task.
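And a companion sketch of common-token-weighted FFN channel pruning, showing how the same token-frequency information can guide the channel scores. The shapes, ReLU activation, and `token_weight` vector are all assumptions for illustration; the paper’s exact importance score may differ.

```python
# Sketch: score each FFN channel by its activation magnitude on calibration
# tokens, weighted by token frequency, then drop the lowest-scoring channels.
import torch

hidden, ffn_dim, n_tokens = 4, 8, 16
up_proj = torch.randn(ffn_dim, hidden)    # FFN input projection
down_proj = torch.randn(hidden, ffn_dim)  # FFN output projection

calib_hidden = torch.randn(n_tokens, hidden)  # assumed calibration activations
token_weight = torch.rand(n_tokens)           # assumed per-token frequency weights

acts = torch.relu(calib_hidden @ up_proj.T)                   # (n_tokens, ffn_dim)
importance = (token_weight[:, None] * acts.abs()).sum(dim=0)  # weighted channel scores

keep = torch.argsort(importance, descending=True)[:6]  # keep the top 6 of 8 channels
keep, _ = torch.sort(keep)

pruned_up = up_proj[keep]         # (6, hidden): fewer output channels
pruned_down = down_proj[:, keep]  # (hidden, 6): matching input channels
print(pruned_up.shape, pruned_down.shape)
```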
Key Advantages of COMPACT
COMPACT offers several significant benefits:
- Scale-Adaptive: It can be adjusted to suit different model sizes. For smaller LLMs, vocabulary pruning can be emphasized, while for larger ones, FFN pruning takes precedence.
- Deployment-Friendly: Unlike many width pruning methods, COMPACT maintains the standard transformer architecture, so pruned models remain compatible with popular inference engines like Hugging Face Transformers and vLLM, making them practical for real-world deployment (see the loading sketch after this list).
- Training-Free: The pruning itself requires no retraining, which in other approaches can take hours or days; COMPACT prunes models in minutes on a single GPU.
- Efficiency Gains: Experiments show substantial reductions in parameters, GPU memory usage, and improved inference throughput (speed).
- Robust Performance: It achieves state-of-the-art performance across various LLM families (Qwen, LLaMA, Gemma) and scales (0.5B to 70B parameters), even at high pruning ratios. It also shows a smooth degradation in performance as more is pruned, avoiding the abrupt drops seen in some other methods.
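To make the deployment-friendly point above concrete: because a COMPACT-pruned checkpoint keeps the stock transformer layout, it should load with ordinary Hugging Face code, no custom modeling classes required. The checkpoint path below is hypothetical.

```python
# Loading a pruned checkpoint with stock Hugging Face code (hypothetical path).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/compact-pruned-model"  # hypothetical local checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

inputs = tokenizer("Pruned models should still generate text:", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```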
The researchers demonstrated COMPACT’s effectiveness on a diverse set of LLMs and benchmarks, showing that it consistently outperforms baselines, especially on challenging tasks and smaller models where other methods often fail. For instance, it significantly reduces GPU memory usage and improves inference speed for both text classification and generation tasks.
In conclusion, COMPACT offers a practical and powerful solution for making large language models more efficient. By intelligently pruning rare vocabulary and common-token-weighted FFN channels, it provides a method that is adaptable to different model sizes, easy to deploy, and delivers strong performance and efficiency gains. You can read the full research paper here: COMPACT: COMMON-TOKEN–OPTIMIZED MODEL PRUNING ACROSS CHANNELS AND TOKENS.