TLDR: COMPACT is a novel, training-free pruning method for Large Language Models (LLMs) that improves efficiency and deployment compatibility. It addresses limitations of prior pruning techniques by jointly performing rare-vocabulary pruning, which shrinks the embedding layers, and common-token-weighted FFN (Feed-Forward Network) channel pruning. This dual approach is scale-adaptive, maintains a standard transformer architecture, and delivers state-of-the-art performance across LLM families and sizes (0.5B–70B), with significant reductions in parameter count and GPU memory alongside improved inference speed.
Large Language Models (LLMs) have become incredibly powerful, but their massive size often makes them expensive and slow to deploy, especially on devices with limited resources or for applications needing quick responses. Imagine trying to run a huge AI model on your phone or in a real-time chat application – it needs to be efficient in terms of memory, speed, and cost. This is where a technique called ‘pruning’ comes in, aiming to shrink these models without losing too much of their performance.
However, existing pruning methods have drawbacks. Some, known as ‘width pruning,’ trim parts of each layer, but this can break the standard transformer structure, making the pruned model hard to run on common inference software. Others, called ‘depth pruning,’ remove entire layers, which can cause sudden, significant drops in accuracy. Furthermore, many pruning techniques don’t consider where the ‘fat’ in an LLM truly lies – in the vocabulary (how many words it knows) or in the processing layers (how it transforms information). They also often treat all words as equally important, even though some words are used far more frequently than others.
Introducing COMPACT: A Smarter Pruning Approach
Researchers Eugene Kwek and Wenpeng Yin from Penn State University have introduced a new method called COMPACT (COMMON-TOKEN–OPTIMIZED MODEL PRUNING ACROSS CHANNELS AND TOKENS) that addresses these limitations. COMPACT is designed to be both effective and practical, offering a dual approach to making LLMs more efficient.
The core idea behind COMPACT stems from two key observations:
- The way parameters are distributed within an LLM changes with its size. In smaller models, the vocabulary (the embedding and unembedding layers) holds a significant proportion of parameters. In contrast, in larger models, the Feed-Forward Networks (FFNs), which are the main processing blocks, dominate the parameter count.
- Not all words are created equal. Natural language follows a Zipf-like pattern: a few common words are used very frequently, while many rare words appear only occasionally and contribute much less to the model’s overall performance (see the sketch after this list).
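To make the second observation concrete, here is a tiny, self-contained Python sketch. It is purely illustrative and not from the paper: the toy corpus and whitespace tokenization stand in for counting a real tokenizer’s tokens over a large calibration corpus.

```python
# Illustrative only: how quickly the most common tokens cover a corpus.
# A real measurement would count model-tokenizer tokens over far more text.
from collections import Counter

corpus = (
    "the quick brown fox jumps over the lazy dog . "
    "the dog sleeps while the fox runs . the end ."
).split()

counts = Counter(corpus)
total = sum(counts.values())

# Rank tokens by frequency and report the cumulative coverage of the top-k.
covered = 0
for k, (token, c) in enumerate(counts.most_common(), start=1):
    covered += c
    print(f"top-{k:2d} tokens cover {covered / total:.0%} of the corpus")
```

On real corpora this curve rises steeply: a small fraction of the vocabulary typically accounts for the overwhelming majority of token occurrences, which is exactly what makes rare-vocabulary pruning cheap in accuracy terms.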
How COMPACT Works
COMPACT combines two complementary pruning modules:
1. Vocabulary Pruning: This module removes rare words from the model’s vocabulary. Since rare words are used infrequently and contribute little to performance, eliminating them directly shrinks the embedding and unembedding layers. This is particularly effective for small- to medium-sized LLMs, where the vocabulary layers make up a significant share of the model’s size. The step is also very efficient, requiring no complex data analysis or additional training (see the vocabulary-pruning sketch after this list).
2. Common-Token–Weighted FFN Pruning: While vocabulary pruning handles the word-related parts, FFN pruning tackles the processing layers, which dominate the parameter count of larger models. Instead of treating all activations (the internal signals within the network) equally, COMPACT weights them by how frequently each token occurs, prioritizing the FFN channels (pathways) that matter most for processing the tokens that remain after vocabulary pruning. This preserves the model’s ability to handle common language (a channel-pruning sketch follows the next paragraph).
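Here is a minimal PyTorch sketch of the vocabulary-pruning idea. It is a toy illustration under assumed dimensions and made-up frequency counts, not the authors’ implementation; the `old_to_new` remapping is a hypothetical helper for keeping the tokenizer usable after pruning.

```python
# Toy sketch of vocabulary pruning: keep the most frequent token IDs and
# slice the embedding and unembedding matrices down to those rows.
import torch

vocab_size, hidden = 8, 4                      # assumed toy dimensions
embedding = torch.randn(vocab_size, hidden)    # input embedding table
unembedding = torch.randn(vocab_size, hidden)  # output (lm_head) weights

token_freq = torch.tensor([900, 450, 300, 120, 40, 5, 2, 1])  # made-up corpus counts
keep = torch.argsort(token_freq, descending=True)[:5]         # keep the 5 most common IDs
keep, _ = torch.sort(keep)                                    # preserve original ID order

pruned_embedding = embedding[keep]      # shape: (5, hidden)
pruned_unembedding = unembedding[keep]  # shape: (5, hidden)

# Remap old token IDs to new positions so the tokenizer still works.
old_to_new = {int(old): new for new, old in enumerate(keep.tolist())}
print(pruned_embedding.shape, old_to_new)
```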
Crucially, COMPACT integrates these two modules, using the knowledge of which tokens are rare to guide the FFN pruning process. This joint approach ensures that the pruning is guided by the model’s structure and the linguistic nature of the task.
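And a companion sketch of common-token-weighted FFN channel pruning, showing how the same token-frequency information can guide the channel scores. The shapes, ReLU activation, and `token_weight` vector are all assumptions for illustration; the paper’s exact importance score may differ.

```python
# Sketch: score each FFN channel by its activation magnitude on calibration
# tokens, weighted by token frequency, then drop the lowest-scoring channels.
import torch

hidden, ffn_dim, n_tokens = 4, 8, 16
up_proj = torch.randn(ffn_dim, hidden)    # FFN input projection
down_proj = torch.randn(hidden, ffn_dim)  # FFN output projection

calib_hidden = torch.randn(n_tokens, hidden)  # assumed calibration activations
token_weight = torch.rand(n_tokens)           # assumed per-token frequency weights

acts = torch.relu(calib_hidden @ up_proj.T)                   # (n_tokens, ffn_dim)
importance = (token_weight[:, None] * acts.abs()).sum(dim=0)  # weighted channel scores

keep = torch.argsort(importance, descending=True)[:6]  # keep the top 6 of 8 channels
keep, _ = torch.sort(keep)

pruned_up = up_proj[keep]         # (6, hidden): fewer output channels
pruned_down = down_proj[:, keep]  # (hidden, 6): matching input channels
print(pruned_up.shape, pruned_down.shape)
```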
Key Advantages of COMPACT
COMPACT offers several significant benefits:
- Scale-Adaptive: It can be adjusted to suit different model sizes. For smaller LLMs, vocabulary pruning can be emphasized, while for larger ones, FFN pruning takes precedence.
- Deployment-Friendly: Unlike many width pruning methods, COMPACT maintains the standard transformer architecture, so pruned models remain compatible with popular inference engines like Hugging Face Transformers and vLLM, making them practical for real-world deployment (see the loading sketch after this list).
- Training-Free: The pruning itself requires no retraining, which in other approaches can take hours or days; COMPACT prunes models in minutes on a single GPU.
- Efficiency Gains: Experiments show substantial reductions in parameters, GPU memory usage, and improved inference throughput (speed).
- Robust Performance: It achieves state-of-the-art performance across various LLM families (Qwen, LLaMA, Gemma) and scales (0.5B to 70B parameters), even at high pruning ratios. It also shows a smooth degradation in performance as more is pruned, avoiding the abrupt drops seen in some other methods.
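To make the deployment-friendly point above concrete: because a COMPACT-pruned checkpoint keeps the stock transformer layout, it should load with ordinary Hugging Face code, no custom modeling classes required. The checkpoint path below is hypothetical.

```python
# Loading a pruned checkpoint with stock Hugging Face code (hypothetical path).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/compact-pruned-model"  # hypothetical local checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

inputs = tokenizer("Pruned models should still generate text:", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```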
The researchers demonstrated COMPACT’s effectiveness on a diverse set of LLMs and benchmarks, showing that it consistently outperforms baselines, especially on challenging tasks and smaller models where other methods often fail. For instance, it significantly reduces GPU memory usage and improves inference speed for both text classification and generation tasks.
In conclusion, COMPACT offers a practical and powerful solution for making large language models more efficient. By intelligently pruning rare vocabulary and common-token-weighted FFN channels, it provides a method that is adaptable to different model sizes, easy to deploy, and delivers strong performance and efficiency gains. You can read the full research paper here: COMPACT: COMMON-TOKEN–OPTIMIZED MODEL PRUNING ACROSS CHANNELS AND TOKENS.