TLDR: This research explores the transferability of Sparse Autoencoders (SAEs) for interpreting compressed large language models (LLMs). It finds that SAEs trained on original, uncompressed models can effectively interpret pruned models, and that simply pruning existing SAEs achieves performance comparable to training new SAEs on the compressed models, significantly reducing computational costs for model interpretability.
Large Language Models (LLMs) have become incredibly powerful, but their sheer size often makes them challenging to use efficiently, especially during inference. To tackle this, researchers have developed various compression techniques, such as pruning and quantization, which reduce the model’s footprint without significantly sacrificing performance. However, a crucial question remains: how do these compression methods affect our ability to understand what’s happening inside these models?
This is where model interpretability comes in. Among the many approaches, Sparse Autoencoders (SAEs) have emerged as a particularly effective tool. SAEs work by breaking down a model’s internal activation space into a set of distinct, interpretable features. Think of it like finding the fundamental building blocks of the model’s thoughts. The challenge, however, is that training these SAEs can be very computationally expensive.
A recent research paper, “On the transferability of Sparse Autoencoders for interpreting compressed models,” by Suchit Gupte, Vishnu Kabir Chhabra, and Mohammad Mahdi Khalili from The Ohio State University, delves into this very issue. Their work explores whether SAEs trained on an original, uncompressed LLM can still be useful for interpreting its compressed counterpart. Even more interestingly, they investigate if simply pruning an existing SAE can achieve similar results to training a brand-new SAE specifically for the compressed model.
Key Findings and Implications
The researchers found compelling evidence that SAEs trained on the original model can indeed interpret the compressed model, with only a minor dip in performance compared to an SAE trained directly on the compressed version. This suggests a significant potential for transferability.
Perhaps the most impactful finding is that by simply pruning the original SAE itself, the performance achieved is comparable to that of an SAE trained from scratch on the pruned model. This is a game-changer because it means we might not need to incur the extensive training costs associated with developing new SAEs for every compressed model variant. This could lead to substantial savings in computational resources and time.
How Sparse Autoencoders Work
At its core, an SAE is a neural network designed to learn a sparse representation of input data. It takes an activation vector from an LLM, encodes it into a much higher-dimensional sparse latent vector (meaning most of its values are zero), and then decodes it back to reconstruct the original input. The sparsity is key, as it pushes the SAE to identify distinct, meaningful features. Activation functions such as JumpReLU, which zeroes out any latent value below a threshold, are used to enforce this sparsity.
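To make this concrete, here is a minimal PyTorch sketch of an SAE with a JumpReLU-style activation. The dimensions, initialization, and fixed scalar threshold are illustrative choices, not the paper's implementation (real JumpReLU SAEs typically learn a per-feature threshold):

```python
import torch
import torch.nn as nn

class JumpReLUSAE(nn.Module):
    """Minimal sparse autoencoder sketch (illustrative, not the paper's code).

    Encodes a d_model activation vector into an overcomplete d_sae latent,
    zeroes sub-threshold latents (JumpReLU), and reconstructs the input.
    """

    def __init__(self, d_model: int = 768, d_sae: int = 16 * 768, theta: float = 0.1):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        self.theta = theta  # fixed threshold; real JumpReLU SAEs learn this per latent

    def forward(self, x: torch.Tensor):
        pre = (x - self.b_dec) @ self.W_enc + self.b_enc
        # JumpReLU: pass a pre-activation through only where it exceeds theta.
        z = pre * (pre > self.theta)
        x_hat = z @ self.W_dec + self.b_dec
        return x_hat, z

sae = JumpReLUSAE()
x = torch.randn(4, 768)          # a batch of residual-stream activations
x_hat, z = sae(x)
print((z != 0).float().mean())   # fraction of active latents (the sparsity)
```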
Understanding Pruning with WANDA
The study specifically focused on a pruning technique called WANDA (Pruning by Weights and Activations). Pruning aims to reduce the size and computational load of neural networks by removing less important weights. WANDA is a fast and effective method that prunes pre-trained models without retraining. Rather than looking at weight magnitude alone, it scores each weight by the product of its magnitude and the norm of the activations flowing into it, so small weights attached to strongly activated inputs can still survive pruning.
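In code, WANDA's scoring rule is simple. The sketch below prunes one linear layer by zeroing, within each output row, the weights with the lowest |weight| × activation-norm score. The per-row comparison group follows the WANDA paper's description, but the function itself is an illustrative sketch, not the official implementation:

```python
import torch

def wanda_prune(W: torch.Tensor, X: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """WANDA-style pruning sketch for a single linear layer.

    W: (out_features, in_features) weight matrix.
    X: (n_samples, in_features) calibration activations feeding this layer.
    Scores each weight as |w| * ||x_j||_2 and zeroes the lowest-scoring
    fraction within each output row.
    """
    act_norm = X.norm(p=2, dim=0)       # L2 norm of each input feature
    scores = W.abs() * act_norm         # broadcasts across output rows
    k = int(sparsity * W.shape[1])      # number of weights to drop per row
    # Indices of the k lowest-scoring weights in each row.
    prune_idx = scores.topk(k, dim=1, largest=False).indices
    mask = torch.ones_like(W, dtype=torch.bool)
    mask.scatter_(1, prune_idx, False)
    return W * mask
```

Because the score depends on calibration activations, a small sample of representative inputs is all WANDA needs; no gradient updates or retraining passes are involved.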
Experimental Validation
To test their hypotheses, the researchers conducted experiments on two transformer models: GPT-2 Small and Gemma-2-2B. They applied WANDA pruning with 50% sparsity to key parts of these models, then compared three SAE variants: the pre-trained SAE from the original model, an SAE trained from scratch on the pruned model, and a pruned version of the pre-trained SAE.
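The third variant, pruning the pre-trained SAE, could plausibly look like the sketch below, which reuses the JumpReLUSAE and wanda_prune definitions above. The calibration data and the decision to prune encoder and decoder separately are assumptions on our part; the paper may use a different recipe:

```python
import torch

# Assumption: apply WANDA-style scoring to the SAE's own weights, using
# activations from the pruned model as the calibration signal.
sae = JumpReLUSAE()
acts = torch.randn(1024, 768)  # stand-in for activations from the pruned model

with torch.no_grad():
    # Encoder: rows of W_enc.T are SAE latents; score against model activations.
    sae.W_enc.data = wanda_prune(sae.W_enc.data.T, acts, sparsity=0.25).T
    _, z = sae(acts)
    # Decoder: score against the (now sparser) latent activations.
    sae.W_dec.data = wanda_prune(sae.W_dec.data.T, z, sparsity=0.25).T
```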
Evaluation was performed using SAEBench, a comprehensive suite that assesses various aspects of SAE performance, including concept detection, interpretability, reconstruction fidelity, and feature disentanglement. For GPT-2, they observed that pruned SAEs maintained reconstruction performance comparable to fully trained SAEs, even at high sparsity levels. For Gemma-2-2B, the results were even more striking: the pruned SAEs (specifically at 25% sparsity) consistently matched or even outperformed the SAEs trained from scratch on the pruned model across metrics like feature absorption, reconstruction quality, spurious correlation removal (SCR), targeted probe perturbation (TPP), and semantic disentanglement (RAVEL).
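As a rough illustration of the reconstruction-fidelity side of that evaluation, a standard metric is the fraction of the activations' variance that the SAE's reconstruction explains (SAEBench's exact metric definitions may differ):

```python
import torch

def fraction_variance_explained(x: torch.Tensor, x_hat: torch.Tensor) -> float:
    """Illustrative reconstruction-fidelity metric: 1 - SSE / total variance."""
    resid = (x - x_hat).pow(2).sum()              # squared reconstruction error
    total = (x - x.mean(dim=0)).pow(2).sum()      # total variance of activations
    return (1.0 - resid / total).item()
```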
Conclusion
This research provides strong evidence for the transferability and efficiency of Sparse Autoencoders in the context of compressed language models. The ability to reuse or simply prune existing SAEs to interpret compressed models offers a practical, computationally efficient alternative to expensive retraining. This work is a meaningful step toward making model interpretability more scalable and accessible, especially as LLMs continue to grow in size and complexity. For more details, see the full paper, “On the transferability of Sparse Autoencoders for interpreting compressed models.”


