TLDR: LieQ is a new post-training quantization framework that significantly compresses small language models (under 7 billion parameters) to very low bit-widths (2-3 bits) while maintaining high accuracy. It achieves this by using three layer-wise diagnostic metrics—Perplexity Drop, Representational Compactness, and Top-k Energy Gain—to identify and protect critical layers with higher precision, allowing less critical layers to be more aggressively compressed. This approach enables efficient deployment of these models on resource-constrained edge devices, outperforming existing methods in accuracy and hardware friendliness.
Large language models, or LLMs, have transformed many areas of natural language processing. However, their size, often billions of parameters, makes them demanding in both memory and compute. While large models might fit on powerful workstation GPUs, even moderately sized LLMs (those with 7 billion parameters or fewer) can exceed the memory of common edge devices like smartphones or single-board computers, which typically have 4-12 GB of memory; at 16-bit precision, a 7-billion-parameter model needs about 14 GB for its weights alone. This creates a significant barrier to deploying advanced AI directly on devices, especially for applications like robotics that require low-power models.
To overcome this “memory wall,” aggressive compression is essential. One promising method is Post-Training Quantization (PTQ), which reduces the precision of model weights and activations to lower-bit representations (e.g., 1-8 bits) without requiring extensive retraining. PTQ is effective at moderate bit-widths, but it often causes a severe drop in accuracy when models are compressed to ultra-low bit-widths like 2 or 3 bits. The problem is even more pronounced in smaller models, which have less inherent redundancy to absorb the noise introduced by quantization.
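To make the accuracy problem concrete, here is a minimal sketch of plain round-to-nearest uniform quantization applied to random Gaussian "weights". It is a generic illustration, not LieQ or any method named in this article; the function names and the synthetic data are assumptions chosen for the example.

```python
import random

def quantize_rtn(weights, bits):
    """Uniform round-to-nearest quantization of floats onto 2**bits levels."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]  # integers in [0, levels]
    dequant = [lo + c * scale for c in codes]           # values the model would use
    return codes, dequant

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]  # synthetic stand-in weights
errors = {bits: mse(weights, quantize_rtn(weights, bits)[1]) for bits in (8, 4, 3, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit MSE: {err:.6f}")
```

The reconstruction error grows sharply as the bit-width shrinks, which is the noise a small model has little redundancy to absorb.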
Existing PTQ methods have clear limitations. Some rely on heuristics for bit allocation, while others apply the same bit budget to every layer, which ignores differences in how sensitive each layer is. Finer-grained methods can achieve better accuracy but often introduce irregular data formats that hinder hardware efficiency. This raises three questions: how to achieve structured PTQ that preserves accuracy and keeps a regular weight layout, how to quantitatively evaluate each layer to guide compression, and how to ensure hardware efficiency under extreme low-bit PTQ.
Researchers have introduced a new framework called LieQ (Layer-wise Information Effectiveness Quantization) to address these challenges. LieQ is a metric-driven PTQ framework designed to maintain accuracy in sub-7B models even under extreme low-bit compression. It introduces three complementary layer-wise diagnostics to understand how important each layer is:
Perplexity Drop
This metric directly measures how much the model’s predictive performance drops when a specific Transformer layer is effectively removed. It quantifies the unique information contributed by each layer.
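A minimal sketch of this layer-ablation idea follows. The `eval_nlls` callable is hypothetical: it stands in for running a real model with one layer bypassed and returning per-token negative log-likelihoods. The sign convention (perplexity without the layer minus baseline perplexity) is an assumption, and the paper may normalize the score differently.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def perplexity_drop(eval_nlls, num_layers):
    """Score each layer by how much perplexity worsens when it is skipped.

    eval_nlls(skip) is a hypothetical callable: it returns per-token NLLs with
    layer index `skip` bypassed (skip=None evaluates the full model).
    """
    base = perplexity(eval_nlls(None))
    return [perplexity(eval_nlls(i)) - base for i in range(num_layers)]

# Toy stand-in for a real model: skipping layer i adds layer_gain[i] to every NLL.
layer_gain = [0.9, 0.1, 0.05, 0.4]

def toy_eval_nlls(skip):
    lost = 0.0 if skip is None else layer_gain[skip]
    return [2.0 + lost, 2.2 + lost, 1.8 + lost]

scores = perplexity_drop(toy_eval_nlls, num_layers=len(layer_gain))
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)  # layer indices, most important first
```

In this toy setup, the layers whose removal hurts most rise to the top of the ranking.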
Representational Compactness
Inspired by geometric analysis, this diagnostic assesses how well information is organized within a layer’s representations after training. It compares the spectral properties of trained projections against randomly initialized ones, indicating how concentrated and sensitive the information in a layer has become due to training.
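This article does not give the exact formula, so the sketch below uses one assumed proxy: the entropy of a matrix's normalized squared singular values, compared between a trained projection and a randomly initialized one. The function names and the low-rank "trained" stand-in are illustrative, not taken from the paper.

```python
import numpy as np

def spectral_entropy(matrix):
    """Shannon entropy of the normalized squared singular values."""
    s = np.linalg.svd(matrix, compute_uv=False)
    p = s**2 / np.sum(s**2)
    p = p[p > 0]  # drop exact zeros so log() is defined
    return float(-np.sum(p * np.log(p)))

def compactness(trained, random_init):
    """Assumed proxy: entropy reduction relative to a random baseline."""
    return spectral_entropy(random_init) - spectral_entropy(trained)

rng = np.random.default_rng(0)
d = 64
random_init = rng.normal(size=(d, d)) / np.sqrt(d)
# Stand-in for a trained projection: a rank-4 signal plus small noise.
low_rank = rng.normal(size=(d, 4)) @ rng.normal(size=(4, d))
trained = low_rank + 0.05 * rng.normal(size=(d, d))

score = compactness(trained, random_init)
print(f"compactness proxy: {score:.3f}")
```

A larger value means the spectrum has concentrated into fewer directions than it would at random initialization.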
Top-k Energy Gain
While compactness looks at overall distribution, this metric focuses on how much “energy” (or variance) is captured by the most dominant components within a layer. A higher concentration indicates more structured, task-relevant information.
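As a rough sketch, the energy captured by the dominant components can be computed as the share of squared singular values held by the top k. Measuring the "gain" against a baseline matrix, as done here, is an assumption based on the metric's name; the function names and toy matrices are illustrative.

```python
import numpy as np

def topk_energy(matrix, k):
    """Fraction of total squared singular-value energy in the k largest components."""
    s = np.linalg.svd(matrix, compute_uv=False)  # returned in descending order
    energy = s**2
    return float(energy[:k].sum() / energy.sum())

def topk_energy_gain(trained, baseline, k):
    """Assumed definition: top-k energy of the trained matrix minus the baseline's."""
    return topk_energy(trained, k) - topk_energy(baseline, k)

trained = np.diag([3.0, 2.0, 1.0, 1.0])  # toy matrix with one dominant direction
baseline = np.eye(4)                     # toy baseline with a flat spectrum
gain = topk_energy_gain(trained, baseline, k=1)
print(f"top-1 energy gain: {gain:.2f}")
```

In the toy case the dominant direction holds 9/15 = 0.6 of the energy, compared with 0.25 for the flat baseline.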
These three diagnostics are combined into a unified “layer-wise information effectiveness score.” LieQ then uses this score to dynamically allocate bit-widths. It identifies the most sensitive layers and assigns them higher precision (e.g., 4-bit), while the remaining, less sensitive layers are quantized to a lower precision (e.g., 2-bit). This approach ensures that critical information is protected while maximizing compression.
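The allocation step can be sketched as follows. How the three diagnostics are actually combined is not stated here, so the equal-weight mean of min-max normalized metrics is an assumption, as are the function names and the toy metric values.

```python
def normalize(values):
    """Min-max normalize a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

def effectiveness(ppl_drop, compactness, energy_gain):
    """Assumed combination: equal-weight mean of the normalized metrics."""
    cols = [normalize(m) for m in (ppl_drop, compactness, energy_gain)]
    return [sum(vals) / len(cols) for vals in zip(*cols)]

def allocate_bits(scores, num_protected, high_bits=4, low_bits=2):
    """Give the highest-scoring layers high_bits and every other layer low_bits."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    protected = set(order[:num_protected])
    return [high_bits if i in protected else low_bits for i in range(len(scores))]

def average_bits(bits, layer_sizes=None):
    """Parameter-weighted mean bit-width (equal layer sizes if none are given)."""
    sizes = layer_sizes or [1] * len(bits)
    return sum(b * n for b, n in zip(bits, sizes)) / sum(sizes)

# Toy per-layer metrics for a hypothetical 4-layer model.
scores = effectiveness(
    ppl_drop=[5.0, 0.5, 0.2, 1.0],
    compactness=[0.9, 0.2, 0.1, 0.4],
    energy_gain=[0.3, 0.1, 0.05, 0.2],
)
bits = allocate_bits(scores, num_protected=1)
print(bits, average_bits(bits))
```

Because each layer keeps a single bit-width, the average stays close to the low setting when few layers are protected: with one 4-bit layer among 36 equally sized layers, the mean is (35 × 2 + 4) / 36 ≈ 2.06 bits.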
A significant advantage of LieQ is its ability to achieve near-lossless accuracy at extreme compression levels. For instance, on the Qwen3-4B model, LieQ recovered 95.9% of the original FP16 baseline performance at 2.05-bit quantization, outperforming other methods like GPTQ by 19.7% and AWQ by 18.1% on average across various reasoning tasks. When applied to LLaMA3.2-3B, LieQ maintained 98.2% of baseline accuracy at 2.07-bit precision, enabling a 4x memory reduction.
Furthermore, LieQ is designed to be hardware-friendly. By maintaining a uniform bit-width within each layer, it allows weight tensors to be packed contiguously, which enables efficient processing on GPUs using standard kernels. This avoids the irregular memory layouts and kernel fragmentation that can occur with more fine-grained mixed-precision approaches, preserving GPU tensor-core throughput.
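To illustrate why a uniform bit-width per layer packs cleanly, here is a small sketch that stores four 2-bit codes in each byte. The bit order and function names are assumptions for the example; real GPU kernels define their own packed layouts.

```python
def pack_2bit(codes):
    """Pack 2-bit codes (ints in 0..3) four per byte, lowest bits first."""
    if len(codes) % 4:
        raise ValueError("length must be a multiple of 4")
    if any(c < 0 or c > 3 for c in codes):
        raise ValueError("codes must be in 0..3")
    out = bytearray()
    for i in range(0, len(codes), 4):
        a, b, c, d = codes[i:i + 4]
        out.append(a | (b << 2) | (c << 4) | (d << 6))
    return bytes(out)

def unpack_2bit(packed):
    """Recover the 2-bit codes in their original order."""
    return [(byte >> shift) & 0b11 for byte in packed for shift in (0, 2, 4, 6)]

codes = [0, 1, 2, 3, 3, 2, 1, 0]
packed = pack_2bit(codes)
print(packed.hex(), unpack_2bit(packed) == codes)
```

Every weight in the layer sits at a fixed offset in one contiguous buffer, so no per-weight metadata is needed to locate or decode it.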
In essence, LieQ provides a principled way to compress small language models, transforming memory constraints from fundamental barriers into manageable engineering challenges. This advancement paves the way for wider deployment of powerful AI on resource-constrained edge devices.


