TLDR: GradLite is a novel optimizer that enables memory-efficient full fine-tuning of large language models (LLMs) by tolerating approximate gradients. It uses low-rank Jacobian approximation to reduce memory for backpropagation and error-feedback correction to maintain convergence. This approach reduces memory consumption by up to 50% and achieves competitive or superior performance on benchmarks compared to existing methods, without requiring architectural changes.
Training large language models (LLMs) to their full potential often hits a major roadblock: memory. Traditional optimization methods like SGD or Adam demand precise gradients, which means storing vast amounts of intermediate data during the training process. For models with billions of parameters, this quickly overwhelms even powerful GPUs, forcing developers to resort to complex distributed setups or memory-computation trade-offs.
Existing solutions typically tackle this challenge by altering the model’s architecture, such as using reversible networks, or by implementing system-level tricks like activation checkpointing, ZeRO, and FSDP. While effective, these methods often introduce their own problems, including increased computational overhead and communication bottlenecks. Parameter-efficient fine-tuning (PEFT) techniques like LoRA take a different route, but at the cost of some of the model’s expressive power.
A new research paper introduces a fresh perspective: what if the optimizer itself could be made more flexible? Researchers from Sun Yat-sen University, Jing Yang, Kaitong Cai, Yijia Fan, Yufeng Yang, and Keze Wang, propose GradLite, a “backward-friendly” optimizer that rethinks the fundamental assumption of needing exact gradients. This innovation allows for efficient LLM training even when intermediate data is aggressively discarded or approximated, significantly easing memory constraints.
GradLite achieves its remarkable efficiency through two core techniques. First is low-rank Jacobian approximation. Imagine the complex error signals that need to be backpropagated through the network. Instead of processing them in their full, memory-intensive form, GradLite approximates these signals by projecting them onto a much smaller, low-dimensional space. This drastically reduces the memory needed for backpropagation.
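To make the projection idea concrete, here is a minimal NumPy sketch. It is not GradLite's implementation: the article does not say how the low-rank basis is built, so the sketch assumes a truncated SVD of a synthetic signal matrix, and the names `low_rank_project`, `basis`, and `coeffs` are hypothetical.

```python
import numpy as np

def low_rank_project(signal, rank):
    """Return (basis, coeffs) such that basis @ coeffs approximates signal."""
    # Assumption: take the top-`rank` left singular vectors as the basis.
    # The paper's actual basis construction is not described in this article.
    u, _, _ = np.linalg.svd(signal, full_matrices=False)
    basis = u[:, :rank]              # shape (m, rank), orthonormal columns
    coeffs = basis.T @ signal        # shape (rank, n), the compressed form
    return basis, coeffs

rng = np.random.default_rng(0)

# Synthetic stand-in for a backpropagated signal: nearly rank 4, plus small noise.
m, n, true_rank = 256, 128, 4
signal = rng.standard_normal((m, true_rank)) @ rng.standard_normal((true_rank, n))
signal += 0.01 * rng.standard_normal((m, n))

basis, coeffs = low_rank_project(signal, rank=8)
approx = basis @ coeffs

# How much was lost, and how much has to be stored compared with the full matrix.
rel_error = np.linalg.norm(signal - approx) / np.linalg.norm(signal)
stored_fraction = (basis.size + coeffs.size) / signal.size

print(f"relative error:  {rel_error:.4f}")
print(f"stored fraction: {stored_fraction:.3f}")
```

With these shapes, the basis and coefficients together hold under a tenth of the entries in the full matrix. The saving depends entirely on the signal being close to low rank, which is the assumption the technique relies on.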
The second technique is error-feedback correction. Approximating gradients can introduce inaccuracies that hinder convergence. GradLite addresses this by maintaining an accumulator that tracks and compensates for these approximation errors across training iterations. This feedback loop ensures that information lost in one step is carried into a later update, which, according to the authors, keeps gradient estimates unbiased and convergence stable, much as with standard optimizers like Adam.
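The accumulator described above can be sketched with the generic error-feedback pattern used in compressed-gradient methods. This is an illustration under stated assumptions, not the paper's algorithm: the compressor is a projection onto a fresh random orthonormal basis at each step, the gradients are random vectors, and the names `compress`, `error`, and `sent_total` are hypothetical.

```python
import numpy as np

def compress(vec, basis):
    """Lossy compression: project vec onto the column span of an orthonormal basis."""
    return basis @ (basis.T @ vec)

rng = np.random.default_rng(0)
dim, rank, steps = 64, 8, 200

error = np.zeros(dim)        # accumulator for what compression dropped so far
sent_total = np.zeros(dim)   # sum of the approximate gradients actually applied
true_total = np.zeros(dim)   # sum of the exact gradients, for comparison only

for _ in range(steps):
    grad = rng.standard_normal(dim)      # stand-in for an exact gradient

    # Assumption: a new random rank-8 basis each step (reduced QR gives
    # orthonormal columns). The real method's basis choice may differ.
    basis, _ = np.linalg.qr(rng.standard_normal((dim, rank)))

    corrected = grad + error             # add back what earlier steps lost
    approx = compress(corrected, basis)  # what the optimizer would apply
    error = corrected - approx           # carry the new residual forward

    sent_total += approx
    true_total += grad

# The applied updates plus the current residual equal the exact gradient sum.
gap = np.linalg.norm(true_total - (sent_total + error))
print(f"bookkeeping gap: {gap:.2e}")
print(f"residual norm:   {np.linalg.norm(error):.2f}")
```

The point of the bookkeeping check is that nothing is permanently discarded: whatever a step's compression drops sits in `error` and is re-offered to the next step. Whether this yields the convergence guarantees claimed for GradLite depends on the paper's own analysis, which this sketch does not reproduce.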
The theoretical analysis supporting GradLite demonstrates that it maintains unbiased gradient estimates with bounded variance, leading to convergence rates comparable to Adam. This means developers can achieve significant memory savings without compromising the model’s ability to learn effectively.
Empirical evaluations using the Qwen1.5-MoE-A2.7B model on a single NVIDIA H800 GPU showcased GradLite’s impressive capabilities. When fine-tuning on the databricks-dolly-15k instruction-following dataset, GradLite reduced peak VRAM usage by up to 50% compared to traditional full fine-tuning with activation checkpointing. It also outperformed other optimizer-centric baselines like LoMo and GaLore in memory efficiency.
Crucially, these efficiency gains did not come at the expense of performance. GradLite achieved on-par or even superior results on various downstream benchmarks, including MMLU (Massive Multitask Language Understanding), GSM8K (math word problems), multilingual evaluation, and MT-Bench (dialogue). For instance, GradLite scored 66.8% on MMLU and 75.3% on GSM8K, surpassing the SFT (Supervised Fine-Tuning) baseline with checkpointing.
An ablation study further highlighted the importance of GradLite’s components. Disabling the error-feedback mechanism led to a significant performance drop, underscoring its critical role in correcting approximation bias. Similarly, using a static random projection instead of an adaptive low-rank basis also degraded performance, suggesting that the projection subspace must adapt to the gradient manifold for optimal results.
In conclusion, GradLite offers a compelling new direction for training large language models under memory constraints. By intelligently approximating gradients and correcting for errors, it enables full-parameter fine-tuning with substantial memory savings and competitive performance, without requiring complex architectural changes or multi-GPU setups. This work opens up new possibilities for making advanced LLM training more accessible and efficient.


