TLDR: This paper introduces a novel mixed precision training framework for Neural Ordinary Differential Equations (Neural ODEs), addressing the challenges of computational cost and memory growth. The framework uses low precision for network evaluations and intermediate states, while maintaining stability through high-precision accumulation and a dynamic adjoint scaling scheme. It demonstrates significant memory reduction (up to 50%) and speedups (up to 2x) across various learning tasks, including image classification and generative models, without sacrificing accuracy. An open-source PyTorch package, rampde, is also released.
Deep learning models are continuously growing in size and complexity, leading to ever-increasing computational demands. To tackle these challenges, a common strategy known as mixed precision training (MPT) has emerged. MPT involves performing some computations in lower precision (e.g., 16-bit floating point) while retaining higher precision (e.g., 32-bit floating point) for critical operations, thereby reducing computational costs and memory usage.
However, applying mixed precision training to continuous-time architectures like Neural Ordinary Differential Equations (Neural ODEs) has proven unreliable. Neural ODEs define neural networks as the solution to an ordinary differential equation, meaning their forward pass involves numerically solving an initial value problem. Naively using low precision throughout can lead to an accumulation of roundoff errors and instabilities, especially as the number of time steps or layers increases.
A new research paper, “Mixed Precision Training of Neural ODEs,” by Elena Celledoni, Brynjulf Owren, Lars Ruthotto, and Nicole Tianjiao Yang, introduces a robust mixed precision training framework specifically designed for Neural ODEs. This framework addresses the unique challenges posed by these continuous-time models, making MPT a viable and effective strategy for their training.
The core of their approach lies in a carefully designed mixed precision scheme. It uses low-precision computations for evaluating the neural network’s velocity function and for storing intermediate states, which is where the bulk of the computational savings comes from. To ensure stability and accuracy, the solution and gradients are accumulated in higher precision, and the network weights are stored in higher precision as well. This hybrid approach balances efficiency with numerical robustness.
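To make the split between low-precision evaluation and high-precision accumulation concrete, here is a minimal sketch in plain NumPy. It is not the paper's implementation and does not use rampde: the test problem dy/dt = -y, the forward Euler integrator, the step count, and the function name euler_decay are all assumptions chosen for illustration.

```python
import math
import numpy as np

def euler_decay(n_steps, acc_dtype, eval_dtype=np.float16, t_end=1.0):
    """Forward Euler for dy/dt = -y with y(0) = 1 (illustrative toy problem).

    eval_dtype: precision used to evaluate the velocity f(y) = -y.
    acc_dtype:  precision used to store and accumulate the state y.
    """
    h = acc_dtype(t_end / n_steps)
    y = acc_dtype(1.0)
    for _ in range(n_steps):
        v = -eval_dtype(y)                  # velocity evaluated in low precision
        y = acc_dtype(y + h * acc_dtype(v)) # state update in the accumulation precision
    return float(y)

exact = math.exp(-1.0)
n = 4096
# Everything in float16: each update is too small to change the stored state.
err_all_low = abs(euler_decay(n, acc_dtype=np.float16) - exact)
# float16 velocity, float32 accumulation: the small updates are retained.
err_mixed = abs(euler_decay(n, acc_dtype=np.float32) - exact)

print(f"all-float16 error:        {err_all_low:.3e}")
print(f"float16 eval, float32 acc: {err_mixed:.3e}")
```

In this toy setting the all-float16 run stalls because each Euler update is no larger than half the float16 spacing near 1, so it is rounded away, while the float32 accumulator keeps those updates even though the velocity is still evaluated in float16.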
A key innovation presented in the paper is a custom backpropagation scheme that incorporates a dynamic adjoint scaling mechanism. This adaptive heuristic maximizes the usable range of the low-precision format during backpropagation, preventing the underflow that can plague float16, and it does so without requiring extensive hyperparameter tuning. The researchers also provide a theoretical analysis demonstrating that roundoff errors remain within acceptable bounds and do not grow uncontrollably with the number of time steps, a crucial property for Neural ODEs.
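The idea behind scaling can be sketched with a generic power-of-two heuristic. This is an illustration of why scaling prevents float16 underflow, not the paper's actual adjoint scaling algorithm; the helper names choose_scale, to_fp16_scaled, and from_fp16_scaled, the headroom factor, and the sample values are all hypothetical.

```python
import math
import numpy as np

FP16_MAX = float(np.finfo(np.float16).max)

def choose_scale(x, headroom=4.0):
    """Power-of-two scale that places max|x| roughly `headroom` below the float16 maximum."""
    peak = float(np.max(np.abs(x)))
    if peak == 0.0 or not math.isfinite(peak):
        return 1.0
    exponent = math.floor(math.log2(FP16_MAX / (headroom * peak)))
    exponent = max(-126, min(126, exponent))  # keep the scale representable in float32
    return 2.0 ** exponent

def to_fp16_scaled(x):
    """Return (float16 copy of scale * x, scale); the scale is kept in high precision."""
    scale = choose_scale(x)
    return (x * np.float32(scale)).astype(np.float16), scale

def from_fp16_scaled(x16, scale):
    """Undo the scaling in float32."""
    return x16.astype(np.float32) / np.float32(scale)

# Hypothetical small adjoint values, far below float16's smallest subnormal.
adjoint = np.array([1e-9, 3e-9, -2e-10], dtype=np.float32)

naive = adjoint.astype(np.float16)          # direct cast: values flush to zero
scaled16, scale = to_fp16_scaled(adjoint)   # scaled cast: values stay representable
recovered = from_fp16_scaled(scaled16, scale)

print("naive float16 cast:", naive)
print("scale used:", scale)
print("recovered values:  ", recovered)
```

Recomputing the scale at each backward step from the current magnitude of the adjoint is one simple way to make such a scheme dynamic; using powers of two means that applying and removing the scale introduces no additional rounding error of its own.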
To facilitate adoption and experimentation, the authors have released an extendable, open-source PyTorch package called rampde. This package is designed to be a drop-in replacement for existing Neural ODE implementations, with a syntax similar to leading packages like torchdiffeq, making it easy for developers to integrate into their current projects.
The effectiveness of this new framework was demonstrated across a range of learning tasks. In experiments with Continuous Normalizing Flows (CNFs), the mixed precision approach achieved sample quality and validation losses comparable to those of single-precision training, with significant memory reductions. For Optimal Transport Flows (OT-Flows) on higher-dimensional datasets, the framework delivered substantial memory savings (up to 10 times) and modest speedups.
Perhaps the most compelling results came from the STL-10 image classification task, a large-scale problem. Here, the mixed precision scheme achieved approximately 50% memory reduction and up to a 2x speedup in training time, all while maintaining accuracy comparable to single-precision training. This highlights the framework’s potential to significantly improve the scalability and efficiency of Neural ODEs for complex applications.
In summary, this research provides a practical and theoretically sound solution for training Neural ODEs with mixed precision. By carefully managing precision levels and introducing dynamic scaling, the authors have overcome previous limitations, enabling faster and more memory-efficient training without compromising model performance. This advancement is particularly beneficial for large-scale problems where computational resources are a limiting factor, paving the way for broader adoption of Neural ODEs in deep learning.


