TLDR: NinA (Normalizing Flows in Action) is a new method for Vision-Language-Action (VLA) models that replaces traditional diffusion-based action decoders with Normalizing Flows. This change enables one-shot action sampling, leading to significantly faster inference times (up to 10x faster) and fewer parameters, while maintaining comparable performance to state-of-the-art diffusion models on the LIBERO benchmark. NinA offers a more efficient and practical solution for high-frequency robotic control.
Recent advances in robotics have brought us closer to general-purpose robots, largely thanks to Vision-Language-Action (VLA) models. These models let robots interpret visual observations and task descriptions and translate them into physical actions. Traditionally, a key component of these VLA systems, the action decoder, has relied heavily on diffusion models. While effective at modeling complex action distributions, diffusion models typically require many iterative denoising steps to generate a single action, which slows the robot's response time, a critical limitation for real-world applications demanding quick, precise movements.
Enter NinA, short for “Normalizing Flows in Action.” This innovative approach offers a compelling alternative to the slower diffusion-based decoders. NinA replaces these iterative models with Normalizing Flows (NFs), a type of generative model that can produce actions in a single, direct step. This fundamental difference dramatically reduces the time it takes for a robot to decide and execute an action, making it much more practical for high-frequency control scenarios.
Understanding Normalizing Flows
At its core, a Normalizing Flow works by transforming a simple, well-understood probability distribution (like a standard bell curve) into a more complex one, which can accurately represent the intricate patterns of robot actions. The magic lies in a sequence of invertible transformations. Imagine stretching and bending a simple shape into a highly detailed sculpture; NFs do something similar with data distributions. Because these transformations are invertible, they allow for efficient, one-shot sampling – meaning an action can be generated directly without the need for repeated refinement steps.
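To make the idea concrete, here is a minimal NumPy sketch of one common flow construction, an affine coupling layer. This is an illustration of the general technique, not NinA's actual architecture: the random linear "conditioners" stand in for learned networks. Each layer rescales and shifts half of the vector conditioned on the other half, which makes the transform trivially invertible, and sampling an action is a single forward pass through the stack, with no iterative refinement.

```python
import numpy as np

rng = np.random.default_rng(0)

class AffineCoupling:
    """One invertible layer: the second half of the vector is rescaled
    and shifted conditioned on the (untouched) first half."""
    def __init__(self, dim):
        self.half = dim // 2
        rest = dim - self.half
        # Tiny random linear maps standing in for learned conditioner networks.
        self.w_s = 0.1 * rng.standard_normal((rest, self.half))
        self.w_t = 0.1 * rng.standard_normal((rest, self.half))

    def forward(self, z):   # noise -> action direction (sampling)
        z1, z2 = z[:self.half], z[self.half:]
        s, t = self.w_s @ z1, self.w_t @ z1
        return np.concatenate([z1, z2 * np.exp(s) + t])

    def inverse(self, x):   # action -> noise direction (training / likelihood)
        x1, x2 = x[:self.half], x[self.half:]
        s, t = self.w_s @ x1, self.w_t @ x1
        return np.concatenate([x1, (x2 - t) * np.exp(-s)])

# A flow is a stack of such layers (real flows also permute which half
# gets transformed between layers); sampling is one forward pass.
layers = [AffineCoupling(4) for _ in range(3)]
z = rng.standard_normal(4)            # one draw from the simple base distribution
action = z
for layer in layers:
    action = layer.forward(action)    # one-shot: no repeated refinement steps

# Invertibility check: mapping back recovers the base sample exactly.
recovered = action
for layer in reversed(layers):
    recovered = layer.inverse(recovered)
```

Because the forward pass is just a few matrix multiplies per layer, the cost of drawing an action is fixed and small, which is the source of the inference-speed advantage over a multi-step diffusion sampler.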
NinA’s Integration and Performance
The researchers integrated NinA into an existing VLA architecture called FLOWER and tested it on the LIBERO benchmark, a standard suite of tasks for evaluating robot learning. The results were encouraging: NinA matched the success rates of its diffusion-based counterparts, but its real advantage was efficiency. NinA achieved substantially faster inference, up to 10 times quicker in some configurations, while requiring significantly fewer parameters. For instance, a NinA Transformer model that was 8.7 times smaller than a large diffusion model ran 7 times faster on an RTX 3060 GPU with only a marginal drop in performance.
The study explored two main architectural variants for NinA: an MLP-based (Multi-Layer Perceptron) model and a Transformer-based model. The MLP variant proved to be extremely compact and fast, while the Transformer variant offered a balance of strong performance and scalability. The team also investigated various design choices, such as the depth of the flow layers, the internal complexity of the networks, and the impact of adding a small amount of “noise” during training, finding that moderate noise injection acted as a beneficial regularizer.
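The noise-injection finding can be illustrated with a self-contained toy example. This is our own sketch, not the paper's training code: a one-dimensional affine flow x = mu + exp(log_s) * z is fit by gradient descent on the exact negative log-likelihood, with a small amount of Gaussian noise added to the target actions at each step; the dataset, learning rate, and noise scale are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D "actions"; the flow here is a single affine map x = mu + exp(log_s) * z.
actions = rng.normal(2.0, 0.5, size=2048)
mu, log_s = 0.0, 0.0
lr, noise_scale = 0.1, 0.02   # illustrative hyperparameters (assumptions)

for step in range(500):
    # Noise injection: lightly perturb the targets before the likelihood update,
    # which smooths the empirical action distribution the flow must fit.
    batch = actions + noise_scale * rng.standard_normal(actions.shape)
    z = (batch - mu) * np.exp(-log_s)      # inverse pass back to the base space
    # Gradients of the mean negative log-likelihood 0.5*z**2 + log_s (+ const):
    grad_mu = -np.mean(z) * np.exp(-log_s)
    grad_log_s = 1.0 - np.mean(z ** 2)
    mu -= lr * grad_mu
    log_s -= lr * grad_log_s

print(mu, np.exp(log_s))   # recovers roughly the data mean and spread
```

Because flows are trained by exact maximum likelihood, the effect of such design choices (flow depth, network width, noise scale) shows up directly in the training objective, which is what made the ablations in the study straightforward to run.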
The Future of Efficient Robotics
The introduction of NinA marks a significant step towards more efficient and responsive robotic systems. By leveraging the power of Normalizing Flows, robots can now execute actions with greater speed without compromising their ability to perform complex tasks. This efficiency is crucial for real-world deployment, where latency and computational resources are often constrained. Beyond just speed, Normalizing Flows also offer benefits like exact likelihood estimation, which could be valuable for future advancements in reinforcement learning, understanding uncertainty in robot actions, and making robot decisions more interpretable.
The researchers envision future work scaling NinA to even broader datasets and different robot platforms, further solidifying its role as a promising foundation for the next generation of general-purpose robotic control.


