TLDR: This research systematically evaluates Vision-Language-Action (VLA) models across edge and datacenter GPUs, analyzing accuracy, memory, latency, and throughput under varying power budgets. It finds that architectural choices significantly impact performance, that power-constrained edge devices can, surprisingly, match or exceed older datacenter GPUs, and that high throughput can be achieved with minimal accuracy loss. The study provides critical insights for optimizing VLA deployment in diverse robotic systems.
Vision-Language-Action (VLA) models are rapidly becoming the go-to solution for controlling robots, allowing them to understand instructions, perceive their surroundings, and perform actions directly from visual and linguistic inputs. These models promise a future where a single, versatile AI can manage various robotic tasks, moving away from the need for specialized models for every job. However, a significant challenge remains: how do these powerful models perform across different types of hardware, from small, power-efficient edge devices to high-performance cloud data centers, and what are their power requirements?
A recent research paper, titled “Cross-Platform Scaling of Vision-Language-Action Models from Edge to Cloud GPUs,” delves into this critical question. The study, conducted by researchers Amir Taherin, Juyi Lin, Arash Akbari, Arman Akbari, Pu Zhao, Weiwei Chen, David Kaeli, and Yanzhi Wang, provides a comprehensive evaluation of VLA models, shedding light on their performance characteristics and resource demands across a spectrum of computing environments.
The Challenge of Scaling Robotic AI
Despite rapid advancements in VLA algorithms, there’s been a limited understanding of how their performance scales with different model designs, hardware classes, and power budgets. Most previous studies focused on improving accuracy or adapting existing vision-language architectures, often evaluating them on a single hardware platform. This gap in knowledge makes it difficult for engineers to make informed decisions when deploying VLAs in real-world robotic systems, where factors like latency, throughput, memory usage, and energy consumption are crucial.
For instance, embedded platforms like service robots or autonomous drones require a careful balance of model size, speed, and accuracy due to strict computational and power constraints. In contrast, cloud-hosted environments prioritize maximizing throughput while managing memory usage to control operational costs. Without a clear picture of how VLAs perform across this “edge-to-cloud” spectrum, there’s a risk of inefficient hardware provisioning and suboptimal performance trade-offs.
Evaluating VLA Models and Hardware
The researchers evaluated five representative VLA models, including three established baselines (OpenVLA, SpatialVLA, and OpenVLA-OFT) and two newly proposed architectures: VOTE and QwenVLA. VOTE aims to reduce inference latency by generating fewer action tokens without sacrificing accuracy, while QwenVLA tests whether a smaller language backbone (Qwen 2.5-1.5B) can still deliver competitive performance.
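To see why emitting fewer action tokens lowers latency, note that autoregressive decoding costs roughly one LLM forward pass per generated token. The sketch below is a back-of-the-envelope illustration, not the paper's actual decoding code; the chunk size `K = 8` and the single-token VOTE-style scheme are illustrative assumptions:

```python
def forward_passes(tokens_generated: int) -> int:
    """Approximate cost model: one LLM forward pass per autoregressively
    generated token."""
    return tokens_generated

# OpenVLA-style decoding: 7 discrete tokens per 7-DoF action -> 7 passes per action
openvla_passes_per_action = forward_passes(7)

# VOTE-style decoding: a single token, then a lightweight head maps it to a
# chunk of K actions in parallel (K = 8 is an illustrative assumption)
K = 8
vote_passes_per_action = forward_passes(1) / K

speedup = openvla_passes_per_action / vote_passes_per_action
print(speedup)  # 56.0 -> far fewer sequential passes per executed action
```

Under this simple cost model, the savings compound: fewer tokens per chunk and more actions per chunk both cut the number of sequential forward passes per executed action.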
To understand performance across different computing environments, the study used two main hardware categories:
- **Edge Computing Platform:** The NVIDIA Jetson AGX Orin, a system-on-chip (SoC) designed for power-efficient AI workloads. It supports multiple power modes (15W, 30W, 50W, and MAX), allowing for an exploration of performance-energy trade-offs.
- **Datacenter GPU Platforms:** Four discrete NVIDIA GPUs representing different generations and performance tiers: H100 (Hopper), A100 (Ampere), A6000 (Ampere), and V100 (Volta). These offer significantly higher compute throughput and dedicated high-bandwidth memory compared to edge devices.
The models were benchmarked using the LIBERO benchmark, which includes a diverse set of robotic manipulation tasks. Key metrics measured were accuracy (success rate), peak memory usage, latency (time to generate an action chunk), and throughput (actions generated per second).
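A generic harness for the latency and throughput metrics might look like the sketch below. This is not the paper's benchmarking code; the `policy` callable, warmup counts, and the use of the median are all conventional choices assumed here for illustration:

```python
import time
import statistics

def benchmark_policy(policy, obs, n_warmup=5, n_iters=50, actions_per_chunk=1):
    """Measure per-chunk latency (s) and action throughput (Hz) for a policy
    callable. A generic harness sketch, not the paper's actual setup."""
    for _ in range(n_warmup):
        policy(obs)  # warm up caches / lazy initialization
    latencies = []
    for _ in range(n_iters):
        t0 = time.perf_counter()
        policy(obs)
        latencies.append(time.perf_counter() - t0)
    median_latency = statistics.median(latencies)
    throughput_hz = actions_per_chunk / median_latency
    return median_latency, throughput_hz

# Usage with a stand-in policy that sleeps ~1 ms per inference call
lat, hz = benchmark_policy(lambda obs: time.sleep(0.001), obs=None,
                           actions_per_chunk=8)
```

Reporting the median rather than the mean makes the latency figure robust to occasional scheduler or garbage-collection spikes.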
Key Findings and Insights
The study revealed several important scaling trends:
First, **VLA Model Accuracy:** The newly developed VOTE variants, particularly VOTE-1T, achieved the highest average success rates on the LIBERO benchmark (96.9%), outperforming existing baselines. QwenVLA, despite its smaller size, also showed competitive accuracy, surpassing the larger OpenVLA baseline in average success rate (78.8% vs. 76.5%). This suggests that efficient designs can achieve strong performance.
Second, **Peak Memory Usage:** Memory consumption was primarily influenced by the size of the model’s language backbone and the choice of vision encoder. QwenVLA, with its smaller 1.5B backbone, had the lowest memory footprint (7.39 GB). Action head variations had a negligible impact on memory.
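The backbone-dominated memory pattern follows from simple arithmetic on parameter counts. The sketch below estimates weight-only memory at half precision; the parameter counts are approximate and the estimate deliberately ignores activations, KV cache, and the vision encoder, which explains why measured peaks (e.g., QwenVLA's 7.39 GB) sit well above the weight-only figure:

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Rough weight-only memory estimate (fp16/bf16 = 2 bytes per parameter).
    Ignores activations, KV cache, vision encoder, and framework overhead."""
    return n_params * bytes_per_param / 1024**3

# Illustrative backbone sizes (approximate parameter counts, not paper figures)
qwen_1_5b = weight_memory_gb(1.5e9)  # ~2.8 GB of weights
llama_7b = weight_memory_gb(7e9)     # ~13.0 GB of weights
```

Since the language backbone carries the bulk of the parameters, shrinking it from ~7B to ~1.5B cuts the dominant memory term by roughly 4-5x, while swapping action heads moves almost nothing.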
Third, **Latency and Throughput:** As expected, datacenter GPUs like the H100 delivered significantly lower latencies and higher throughputs compared to the edge-based Orin. However, the study uncovered a crucial insight: modern high-end edge devices can sometimes outperform older datacenter hardware. For example, the Orin in MAX power mode, running VOTE-MLP4, achieved a throughput of 55.57 Hz, surpassing the V100 datacenter GPU’s 32.28 Hz. This challenges the assumption that any datacenter GPU will automatically exceed edge performance.
Furthermore, architectural choices, such as action tokenization and model backbone size, strongly influenced throughput and memory footprint. Power-constrained edge devices showed non-linear performance degradation, meaning performance drops sharply as power budgets decrease, especially for more computationally intensive models. However, the study also found that high-throughput variants could be achieved without significant loss in accuracy, particularly with optimized architectures like VOTE.
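One way to compare platforms across power budgets is actions per joule, i.e., throughput divided by power draw. The throughput numbers below are the Orin MAX and V100 figures reported above, but the power values are nominal board/TDP ratings assumed for illustration, not measurements from the paper:

```python
def actions_per_joule(throughput_hz: float, power_w: float) -> float:
    """Energy efficiency: actions generated per joule consumed."""
    return throughput_hz / power_w

# Throughputs from the reported VOTE-MLP4 results; power values are
# nominal ratings (assumptions), not measured draw from the study
orin_max = actions_per_joule(55.57, 60.0)   # Jetson AGX Orin, MAX mode (~60 W)
v100 = actions_per_joule(32.28, 250.0)      # V100 at its ~250 W TDP
print(orin_max > v100)  # True
```

Under these nominal power assumptions, the edge device comes out several times more energy-efficient per action, which is consistent with the paper's point that edge hardware can be the better provisioning choice for some VLA workloads.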
Actionable Guidance for Robotic AI Deployment
These findings offer valuable guidance for engineers and developers working with robotic AI. When selecting and optimizing VLA models, it’s crucial to consider the interplay between architectural choices, hardware capabilities, and power constraints. Architectures optimized for chunked decoding, like VOTE-2T and VOTE-MLP4, consistently delivered the highest throughput across different hardware, often with only minor trade-offs in accuracy. Models with smaller backbones, such as QwenVLA, provide a good balance of low memory usage and competitive accuracy.
The research highlights that the “edge-to-cloud” performance landscape is more nuanced than previously thought. Modern, high-end edge devices are increasingly capable, potentially offering performance that rivals or even exceeds older datacenter GPUs for specific VLA workloads. This suggests that careful optimization and hardware selection can lead to highly efficient and responsive robotic systems, even in resource-constrained environments.


