TLDR: This research systematically evaluates Vision-Language-Action (VLA) models across edge and datacenter GPUs, analyzing accuracy, memory, latency, and throughput under varying power budgets. It finds that architectural choices significantly impact performance, that power-constrained edge devices can, surprisingly, match or exceed older datacenter GPUs, and that high throughput can be achieved with minimal accuracy loss. The study provides critical insights for optimizing VLA deployment in diverse robotic systems.
Vision-Language-Action (VLA) models are rapidly becoming the go-to solution for controlling robots, allowing them to understand instructions, perceive their surroundings, and perform actions directly from visual and linguistic inputs. These models promise a future where a single, versatile AI can manage various robotic tasks, moving away from the need for specialized models for every job. However, a significant challenge remains: how do these powerful models perform across different types of hardware, from small, power-efficient edge devices to high-performance cloud data centers, and what are their power requirements?
A recent research paper, titled “Cross-Platform Scaling of Vision-Language-Action Models from Edge to Cloud GPUs,” delves into this critical question. The study, conducted by researchers Amir Taherin, Juyi Lin, Arash Akbari, Arman Akbari, Pu Zhao, Weiwei Chen, David Kaeli, and Yanzhi Wang, provides a comprehensive evaluation of VLA models, shedding light on their performance characteristics and resource demands across a spectrum of computing environments.
The Challenge of Scaling Robotic AI
Despite rapid advancements in VLA algorithms, there’s been a limited understanding of how their performance scales with different model designs, hardware classes, and power budgets. Most previous studies focused on improving accuracy or adapting existing vision-language architectures, often evaluating them on a single hardware platform. This gap in knowledge makes it difficult for engineers to make informed decisions when deploying VLAs in real-world robotic systems, where factors like latency, throughput, memory usage, and energy consumption are crucial.
For instance, embedded platforms like service robots or autonomous drones require a careful balance of model size, speed, and accuracy due to strict computational and power constraints. In contrast, cloud-hosted environments prioritize maximizing throughput while managing memory usage to control operational costs. Without a clear picture of how VLAs perform across this “edge-to-cloud” spectrum, there’s a risk of inefficient hardware provisioning and suboptimal performance trade-offs.
Evaluating VLA Models and Hardware
The researchers evaluated five representative VLA models, including three established baselines (OpenVLA, SpatialVLA, and OpenVLA-OFT) and two newly proposed architectures: VOTE and QwenVLA. VOTE aims to reduce inference latency by generating fewer action tokens without sacrificing accuracy, while QwenVLA tests whether a smaller language backbone (Qwen 2.5-1.5B) can still deliver competitive performance.
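To see why emitting fewer action tokens lowers latency, note that autoregressive decoding costs roughly one LLM forward pass per generated token. The sketch below is a back-of-the-envelope illustration, not the paper's actual decoding code; the chunk size `K = 8` and the single-token VOTE-style scheme are illustrative assumptions:

```python
def forward_passes(tokens_generated: int) -> int:
    """Approximate cost model: one LLM forward pass per autoregressively
    generated token."""
    return tokens_generated

# OpenVLA-style decoding: 7 discrete tokens per 7-DoF action -> 7 passes per action
openvla_passes_per_action = forward_passes(7)

# VOTE-style decoding: a single token, then a lightweight head maps it to a
# chunk of K actions in parallel (K = 8 is an illustrative assumption)
K = 8
vote_passes_per_action = forward_passes(1) / K

speedup = openvla_passes_per_action / vote_passes_per_action
print(speedup)  # 56.0 -> far fewer sequential passes per executed action
```

Under this simple cost model, the savings compound: fewer tokens per chunk and more actions per chunk both cut the number of sequential forward passes per executed action.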
To understand performance across different computing environments, the study used two main hardware categories:
- **Edge Computing Platform:** The NVIDIA Jetson AGX Orin, a system-on-chip (SoC) designed for power-efficient AI workloads. It supports multiple power modes (15W, 30W, 50W, and MAX), allowing for an exploration of performance-energy trade-offs.
- **Datacenter GPU Platforms:** Four discrete NVIDIA GPUs representing different generations and performance tiers: H100 (Hopper), A100 (Ampere), A6000 (Ampere), and V100 (Volta). These offer significantly higher compute throughput and dedicated high-bandwidth memory compared to edge devices.
The models were benchmarked using the LIBERO benchmark, which includes a diverse set of robotic manipulation tasks. Key metrics measured were accuracy (success rate), peak memory usage, latency (time to generate an action chunk), and throughput (actions generated per second).
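A generic harness for the latency and throughput metrics might look like the sketch below. This is not the paper's benchmarking code; the `policy` callable, warmup counts, and the use of the median are all conventional choices assumed here for illustration:

```python
import time
import statistics

def benchmark_policy(policy, obs, n_warmup=5, n_iters=50, actions_per_chunk=1):
    """Measure per-chunk latency (s) and action throughput (Hz) for a policy
    callable. A generic harness sketch, not the paper's actual setup."""
    for _ in range(n_warmup):
        policy(obs)  # warm up caches / lazy initialization
    latencies = []
    for _ in range(n_iters):
        t0 = time.perf_counter()
        policy(obs)
        latencies.append(time.perf_counter() - t0)
    median_latency = statistics.median(latencies)
    throughput_hz = actions_per_chunk / median_latency
    return median_latency, throughput_hz

# Usage with a stand-in policy that sleeps ~1 ms per inference call
lat, hz = benchmark_policy(lambda obs: time.sleep(0.001), obs=None,
                           actions_per_chunk=8)
```

Reporting the median rather than the mean makes the latency figure robust to occasional scheduler or garbage-collection spikes.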
Key Findings and Insights
The study revealed several important scaling trends:
First, **VLA Model Accuracy:** The newly developed VOTE variants, particularly VOTE-1T, achieved the highest average success rates on the LIBERO benchmark (96.9%), outperforming existing baselines. QwenVLA, despite its smaller size, also showed competitive accuracy, surpassing the larger OpenVLA baseline in average success rate (78.8% vs. 76.5%). This suggests that efficient designs can achieve strong performance.
Second, **Peak Memory Usage:** Memory consumption was primarily influenced by the size of the model’s language backbone and the choice of vision encoder. QwenVLA, with its smaller 1.5B backbone, had the lowest memory footprint (7.39 GB). Action head variations had a negligible impact on memory.
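The backbone-dominated memory pattern follows from simple arithmetic on parameter counts. The sketch below estimates weight-only memory at half precision; the parameter counts are approximate and the estimate deliberately ignores activations, KV cache, and the vision encoder, which explains why measured peaks (e.g., QwenVLA's 7.39 GB) sit well above the weight-only figure:

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Rough weight-only memory estimate (fp16/bf16 = 2 bytes per parameter).
    Ignores activations, KV cache, vision encoder, and framework overhead."""
    return n_params * bytes_per_param / 1024**3

# Illustrative backbone sizes (approximate parameter counts, not paper figures)
qwen_1_5b = weight_memory_gb(1.5e9)  # ~2.8 GB of weights
llama_7b = weight_memory_gb(7e9)     # ~13.0 GB of weights
```

Since the language backbone carries the bulk of the parameters, shrinking it from ~7B to ~1.5B cuts the dominant memory term by roughly 4-5x, while swapping action heads moves almost nothing.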
Third, **Latency and Throughput:** As expected, datacenter GPUs like the H100 delivered significantly lower latencies and higher throughputs compared to the edge-based Orin. However, the study uncovered a crucial insight: modern high-end edge devices can sometimes outperform older datacenter hardware. For example, the Orin in MAX power mode, running VOTE-MLP4, achieved a throughput of 55.57 Hz, surpassing the V100 datacenter GPU’s 32.28 Hz. This challenges the assumption that any datacenter GPU will automatically exceed edge performance.
Furthermore, architectural choices, such as action tokenization and model backbone size, strongly influenced throughput and memory footprint. Power-constrained edge devices showed non-linear performance degradation, meaning performance drops sharply as power budgets decrease, especially for more computationally intensive models. However, the study also found that high-throughput variants could be achieved without significant loss in accuracy, particularly with optimized architectures like VOTE.
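One way to compare platforms across power budgets is actions per joule, i.e., throughput divided by power draw. The throughput numbers below are the Orin MAX and V100 figures reported above, but the power values are nominal board/TDP ratings assumed for illustration, not measurements from the paper:

```python
def actions_per_joule(throughput_hz: float, power_w: float) -> float:
    """Energy efficiency: actions generated per joule consumed."""
    return throughput_hz / power_w

# Throughputs from the reported VOTE-MLP4 results; power values are
# nominal ratings (assumptions), not measured draw from the study
orin_max = actions_per_joule(55.57, 60.0)   # Jetson AGX Orin, MAX mode (~60 W)
v100 = actions_per_joule(32.28, 250.0)      # V100 at its ~250 W TDP
print(orin_max > v100)  # True
```

Under these nominal power assumptions, the edge device comes out several times more energy-efficient per action, which is consistent with the paper's point that edge hardware can be the better provisioning choice for some VLA workloads.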
Actionable Guidance for Robotic AI Deployment
These findings offer valuable guidance for engineers and developers working with robotic AI. When selecting and optimizing VLA models, it’s crucial to consider the interplay between architectural choices, hardware capabilities, and power constraints. Architectures optimized for chunked decoding, like VOTE-2T and VOTE-MLP4, consistently delivered the highest throughput across different hardware, often with only minor trade-offs in accuracy. Models with smaller backbones, such as QwenVLA, provide a good balance of low memory usage and competitive accuracy.
The research highlights that the “edge-to-cloud” performance landscape is more nuanced than previously thought. Modern, high-end edge devices are increasingly capable, potentially offering performance that rivals or even exceeds older datacenter GPUs for specific VLA workloads. This suggests that careful optimization and hardware selection can lead to highly efficient and responsive robotic systems, even in resource-constrained environments.


