
Optimizing Deep Learning Convolutions for Energy-Efficient CPUs in Embedded Systems

TLDR: This research benchmarks state-of-the-art deep learning convolution algorithms on various energy-constrained CPUs from ARM, Intel, AMD, Apple, and Nvidia. It evaluates performance based on latency, instantaneous power, and total energy consumption, using novel high-resolution power measurement techniques. Key findings include the inaccuracy of MSR-based power measurements, the superior energy efficiency of Winograd and GEMM algorithms, and the identification of the Nvidia AGX Orin as offering the best trade-off between inference speed and power consumption for full ResNet50v1.5 inference. The study provides practical guidance for energy-aware embedded AI deployments.

Deep learning models, especially convolutional neural networks (CNNs), are everywhere in modern embedded vision systems. These networks are crucial for tasks like image classification and object detection. However, the core operation within CNNs, called convolution, is computationally intensive and demands significant energy. While much research has focused on optimizing these operations for powerful GPUs and NPUs, the performance on CPUs, particularly those found in energy-constrained embedded devices, has received less attention.

A recent study addresses this gap by systematically benchmarking state-of-the-art convolution algorithms on embedded CPUs from major vendors: ARM, Intel, AMD, Apple, and Nvidia. The goal was to provide practical guidance for deploying deep learning models efficiently in energy-sensitive environments, where battery life and thermal management are critical.

Understanding the Algorithms and Hardware

The researchers evaluated four main convolution implementations: the ‘direct’ method, two ‘GEMM-based’ approaches (explicit ‘im2row’ and implicit ‘gemm’ lowering), and the ‘Winograd’ algorithm. GEMM-based methods convert convolutions into matrix multiplications, which are highly optimized on modern CPUs. The Winograd algorithm, on the other hand, reduces the number of floating-point multiplications, potentially saving computational effort.
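To make the difference between the direct and the explicit GEMM-based approach concrete, here is a minimal NumPy sketch. It is not the oneDNN implementation: the function names, the (H, W, C) data layout, and the restriction to stride 1 with no padding are assumptions chosen to keep the example short. The direct version loops over output pixels, while the im2row version first copies every input patch into one row of a matrix and then performs a single matrix multiplication.

```python
import numpy as np

def conv2d_direct(x, w):
    """Direct convolution (cross-correlation, as in deep learning frameworks).
    x: (H, W, C_in), w: (KH, KW, C_in, C_out); stride 1, no padding."""
    H, W, _ = x.shape
    KH, KW, _, C_out = w.shape
    OH, OW = H - KH + 1, W - KW + 1
    out = np.zeros((OH, OW, C_out), dtype=x.dtype)
    for i in range(OH):
        for j in range(OW):
            patch = x[i:i + KH, j:j + KW, :]            # (KH, KW, C_in)
            # Contract the patch with the filter over KH, KW and C_in.
            out[i, j, :] = np.tensordot(patch, w, axes=3)
    return out

def conv2d_im2row(x, w):
    """Explicit lowering: build a patch matrix, then call one GEMM."""
    H, W, C_in = x.shape
    KH, KW, _, C_out = w.shape
    OH, OW = H - KH + 1, W - KW + 1
    rows = np.empty((OH * OW, KH * KW * C_in), dtype=x.dtype)
    for i in range(OH):
        for j in range(OW):
            # Each output pixel gets one row holding its flattened patch.
            rows[i * OW + j, :] = x[i:i + KH, j:j + KW, :].ravel()
    out = rows @ w.reshape(KH * KW * C_in, C_out)       # the GEMM
    return out.reshape(OH, OW, C_out)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 3))
w = rng.standard_normal((3, 3, 3, 4))
same = np.allclose(conv2d_direct(x, w), conv2d_im2row(x, w))
```

The patch matrix duplicates overlapping input values, which costs extra memory and data movement. An implicit GEMM lowering avoids materializing that matrix and gathers the patches on the fly instead.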

The study used Intel’s oneDNN library to implement these algorithms and tested them on a range of modern CPUs. These included Nvidia’s AGX Xavier and AGX Orin, AMD’s Ryzen 7 7840U and Ryzen AI 9 HX 370, and Intel’s Core Ultra 9 185H. Some of these architectures feature heterogeneous CPUs, meaning they combine powerful ‘p-cores’ with more energy-efficient ‘e-cores’ or ‘LPe-cores’. Performance was measured not just by speed (latency) but also by instantaneous power consumption and total energy usage, both of which are vital for embedded systems.

Key Findings on Power Measurement and Core Utilization

One significant finding concerned how power consumption is measured. The study found that readings from Model Specific Registers (MSRs), a common way to estimate CPU power, significantly underestimated the total power drawn at the socket. In idle states, the MSR-based readings were more than 50% lower than the socket measurements, and during computation they were still 10% to 30% lower. This highlights the importance of accurate, socket-level measurements for a true understanding of energy consumption in embedded devices.
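As a rough illustration of how counter-based power estimates are usually derived, the sketch below converts two readings of a cumulative energy counter into average power and then computes how far that estimate falls below an external socket measurement. The function names and all numeric values are hypothetical, and the study's own measurement setup is not reproduced here. On Linux, such counters are typically exposed in microjoules through the powercap interface (for example `/sys/class/powercap/intel-rapl:0/energy_uj`), but the code takes the readings as plain arguments so it runs anywhere.

```python
def average_power_w(e_start_uj, e_end_uj, dt_s, max_range_uj):
    """Average power in watts from two cumulative energy readings (microjoules)
    taken dt_s seconds apart. Assumes at most one counter wraparound."""
    delta_uj = e_end_uj - e_start_uj
    if delta_uj < 0:
        # The counter wrapped around between the two reads.
        delta_uj += max_range_uj
    return delta_uj / 1e6 / dt_s

def underestimation_pct(counter_w, socket_w):
    """How much lower the counter-based estimate is, relative to the socket."""
    return 100.0 * (socket_w - counter_w) / socket_w

# Hypothetical readings: 5 J consumed over 2 s according to the counter.
counter_power = average_power_w(1_000_000, 6_000_000, 2.0, 10_000_000)
# Hypothetical socket measurement of 5 W over the same interval.
gap = underestimation_pct(counter_power, 5.0)
```

With these made-up numbers the counter reports 2.5 W against 5 W at the socket, a 50% gap. The point is only that the two quantities measure different things: the counter covers part of the package, while the socket reading covers the whole board.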

On core utilization, the research revealed that increasing the number of physical cores generally reduced the total energy consumed by a convolution, because adding cores decreases latency more than it increases instantaneous power. Surprisingly, ‘e-cores’ and ‘LPe-cores’ on heterogeneous architectures consumed more energy than ‘p-cores’ for the same task. Energy is the product of average power and run time, and these cores are so much slower that the longer run time outweighs their lower power draw.
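The arithmetic behind both observations is simply energy = average power × latency. The short sketch below works through it with entirely hypothetical power and latency figures (they are not measurements from the study) to show how a configuration that draws more power can still use less energy, and how a low-power core can end up using more.

```python
def energy_j(avg_power_w, latency_s):
    """Total energy in joules for one run: average power times run time."""
    return avg_power_w * latency_s

# Hypothetical (power in W, latency in s) pairs, for illustration only.
configs = {
    "1 p-core":  (6.0, 0.100),
    "4 p-cores": (12.0, 0.030),   # 2x the power, but more than 3x faster
    "1 e-core":  (3.0, 0.300),    # half the power, but 3x slower
}

energy = {name: energy_j(p, t) for name, (p, t) in configs.items()}
most_efficient = min(energy, key=energy.get)
```

In this made-up case, four p-cores need about 0.36 J, one p-core 0.6 J, and one e-core 0.9 J, so the fastest configuration is also the most energy-efficient.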

Among the algorithms, the Winograd and GEMM-based approaches proved to be the most energy-efficient, primarily because they are the fastest across all tested architectures. The best configurations for energy efficiency used all ‘p-cores’ on the Nvidia AGX Orin or the AMD Ryzen AI 9 HX 370, combined with either the Winograd or GEMM algorithms.
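The multiplication savings behind Winograd's speed can be seen in its smallest textbook case, F(2,3), which produces two outputs of a 3-tap 1D filter using 4 multiplications instead of 6. The sketch below is an illustrative pure-Python version with hypothetical function names. Production implementations work on 2D tiles with matrix transforms, and this is not the code benchmarked in the study.

```python
def direct_f23(d, g):
    """Two outputs of a 3-tap filter over 4 inputs: 6 multiplications."""
    return (d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2])

def winograd_f23(d, g):
    """Same two outputs using 4 multiplications (Winograd F(2,3))."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter transform: depends only on the weights, so it can be
    # computed once and reused for every input tile.
    u0, u1, u2, u3 = g0, (g0 + g1 + g2) / 2, (g0 - g1 + g2) / 2, g2
    # The four element-wise multiplications.
    m1 = (d0 - d2) * u0
    m2 = (d1 + d2) * u1
    m3 = (d2 - d1) * u2
    m4 = (d1 - d3) * u3
    # Output transform: additions and subtractions only.
    return (m1 + m2 + m3, m2 - m3 - m4)

d = (1.0, 2.0, 3.0, 4.0)
g = (2.0, 4.0, 6.0)
y_direct = direct_f23(d, g)
y_winograd = winograd_f23(d, g)
```

The saving comes at the price of extra additions and transform steps, which is one reason the benefit seen on an isolated convolution does not always carry over to a whole network.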

Performance in Full Inference Scenarios

Moving beyond individual convolution operations, the study also evaluated performance during a full inference run of the ResNet50v1.5 network. Here, a trade-off between inference latency and instantaneous power consumption became evident. While Winograd showed advantages in isolated convolution computations, the ‘gemm’ implementation often performed better in full inference due to its efficiency in managing data movements.

The Nvidia Jetson AGX Orin emerged as the architecture offering the best balance between inference speed and instantaneous power consumption. Where a higher instantaneous power budget is acceptable, the CPU in the AMD-based Mercury EM780 mini PC could achieve even faster inference times. The Intel-based AtomMan X7 Ti, however, showed suboptimal performance across all its CPU core types.
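One common way to reason about such a latency-versus-power trade-off is to keep only the configurations that no other configuration beats on both axes at once, known as the Pareto front. The sketch below is a generic illustration with hypothetical board names and numbers. It is not the selection procedure used in the study, and the values are not its results.

```python
def pareto_front(points):
    """points: dict name -> (latency_ms, power_w); lower is better on both.
    Returns the sorted names of configurations not dominated by any other."""
    front = []
    for name, (lat, pw) in points.items():
        dominated = any(
            (l2 <= lat and p2 <= pw) and (l2 < lat or p2 < pw)
            for other, (l2, p2) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)

# Hypothetical candidates, for illustration only.
candidates = {
    "board_a": (40.0, 15.0),   # balanced
    "board_b": (25.0, 35.0),   # fastest, but power-hungry
    "board_c": (60.0, 30.0),   # slower and hungrier than board_a
}
front = pareto_front(candidates)
```

Here `board_c` drops out because `board_a` is both faster and lower-power. The choice between the two remaining boards then depends on the power budget of the deployment.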


Implications for Embedded AI Deployment

This research provides crucial insights for developers and engineers working on energy-constrained AI applications. By offering a detailed, cross-vendor benchmark using accurate socket-level energy measurements, the study helps guide the selection of appropriate CPUs and convolution algorithms for embedded systems. It underscores that a holistic evaluation considering latency, power, and energy jointly is essential for realistic deployments. For more in-depth technical details, you can refer to the full paper available at arXiv:2509.26217.

Meera Iyer
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist in a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
