Optimizing Deep Learning Convolutions for Energy-Efficient CPUs in Embedded Systems

TLDR: This research benchmarks state-of-the-art deep learning convolution algorithms on various energy-constrained CPUs from ARM, Intel, AMD, Apple, and Nvidia. It evaluates performance based on latency, instantaneous power, and total energy consumption, using novel high-resolution power measurement techniques. Key findings include the inaccuracy of MSR-based power measurements, the superior energy efficiency of Winograd and GEMM algorithms, and the identification of the Nvidia AGX Orin as offering the best trade-off between inference speed and power consumption for full ResNet50v1.5 inference. The study provides practical guidance for energy-aware embedded AI deployments.

Deep learning models, especially convolutional neural networks (CNNs), are everywhere in modern embedded vision systems. These networks are crucial for tasks like image classification and object detection. However, the core operation within CNNs, called convolution, is computationally intensive and demands significant energy. While much research has focused on optimizing these operations for powerful GPUs and NPUs, the performance on CPUs, particularly those found in energy-constrained embedded devices, has received less attention.

A recent study addresses this gap by systematically benchmarking state-of-the-art convolution algorithms on embedded CPUs from major vendors: ARM, Intel, AMD, Apple, and Nvidia. The goal was to provide practical guidance for deploying deep learning models efficiently in energy-sensitive environments, where factors like battery life and thermal management are critical.

Understanding the Algorithms and Hardware

The researchers evaluated four main convolution implementations: the ‘direct’ method, two ‘GEMM-based’ approaches (explicit ‘im2row’ and implicit ‘gemm’ lowering), and the ‘Winograd’ algorithm. GEMM-based methods convert convolutions into matrix multiplications, which are highly optimized on modern CPUs. The Winograd algorithm, on the other hand, reduces the number of floating-point multiplications, potentially saving computational effort.
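To make the difference between these approaches concrete, here is a minimal sketch in pure Python. It is a toy, single-channel illustration under simplifying assumptions (stride 1, no padding, CNN-style cross-correlation), not the study's code or oneDNN's implementation; all function names are made up for this example.

```python
# Toy single-channel example; production libraries use blocked, vectorized
# kernels. All function names here are illustrative assumptions.

def conv2d_direct(image, kernel):
    """Direct method: slide the kernel over the image (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def im2row(image, kh, kw):
    """Explicit lowering: copy each kh x kw patch into one row of a matrix."""
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [image[i + di][j + dj] for di in range(kh) for dj in range(kw)]
        for i in range(out_h) for j in range(out_w)
    ]


def conv2d_im2row(image, kernel):
    """GEMM-based method: patch matrix multiplied by the flattened kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_w = len(image[0]) - kw + 1
    flat_kernel = [w for row in kernel for w in row]
    rows = im2row(image, kh, kw)
    flat_out = [sum(a * b for a, b in zip(r, flat_kernel)) for r in rows]
    return [flat_out[k:k + out_w] for k in range(0, len(flat_out), out_w)]


def winograd_f2_3(d, g):
    """Winograd F(2,3): two outputs of a 3-tap 1D filter from 4 inputs.

    Uses 4 data-dependent multiplications instead of 6; the filter-side
    factors can be precomputed once per filter.
    """
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]
```

The explicit ‘im2row’ variant materializes the patch matrix in memory, while an implicit lowering would compute the same products without storing that matrix. The Winograd function shows the 1D building block; 2D variants nest the same transforms.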

The study used Intel’s oneDNN framework to implement these algorithms and tested them on a range of modern CPUs. These included Nvidia’s AGX Xavier and AGX Orin, AMD’s Ryzen 7 7840U and Ryzen AI 9 HX 370, and Intel’s Core Ultra 9 185H. Some of these architectures feature heterogeneous CPUs, meaning they combine powerful ‘p-cores’ with more energy-efficient ‘e-cores’ or ‘LPe-cores’. Performance was measured not just by speed (latency) but also by instantaneous power consumption and total energy usage, which are vital for embedded systems.

Key Findings on Power Measurement and Core Utilization

One significant discovery concerned how power consumption is measured. The study found that readings from Model Specific Registers (MSRs), a common way to estimate CPU power, significantly underestimated the total power drawn at the socket. When idle, MSR-based readings were more than 50% lower than the socket measurements, and during computation they were still 10% to 30% lower. This highlights the importance of accurate, socket-level measurements for a true understanding of energy consumption in embedded devices.
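For readers who want to see what a counter-based estimate looks like in practice, the sketch below derives average power from two reads of a cumulative energy counter. It assumes a Linux machine that exposes RAPL counters through the powercap sysfs interface; the path is typical but machine-dependent and may require elevated privileges. The study's own measurement tooling is not described here, and the helper names are illustrative.

```python
from pathlib import Path

# Assumption: Linux powercap RAPL interface for package 0. The exact path
# varies between machines and reading it may require root.
RAPL_DIR = Path("/sys/class/powercap/intel-rapl:0")


def read_energy_uj(rapl_dir=RAPL_DIR):
    """Read the cumulative package energy counter in microjoules."""
    return int((rapl_dir / "energy_uj").read_text())


def average_power_w(e_start_uj, e_end_uj, dt_s, max_range_uj):
    """Average power between two counter reads, handling one wraparound."""
    delta = e_end_uj - e_start_uj
    if delta < 0:  # counter wrapped past its maximum range
        delta += max_range_uj
    return delta / 1e6 / dt_s


def underestimate_pct(counter_w, socket_w):
    """How far a counter-based estimate falls below a socket measurement."""
    return 100.0 * (socket_w - counter_w) / socket_w
```

As a purely illustrative calculation, a counter-based estimate of 7 W against a socket reading of 10 W gives `underestimate_pct(7.0, 10.0)`, that is, 30%, the upper end of the gap reported during computation.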

When looking at core utilization, the research revealed that increasing the number of physical cores generally reduced the total energy consumed by a convolution. This is because adding cores decreases latency by more than it increases instantaneous power, and total energy is the product of the two. Surprisingly, ‘e-cores’ and ‘LPe-cores’ on heterogeneous architectures consumed more energy than ‘p-cores’ for the same task. This is attributed to their much slower processing speed: the extra time they need outweighs their lower power draw.
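The relationship between power, latency, and energy can be sketched in a few lines. The function below integrates sampled power over time with the trapezoidal rule; the numbers are hypothetical and chosen only to illustrate the effect, they are not measurements from the study.

```python
def energy_joules(times_s, power_w):
    """Integrate sampled power (W) over time (s) with the trapezoidal rule."""
    return sum(
        0.5 * (power_w[k] + power_w[k + 1]) * (times_s[k + 1] - times_s[k])
        for k in range(len(times_s) - 1)
    )


# Hypothetical values, NOT results from the study.
one_core = energy_joules([0.0, 0.100], [10.0, 10.0])    # 10 W for 100 ms
four_cores = energy_joules([0.0, 0.030], [18.0, 18.0])  # 18 W for 30 ms
```

In this made-up case, power rises by 80% but latency falls by 70%, so energy drops from about 1.0 J to about 0.54 J. The same arithmetic explains the e-core result: a core that draws less power can still use more energy if it takes proportionally longer.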

Among the algorithms, Winograd and the GEMM-based approaches proved the most energy-efficient, primarily because they were the fastest across all tested architectures. The best configurations for energy efficiency used all ‘p-cores’ on the Nvidia AGX Orin or the AMD Ryzen AI 9 HX 370, combined with either the Winograd or GEMM algorithms.

Performance in Full Inference Scenarios

Moving beyond individual convolution operations, the study also evaluated a full inference run of the ResNet50v1.5 network. Here, a trade-off between inference latency and instantaneous power consumption became evident. While Winograd showed advantages in isolated convolutions, the ‘gemm’ implementation often performed better in full inference because it handles data movement more efficiently.

The Nvidia Jetson AGX Orin emerged as the architecture offering the best balance between inference speed and instantaneous power consumption. For scenarios where a higher instantaneous power budget is acceptable, the AMD Mercury EM780’s CPU could achieve even faster inference times. However, the Intel AtomMan X7 Ti showed suboptimal performance across all its CPU core types.
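One common way to reason about such a trade-off is to keep only the configurations that are not beaten on both latency and power at once (a Pareto front). The sketch below is an illustration of that idea, not the selection method used in the study; the configuration names and values are placeholders.

```python
def pareto_front(points):
    """Return names of configurations not dominated in (latency, power).

    A configuration is dominated if another one is at least as good on
    both metrics and strictly better on at least one.
    """
    front = []
    for name, lat, pw in points:
        dominated = any(
            (l2 <= lat and p2 <= pw) and (l2 < lat or p2 < pw)
            for _, l2, p2 in points
        )
        if not dominated:
            front.append(name)
    return front


# Placeholder (name, latency in ms, power in W) tuples, NOT study results.
configs = [
    ("config_a", 40.0, 12.0),
    ("config_b", 25.0, 30.0),
    ("config_c", 60.0, 20.0),
]
best = pareto_front(configs)
```

Here `config_c` is dropped because `config_a` is both faster and lower-power, while `config_a` and `config_b` remain as two valid choices depending on the available power budget.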

Implications for Embedded AI Deployment

This research provides crucial insights for developers and engineers working on energy-constrained AI applications. By offering a detailed, cross-vendor benchmark using accurate socket-level energy measurements, the study helps guide the selection of appropriate CPUs and convolution algorithms for embedded systems. It underscores that a holistic evaluation considering latency, power, and energy jointly is essential for realistic deployments. For more in-depth technical details, you can refer to the full paper available at arXiv:2509.26217.
