TLDR: This research explores Split Learning (SL) for running deep learning models on ultra-low-power IoT devices. It benchmarks a MobileNetV2 model split across two ESP32-S3 boards, comparing wireless communication protocols (ESP-NOW, BLE, UDP, TCP). The study finds that splitting the model after ‘block_16_project_BN’ yields a small intermediate tensor (5.66 kB). ESP-NOW offers the lowest overall round-trip time (3.7s) due to minimal setup, while UDP provides the fastest transmission latency (1.4ms). BLE and TCP show higher latencies due to MTU limits and overheads, respectively. The paper provides empirical data for efficient TinyML + SL deployments.
In the rapidly evolving world of Artificial Intelligence, a significant challenge lies in deploying powerful deep learning models on tiny, resource-constrained devices like those found in the Internet of Things (IoT). These devices, often operating on minimal power, have tight memory and processing limitations that make direct execution of complex AI tasks difficult.
A promising solution to this challenge is Split Learning (SL), a technique where a deep learning model is divided into parts. The initial layers of the model run on the low-power sensor device, while the remaining, more computationally intensive layers are offloaded to a companion device, such as another microcontroller, a gateway, or a nearby edge server. This approach helps to preserve data privacy and reduce bandwidth usage by only exchanging intermediate data (activations) between devices.
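The data flow described above can be sketched in a few lines. This is a minimal, hypothetical illustration of split inference (the two "layers" are stand-ins, not the paper's MobileNetV2): device 1 computes the head of the model, only the intermediate activation crosses the link, and device 2 computes the tail.

```python
# Minimal sketch of split inference: "device 1" runs the head of the model
# and only the intermediate activation crosses the wireless link; "device 2"
# runs the remaining, heavier layers. Both functions are hypothetical
# stand-ins, purely to illustrate the data flow.

def device1_head(image):
    # Stand-in for the initial layers (e.g. feature extraction).
    return [2 * px for px in image]       # intermediate activation

def device2_tail(activation):
    # Stand-in for the computationally intensive tail (e.g. classification).
    return sum(activation)                # scalar "prediction"

image = [1, 2, 3, 4]
activation = device1_head(image)          # computed on the sensor node
# ... activation is serialized and sent over ESP-NOW / BLE / UDP / TCP ...
prediction = device2_tail(activation)     # computed on the companion node
print(prediction)  # 20
```

Note that the raw image never leaves device 1; only the (typically smaller) activation is transmitted, which is the source of both the privacy and the bandwidth benefit.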
Despite its potential, the performance of split learning, especially concerning the impact of low-power wireless communication protocols, has remained largely unexplored on constrained microcontrollers. To address this gap, a recent experimental study built the first end-to-end TinyML + SL testbed using Espressif ESP32-S3 boards. The goal was to benchmark the over-the-air performance of split learning TinyML in real-world edge/IoT environments.
The researchers utilized a MobileNetV2 image recognition model, a lightweight convolutional neural network, which was quantized to 8-bit integers to further reduce its size and computational demands. This model was then partitioned and delivered to the ESP32-S3 nodes using over-the-air updates. A crucial aspect of the study involved testing different wireless communication methods for exchanging the intermediate activations between the split model parts. These methods included ESP-NOW, Bluetooth Low Energy (BLE), and traditional UDP/IP and TCP/IP, allowing for a direct comparison on identical hardware.
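Eight-bit quantization of the kind mentioned above generally maps floating-point values to int8 through a scale and a zero-point. The sketch below shows that affine scheme in its simplest form; the actual scales and zero-points used for the paper's MobileNetV2 are not reported here, so these values are purely illustrative.

```python
# Sketch of affine int8 quantization (scale + zero-point), the general
# scheme behind 8-bit model compression. The scale and zero-point below
# are illustrative, not the values used in the study.

def quantize(values, scale, zero_point):
    # Map each float to a clamped signed 8-bit integer.
    return [max(-128, min(127, round(v / scale) + zero_point)) for v in values]

def dequantize(q, scale, zero_point):
    # Recover an approximation of the original floats.
    return [(x - zero_point) * scale for x in q]

weights = [-1.0, 0.0, 0.5, 1.0]
scale, zero_point = 1.0 / 127, 0          # symmetric range [-1, 1]
q = quantize(weights, scale, zero_point)
print(q)  # [-127, 0, 64, 127]
```

Storing int8 instead of float32 cuts the model and activation footprint by roughly 4x, which is what makes both on-device execution and small over-the-air activation transfers feasible.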
The study revealed several key insights into the performance of split learning. Measurements showed that splitting the MobileNetV2 model after the ‘block_16_project_BN’ layer was particularly effective. This split point generated a compact 5.66 kB tensor of intermediate data, which could traverse the wireless link very quickly. Over UDP, this transfer took only 3.2 ms, yielding a steady-state round-trip latency of 5.8 seconds, more than 20 times faster than sending the entire raw image to a remote server for full inference.
Among the communication protocols, ESP-NOW demonstrated the most favorable overall round-trip time (RTT), achieving 3.7 seconds. This is largely due to its minimal setup time (around 48 ms) and its peer-to-peer architecture, which bypasses the overhead of a full IP stack, making it well suited to low-latency applications with small data transfers. While ESP-NOW's small maximum payload (250 bytes per frame) can increase transmission delay for larger payloads, its overall RTT efficiency was superior.
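The 250-byte payload cap makes the fragmentation cost easy to estimate. The quick calculation below counts the ESP-NOW frames needed for the split-point activation; it assumes "5.66 kB" means 5660 bytes (a decimal-kilobyte interpretation, which the source does not spell out).

```python
import math

# How many ESP-NOW frames does the split-point activation need?
# ESP-NOW carries at most 250 bytes of payload per frame; the tensor size
# assumes 5.66 kB = 5660 bytes (decimal interpretation, an assumption).

ESP_NOW_MAX_PAYLOAD = 250   # bytes per ESP-NOW frame
tensor_bytes = 5660         # intermediate activation after block_16_project_BN

frames = math.ceil(tensor_bytes / ESP_NOW_MAX_PAYLOAD)
print(frames)  # 23
```

Roughly two dozen frames per inference is still small, which is why ESP-NOW's low setup cost dominates and it wins on overall RTT despite the tighter per-frame limit.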
UDP, a connectionless protocol, achieved the lowest transmission latency for intermediate activations (as low as 1.4 ms with 2 packets for the ‘Block 15 project layer’ split). However, it lacks built-in reliability. TCP, while offering reliable data delivery, incurred higher latency due to its inherent overheads like connection setup (a three-way handshake), acknowledgments, and retransmissions. Bluetooth Low Energy (BLE), despite its energy efficiency, significantly increased latency, stretching beyond 10 seconds. This was primarily attributed to its limited data rate and a smaller Maximum Transmission Unit (MTU) of 512 bytes, leading to more packet fragmentation and overhead during transmission.
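UDP's low latency comes precisely from what it omits: no handshake, no acknowledgments, no retransmission. The loopback sketch below shows an activation shipped as bare datagrams; it runs on a desktop Python socket stack, not the ESP32's lwIP stack, and the chunk size and payload are illustrative.

```python
import socket

# Sketch of shipping an activation over UDP: connectionless, no handshake,
# no delivery guarantee. Runs over loopback; the ESP32 equivalent would use
# lwIP sockets, which this does not attempt to mirror.

CHUNK = 1400                              # stay under a typical 1500-byte MTU
payload = bytes(range(256)) * 12          # 3072-byte dummy "activation"

rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))                 # OS picks a free port
rx.settimeout(2.0)                        # don't hang if a datagram is lost
addr = rx.getsockname()

tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for i in range(0, len(payload), CHUNK):
    tx.sendto(payload[i:i + CHUNK], addr)  # one datagram per chunk

received = b""
for _ in range(-(-len(payload) // CHUNK)):  # ceil(len / CHUNK) datagrams
    data, _ = rx.recvfrom(2048)
    received += data
tx.close(); rx.close()
print(len(received))  # 3072
```

On a lossy wireless link the receive loop above could stall or reassemble out of order, which is exactly the reliability gap the article notes; TCP closes that gap at the cost of handshake, ACK, and retransmission latency.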
The research also highlighted a pronounced compute imbalance: the first IoT device (Device 1), which runs the initial part of the model and interfaces with the camera, required far more inference time (3053.75 ms) than the second device (Device 2), which primarily performed classification (437.0 ms), roughly a sevenfold difference.
In conclusion, this experimental study provides valuable empirical evidence for implementing split learning on ultra-low-power edge/IoT nodes. It demonstrates that careful selection of the model split point and the communication protocol can drastically impact the end-to-end latency. The findings suggest that ESP-NOW is highly efficient for low-latency communication of small data transfers on ESP32 devices, while UDP offers the lowest transmission latency. The choice of protocol, however, should always consider trade-offs in reliability, data size, and energy consumption. This work lays a foundation for future dynamic and adaptive frameworks for split TinyML inference, aiming to optimize performance based on real-time network conditions and device resources. You can find more details about this research paper here: An Experimental Study of Split-Learning TinyML on Ultra-Low-Power Edge/IoT Nodes.


