International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 41
Published: September 2025
Authors: Dimitrios Papakyriakou, Ioannis S. Barbounakis
Dimitrios Papakyriakou and Ioannis S. Barbounakis. Deep Learning for Edge AI: MobileNetV2 CNN Training over ARM-based Clusters. International Journal of Computer Applications 187, 41 (September 2025), 43-57. DOI=10.5120/ijca2025925727
@article{10.5120/ijca2025925727,
  author    = {Dimitrios Papakyriakou and Ioannis S. Barbounakis},
  title     = {Deep Learning for Edge AI: MobileNetV2 CNN Training over ARM-based Clusters},
  journal   = {International Journal of Computer Applications},
  year      = {2025},
  volume    = {187},
  number    = {41},
  pages     = {43-57},
  doi       = {10.5120/ijca2025925727},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}
%0 Journal Article
%D 2025
%A Dimitrios Papakyriakou
%A Ioannis S. Barbounakis
%T Deep Learning for Edge AI: MobileNetV2 CNN Training over ARM-based Clusters
%J International Journal of Computer Applications
%V 187
%N 41
%P 43-57
%R 10.5120/ijca2025925727
%I Foundation of Computer Science (FCS), NY, USA
This paper presents a comprehensive investigation into the strong-scaling performance of distributed training for Convolutional Neural Networks (CNNs) using the MobileNetV2 architecture on a resource-constrained Beowulf cluster composed of 24 Raspberry Pi 4B nodes (8 GB RAM each). The training system employs the Message Passing Interface (MPI) via MPICH with synchronous data parallelism, running two processes per node across 2 to 48 total MPI processes. A fixed CIFAR-10 dataset was used, and all experiments were standardized to 10 epochs to maintain memory stability. The study jointly evaluates execution-time scaling, training/test accuracy, and convergence loss to assess both computational performance and learning quality under increasing parallelism. Training time decreased nearly ten-fold at cluster scale, reaching a maximum speedup of ≈9.99× with ≈41.6% parallel efficiency at 48 processes. Efficiency remained very high at small scale (≈90.9% at np=4) and moderate at np=8 (≈52.3%), confirming that MPI scaling itself is effective up to this range. However, while single-node and small-scale runs (up to 4–8 MPI processes) preserved strong generalization ability, larger scales suffered from sharply reduced per-rank dataset sizes, causing gradient noise and an eventual collapse of test accuracy to the random-guess baseline (10%). These results demonstrate that, although ARM-based Raspberry Pi clusters can support feasible small-scale distributed deep learning, strong scaling beyond an optimal process count leads to “fast but wrong” training, in which wall-clock performance improves but model utility on unseen data is lost. This work provides the first detailed end-to-end evaluation of MPI-based synchronous CNN training across an ARM-based edge cluster, and it outlines future research including comparative scaling with SqueezeNet and exploration of ultra-low-power Spiking Neural Networks (SNNs) for neuromorphic edge learning.
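The abstract does not include implementation details, but the setup it describes (synchronous data-parallel training over MPI, two processes per node, 10 epochs on CIFAR-10) can be illustrated with a minimal sketch. The snippet below assumes mpi4py and TensorFlow/Keras, a strided per-rank shard of CIFAR-10, SGD, and plain gradient averaging via allreduce; the script name, optimizer, and hyperparameters are illustrative assumptions, not the authors' actual code.

```python
# Minimal sketch (not the authors' implementation): synchronous data-parallel
# training of MobileNetV2 on CIFAR-10 with mpi4py + TensorFlow/Keras.
# Example launch (hypothetical script name): mpiexec -n 4 python train_mobilenetv2_mpi.py
from mpi4py import MPI
import tensorflow as tf

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank keeps only its own shard of the training set (strided split).
(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
x_shard = x_train[rank::size].astype("float32") / 255.0
y_shard = y_train[rank::size].astype("int64")

# MobileNetV2 trained from scratch on 32x32 CIFAR-10 inputs (no pretrained weights).
model = tf.keras.applications.MobileNetV2(
    input_shape=(32, 32, 3), weights=None, classes=10)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

# Broadcast rank 0's initial weights so every replica starts from the same point.
model.set_weights(comm.bcast(model.get_weights(), root=0))

EPOCHS, BATCH = 10, 64
steps = len(x_shard) // BATCH
for epoch in range(EPOCHS):
    for step in range(steps):
        xb = x_shard[step * BATCH:(step + 1) * BATCH]
        yb = y_shard[step * BATCH:(step + 1) * BATCH]
        with tf.GradientTape() as tape:
            loss = loss_fn(yb, model(xb, training=True))
        grads = tape.gradient(loss, model.trainable_variables)
        # Synchronous step: sum gradients across all MPI ranks, then average.
        avg = [comm.allreduce(g.numpy(), op=MPI.SUM) / size for g in grads]
        optimizer.apply_gradients(
            zip([tf.convert_to_tensor(a) for a in avg], model.trainable_variables))
    if rank == 0:
        print(f"epoch {epoch + 1}: last-batch loss = {float(loss):.4f}")
```

Because every rank applies the same averaged gradient each step, the replicas stay in lockstep, which is the synchronous behaviour the abstract attributes to the cluster runs; the per-rank shard size shrinks as the process count grows, which is the mechanism cited for the accuracy collapse at large scale.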
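The speedup and efficiency figures quoted above follow the standard strong-scaling definitions sketched below. Note that the reported ≈41.6% at 48 processes matches normalizing the ≈9.99× speedup by the 24 nodes rather than the 48 MPI ranks; the node-based normalization and the single-node baseline are inferences from the reported numbers, not definitions stated in the abstract.

```latex
% Strong-scaling speedup and parallel efficiency on n nodes,
% with T_1 the (assumed) single-node baseline time and T_n the n-node time.
S(n) = \frac{T_1}{T_n}, \qquad
E(n) = \frac{S(n)}{n}, \qquad
E(24) \approx \frac{9.99}{24} \approx 0.416 \;(\approx 41.6\%)
```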