Research Article

Deep Learning for Edge AI: MobileNetV2 CNN Training over ARM-based Clusters

by Dimitrios Papakyriakou, Ioannis S. Barbounakis
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 41
Published: September 2025
DOI: 10.5120/ijca2025925727

Dimitrios Papakyriakou, Ioannis S. Barbounakis. Deep Learning for Edge AI: MobileNetV2 CNN Training over ARM-based Clusters. International Journal of Computer Applications. 187, 41 (September 2025), 43–57. DOI=10.5120/ijca2025925727

@article{10.5120/ijca2025925727,
  author    = { Dimitrios Papakyriakou and Ioannis S. Barbounakis },
  title     = { Deep Learning for Edge AI: MobileNetV2 CNN Training over ARM-based Clusters },
  journal   = { International Journal of Computer Applications },
  year      = { 2025 },
  volume    = { 187 },
  number    = { 41 },
  pages     = { 43-57 },
  doi       = { 10.5120/ijca2025925727 },
  publisher = { Foundation of Computer Science (FCS), NY, USA }
}
%0 Journal Article
%D 2025
%A Dimitrios Papakyriakou
%A Ioannis S. Barbounakis
%T Deep Learning for Edge AI: MobileNetV2 CNN Training over ARM-based Clusters
%J International Journal of Computer Applications
%V 187
%N 41
%P 43-57
%R 10.5120/ijca2025925727
%I Foundation of Computer Science (FCS), NY, USA
Abstract

This paper presents a comprehensive investigation into the strong-scaling performance of distributed training for Convolutional Neural Networks (CNNs) using the MobileNetV2 architecture on a resource-constrained Beowulf cluster composed of 24 Raspberry Pi 4B nodes (8 GB RAM each). The training system employs the Message Passing Interface (MPI) via MPICH with synchronous data parallelism, running two processes per node across 2 to 48 total MPI processes. A fixed CIFAR-10 dataset was used, and all experiments were standardized to 10 epochs to maintain memory stability. The study jointly evaluates execution-time scaling, training/test accuracy, and convergence loss to assess both computational performance and learning quality under increasing parallelism. Training time decreased nearly ten-fold at cluster scale, reaching a maximum speedup of ≈9.99× with ≈41.6% parallel efficiency at 48 processes. Efficiency remained very high at small scales (≈90.9% at np=4) and moderate at np=8 (≈52.3%), confirming that MPI scaling itself is effective up to this range. However, while single-node and small-scale runs (up to 4–8 MPI processes) preserved strong generalization ability, larger scales suffered from sharply reduced per-rank dataset sizes, causing gradient noise and an eventual collapse of test accuracy to the random-guess baseline (10%). These results demonstrate that, although ARM-based Raspberry Pi clusters can support feasible small-scale distributed deep learning, strong scaling beyond an optimal process count leads to “fast but wrong” training in which wall-clock performance improves but model utility on unseen data is lost. This work provides the first detailed end-to-end evaluation of MPI-based synchronous CNN training across an ARM-based edge cluster, and outlines future research including comparative scaling with SqueezeNet and exploration of ultra-low-power Spiking Neural Networks (SNNs) for neuromorphic edge learning.
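A note on how the quoted figures relate (an inferred reading; the scaling baseline is not stated explicitly in the abstract): taking the single-node run with np = 2 as the reference, since each node hosts two MPI processes, the standard strong-scaling definitions are

\[
S(p) = \frac{T(2)}{T(p)}, \qquad E(p) = \frac{S(p)}{p/2},
\]

under which the maximum speedup S(48) ≈ 9.99 gives E(48) ≈ 9.99 / 24 ≈ 41.6%, consistent with the stated parallel efficiency at 48 processes.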
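The synchronous data parallelism described above amounts to each rank computing gradients on its own shard of CIFAR-10 and all ranks averaging those gradients before every weight update. The sketch below is a minimal illustration of that step, assuming Python with mpi4py over MPICH and NumPy gradient arrays; it is not the paper's training code, and the function name and shard rule are hypothetical.

# Minimal sketch (not the paper's code) of the synchronous data-parallel
# gradient-averaging step, assuming mpi4py over MPICH and NumPy arrays.
# Run with, e.g.: mpiexec -n 4 python sync_dp.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

def allreduce_average(grads):
    """Average a list of NumPy gradient arrays across all MPI ranks."""
    averaged = []
    for g in grads:
        buf = np.empty_like(g)
        comm.Allreduce(g, buf, op=MPI.SUM)  # sum the same tensor from every rank
        averaged.append(buf / size)         # divide by world size to get the mean
    return averaged

if __name__ == "__main__":
    # Toy check: each rank contributes a rank-valued "gradient"; after the
    # allreduce every rank holds the mean, (size - 1) / 2.
    fake_grads = [np.full((3, 3), float(rank))]
    avg = allreduce_average(fake_grads)
    if rank == 0:
        print("averaged gradient value:", avg[0][0, 0])
    # Per-rank sharding as in the abstract: with 48 ranks each rank sees
    # ~1/48 of CIFAR-10 (e.g. shard = train_set[rank::size]), the shrinking
    # shard size being the reported source of gradient noise at large scale.

Because the update is synchronous, every rank applies the identical averaged gradient, which keeps model replicas in lockstep; the cost is that per-rank shard size shrinks as the process count grows, the trade-off the abstract identifies as “fast but wrong” at 48 processes.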

References
  • LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.
  • Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., & Adam, H. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
  • Sergeev, A., & Del Balso, M. (2018). Horovod: fast and easy distributed deep learning in TensorFlow. arXiv:1802.05799.
  • Lane, N. D., Bhattacharya, S., et al. (2016). DeepX: A software accelerator for low-power deep learning inference on mobile devices. In IPSN '16.
  • Dastjerdi, A. V., & Buyya, R. (2016). Fog computing: Helping the Internet of Things realize its potential. Computer, 49(8), 112–116.
  • Raspberry Pi 4 Model B. [Online]. Available: https://www.raspberrypi.com/products/raspberry-pi-4-model-b/.
  • Raspberry Pi 4 Model B specifications. [Online]. Available: https://magpi.raspberrypi.com/articles/raspberry-pi-4-specs-benchmarks
  • Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L. C. (2018). MobileNetV2: Inverted residuals and linear bottlenecks. In CVPR, pp. 4510–4520.
  • Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Technical Report, University of Toronto.
  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.
  • Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Le, Q. V., … & Ng, A. Y. (2012). Large scale distributed deep networks. Advances in Neural Information Processing Systems, 25, 1–11.
  • Gropp, W., Lusk, E., & Skjellum, A. (2014). Using MPI: Portable parallel programming with the message-passing interface (3rd ed.). MIT Press.
  • Shallue, C. J., Lee, J., Antognini, J., Sohl-Dickstein, J., Frostig, R., & Dahl, G. E. (2019). Measuring the effects of data parallelism on neural network training. Journal of Machine Learning Research, 20(112), 1–49. http://jmlr.org/papers/v20/18-789.html
  • Masters, D., & Luschi, C. (2018). Revisiting small batch training for deep neural networks. arXiv preprint arXiv:1804.07612. https://arxiv.org/abs/1804.07612
Index Terms
Computer Science
Information Sciences
Keywords

Convolutional Neural Networks (CNNs), Distributed Deep Learning, Beowulf Cluster, ARM Architecture, Raspberry Pi Cluster, Parallel Computing, Message Passing Interface (MPI), MPICH, Low-Cost Clusters, Distributed Systems, HPC, MobileNet, Edge Computing, Parallel Efficiency, Edge Deep Learning
