International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 41
Published: September 2025
Authors: Dimitrios Papakyriakou, Ioannis S. Barbounakis
Dimitrios Papakyriakou and Ioannis S. Barbounakis. Deep Learning for Edge AI: MobileNetV2 CNN Training over ARM-based Clusters. International Journal of Computer Applications 187, 41 (September 2025), 43-57. DOI=10.5120/ijca2025925727
@article{10.5120/ijca2025925727,
  author    = {Dimitrios Papakyriakou and Ioannis S. Barbounakis},
  title     = {Deep Learning for Edge AI: MobileNetV2 CNN Training over ARM-based Clusters},
  journal   = {International Journal of Computer Applications},
  year      = {2025},
  volume    = {187},
  number    = {41},
  pages     = {43-57},
  doi       = {10.5120/ijca2025925727},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}
%0 Journal Article
%D 2025
%A Dimitrios Papakyriakou
%A Ioannis S. Barbounakis
%T Deep Learning for Edge AI: MobileNetV2 CNN Training over ARM-based Clusters
%J International Journal of Computer Applications
%V 187
%N 41
%P 43-57
%R 10.5120/ijca2025925727
%I Foundation of Computer Science (FCS), NY, USA
This paper presents a comprehensive investigation into the strong-scaling performance of distributed training for Convolutional Neural Networks (CNNs) using the MobileNetV2 architecture on a resource-constrained Beowulf cluster composed of 24 Raspberry Pi 4B nodes (8 GB RAM each). The training system employs the Message Passing Interface (MPI) via MPICH with synchronous data parallelism, running two processes per node across 2 to 48 total MPI processes. A fixed CIFAR-10 dataset was used, and all experiments were standardized to 10 epochs to maintain memory stability. The study jointly evaluates execution-time scaling, training/test accuracy, and convergence loss to assess both computational performance and learning quality under increasing parallelism. Training time decreased nearly ten-fold at cluster scale, reaching a maximum speedup of ≈9.99× with ≈41.6% parallel efficiency at 48 processes. Efficiency remained very high at small scale (≈90.9% at np=4) and moderate at np=8 (≈52.3%), confirming that MPI scaling itself is effective up to this range. However, while single-node and small-scale runs (up to 4–8 MPI processes) preserved strong generalization ability, larger scales suffered from sharply reduced per-rank dataset sizes, causing gradient noise and an eventual collapse of test accuracy to the random-guess baseline (10%). These results demonstrate that, although ARM-based Raspberry Pi clusters can support feasible small-scale distributed deep learning, strong scaling beyond an optimal process count leads to “fast but wrong” training, in which wall-clock performance improves but model utility on unseen data is lost. This work provides the first detailed end-to-end evaluation of MPI-based synchronous CNN training across an ARM-based edge cluster, and it outlines future research including comparative scaling with SqueezeNet and exploration of ultra-low-power Spiking Neural Networks (SNNs) for neuromorphic edge learning.
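The abstract does not include implementation details, but the setup it describes (synchronous data-parallel training over MPI, two processes per node, 10 epochs on CIFAR-10) can be illustrated with a minimal sketch. The snippet below assumes mpi4py and TensorFlow/Keras, a strided per-rank shard of CIFAR-10, SGD, and plain gradient averaging via allreduce; the script name, optimizer, and hyperparameters are illustrative assumptions, not the authors' actual code.

```python
# Minimal sketch (not the authors' implementation): synchronous data-parallel
# training of MobileNetV2 on CIFAR-10 with mpi4py + TensorFlow/Keras.
# Example launch (hypothetical script name): mpiexec -n 4 python train_mobilenetv2_mpi.py
from mpi4py import MPI
import tensorflow as tf

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank keeps only its own shard of the training set (strided split).
(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
x_shard = x_train[rank::size].astype("float32") / 255.0
y_shard = y_train[rank::size].astype("int64")

# MobileNetV2 trained from scratch on 32x32 CIFAR-10 inputs (no pretrained weights).
model = tf.keras.applications.MobileNetV2(
    input_shape=(32, 32, 3), weights=None, classes=10)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

# Broadcast rank 0's initial weights so every replica starts from the same point.
model.set_weights(comm.bcast(model.get_weights(), root=0))

EPOCHS, BATCH = 10, 64
steps = len(x_shard) // BATCH
for epoch in range(EPOCHS):
    for step in range(steps):
        xb = x_shard[step * BATCH:(step + 1) * BATCH]
        yb = y_shard[step * BATCH:(step + 1) * BATCH]
        with tf.GradientTape() as tape:
            loss = loss_fn(yb, model(xb, training=True))
        grads = tape.gradient(loss, model.trainable_variables)
        # Synchronous step: sum gradients across all MPI ranks, then average.
        avg = [comm.allreduce(g.numpy(), op=MPI.SUM) / size for g in grads]
        optimizer.apply_gradients(
            zip([tf.convert_to_tensor(a) for a in avg], model.trainable_variables))
    if rank == 0:
        print(f"epoch {epoch + 1}: last-batch loss = {float(loss):.4f}")
```

Because every rank applies the same averaged gradient each step, the replicas stay in lockstep, which is the synchronous behaviour the abstract attributes to the cluster runs; the per-rank shard size shrinks as the process count grows, which is the mechanism cited for the accuracy collapse at large scale.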
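The speedup and efficiency figures quoted above follow the standard strong-scaling definitions sketched below. Note that the reported ≈41.6% at 48 processes matches normalizing the ≈9.99× speedup by the 24 nodes rather than the 48 MPI ranks; the node-based normalization and the single-node baseline are inferences from the reported numbers, not definitions stated in the abstract.

```latex
% Strong-scaling speedup and parallel efficiency on n nodes,
% with T_1 the (assumed) single-node baseline time and T_n the n-node time.
S(n) = \frac{T_1}{T_n}, \qquad
E(n) = \frac{S(n)}{n}, \qquad
E(24) \approx \frac{9.99}{24} \approx 0.416 \;(\approx 41.6\%)
```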