Research Article

AI-Assisted Incident Detection and Automated Recovery in Distributed E-Commerce Systems

by Rama Krishna Reddy Arumalla
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 98
Published: April 2026
Authors: Rama Krishna Reddy Arumalla
10.5120/ijcadab8ea8eb453

Rama Krishna Reddy Arumalla. AI-Assisted Incident Detection and Automated Recovery in Distributed E-Commerce Systems. International Journal of Computer Applications. 187, 98 (April 2026), 6-11. DOI=10.5120/ijcadab8ea8eb453

                        @article{ 10.5120/ijcadab8ea8eb453,
                        author  = { Rama Krishna Reddy Arumalla },
                        title   = { AI-Assisted Incident Detection and Automated Recovery in Distributed E-Commerce Systems },
                        journal = { International Journal of Computer Applications },
                        year    = { 2026 },
                        volume  = { 187 },
                        number  = { 98 },
                        pages   = { 6-11 },
                        doi     = { 10.5120/ijcadab8ea8eb453 },
                        publisher = { Foundation of Computer Science (FCS), NY, USA }
                        }
                        %0 Journal Article
                        %D 2026
                        %A Rama Krishna Reddy Arumalla
                        %T AI-Assisted Incident Detection and Automated Recovery in Distributed E-Commerce Systems
                        %J International Journal of Computer Applications
                        %V 187
                        %N 98
                        %P 6-11
                        %R 10.5120/ijcadab8ea8eb453
                        %I Foundation of Computer Science (FCS), NY, USA
Abstract

Distributed e-commerce systems face growing uptime and performance challenges because of the complexity of microservice architectures. This study proposes an Intelligent Observability and Incident Response Framework that actively detects bottlenecks and automates recovery processes. The study is based on a filtered dataset of 452 operational telemetry samples, covering metrics such as request latency, CPU utilization, memory pressure, and error rates recorded during peak-traffic scenarios. The framework uses a stack of open-source monitoring agents, time-series databases, and automated orchestration engines to shift monitoring toward predictive observability. The findings show that Mean Time to Detect (MTTD) and Mean Time to Repair (MTTR) are reduced significantly. These results indicate that machine learning, used in conjunction with conventional telemetry, can identify silent failures that conventional threshold-based alerts miss. The paper describes the architecture design, the implementation of the intelligent layer, and an overall discussion of system performance under different load conditions, which can serve as a blueprint for resilient digital commerce infrastructure.
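
The contrast between threshold-based alerting and statistical detection of silent failures can be illustrated with a minimal sketch. The abstract does not specify the detection model, so the rolling z-score below is a stand-in technique chosen for illustration only; the function names, the 500 ms limit, the window size, and the sample latencies are all hypothetical and are not taken from the paper or its dataset.

```python
from statistics import mean, stdev


def static_threshold_alerts(latencies_ms, limit_ms=500.0):
    """Return indices of samples that exceed a fixed latency limit (hypothetical limit)."""
    return [i for i, value in enumerate(latencies_ms) if value > limit_ms]


def rolling_zscore_alerts(latencies_ms, window=5, z_limit=3.0):
    """Return indices of samples that deviate from the trailing window
    mean by more than z_limit sample standard deviations."""
    alerts = []
    for i in range(window, len(latencies_ms)):
        history = latencies_ms[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat window: skip to avoid dividing by zero.
            continue
        if abs(latencies_ms[i] - mu) / sigma > z_limit:
            alerts.append(i)
    return alerts


# Illustrative request latencies in milliseconds (made-up values).
# The spike to 180 ms stays far below the fixed 500 ms limit.
samples = [100, 102, 98, 101, 99, 100, 180, 101, 100]

print(static_threshold_alerts(samples))  # prints []
print(rolling_zscore_alerts(samples))    # prints [6]
```

In this toy series the fixed limit never fires, while the rolling baseline flags the sample at index 6. A production system would likely need seasonality handling and multi-metric correlation, which this sketch does not attempt.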

Index Terms
Computer Science
Information Sciences
Keywords

AIOps, Intelligent Observability, Microservices Monitoring, Automated Incident Response, Self-Healing Systems, Distributed Tracing, Anomaly Detection, E-Commerce Infrastructure
