Survey on AI-Based Reliability and Anomaly Detection in Microservices

Muzeeb Mohammad

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 187 - Issue 74

Published: January 2026

DOI: 10.5120/ijca2026926263

Muzeeb Mohammad. Survey on AI-Based Reliability and Anomaly Detection in Microservices. International Journal of Computer Applications. 187, 74 (January 2026), 56-63. DOI=10.5120/ijca2026926263

@article{10.5120/ijca2026926263,
  author    = {Muzeeb Mohammad},
  title     = {Survey on AI-Based Reliability and Anomaly Detection in Microservices},
  journal   = {International Journal of Computer Applications},
  year      = {2026},
  volume    = {187},
  number    = {74},
  pages     = {56-63},
  doi       = {10.5120/ijca2026926263},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}

%0 Journal Article
%D 2026
%A Muzeeb Mohammad
%T Survey on AI-Based Reliability and Anomaly Detection in Microservices
%J International Journal of Computer Applications
%V 187
%N 74
%P 56-63
%R 10.5120/ijca2026926263
%I Foundation of Computer Science (FCS), NY, USA

Abstract

Microservice architectures enable scalable, agile applications, but their complexity introduces significant reliability challenges. Traditional monitoring often struggles to keep pace with the dynamic and distributed nature of microservices, motivating artificial intelligence (AI)-driven techniques for proactive anomaly detection and fault management. This survey reviews the state of the art in applying AI to reliability engineering and anomaly detection in microservice-based systems. It proposes a taxonomy covering (i) the observability signals used by anomaly detectors (metrics, logs and traces); (ii) the modelling techniques employed, from statistical and classical machine learning through deep learning, graph-based methods and large language models; and (iii) the deployment layer at which detection operates (centralized cloud clusters, distributed edge environments and service meshes). The survey analyzes representative systems and frameworks, comparing their strengths, weaknesses, data requirements, evaluation metrics, scalability and interpretability. It highlights common challenges such as the entropy gap in anomaly scoring, the scarcity of real-world labelled anomalies, the need for explainable results and compute constraints in distributed environments. It concludes with open problems and future directions, emphasizing opportunities in multimodal data fusion, federated and edge-based detection, and human-in-the-loop root-cause analysis for the next generation of reliable microservice ecosystems.

Index Terms

Computer Science

Information Sciences

Keywords

Microservices, anomaly detection, reliability engineering, observability, machine learning, root cause analysis, cloud computing, reinforcement learning, large language models, Bayesian networks