Automatic Name-Based Software Bug Detection via AST-Driven Static Analysis and Machine Learning

Dipu Dahal; Shirshak Acharya; Yunij Karki; Shashank Ghimire; Dinesh Baniya Kshatri

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 187 - Issue 39

Published: September 2025

DOI: 10.5120/ijca2025925682

Dipu Dahal, Shirshak Acharya, Yunij Karki, Shashank Ghimire, Dinesh Baniya Kshatri. Automatic Name-Based Software Bug Detection via AST-Driven Static Analysis and Machine Learning. International Journal of Computer Applications. 187, 39 (September 2025), 13-22. DOI=10.5120/ijca2025925682

@article{10.5120/ijca2025925682,
  author    = {Dipu Dahal and Shirshak Acharya and Yunij Karki and Shashank Ghimire and Dinesh Baniya Kshatri},
  title     = {Automatic Name-Based Software Bug Detection via AST-Driven Static Analysis and Machine Learning},
  journal   = {International Journal of Computer Applications},
  year      = {2025},
  volume    = {187},
  number    = {39},
  pages     = {13--22},
  doi       = {10.5120/ijca2025925682},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}

%0 Journal Article
%D 2025
%A Dipu Dahal
%A Shirshak Acharya
%A Yunij Karki
%A Shashank Ghimire
%A Dinesh Baniya Kshatri
%T Automatic Name-Based Software Bug Detection via AST-Driven Static Analysis and Machine Learning
%J International Journal of Computer Applications
%V 187
%N 39
%P 13-22
%R 10.5120/ijca2025925682
%I Foundation of Computer Science (FCS), NY, USA

Abstract

This paper presents a name-based analysis technique for statically typed languages that automatically classifies and localizes specific bugs in source code, eliminating the need for manually designed algorithms or heuristics. Name-based bug detection analyzes the names or labels given to variables, functions, and other code elements to find potential bugs. Because a large set of negative (buggy) samples is not available, negative samples are generated automatically from the Abstract Syntax Tree (AST) of the source code. Approximately 720,000 C code snippets are collected from a large C code corpus and parsed into their corresponding ASTs using LibClang. Positive samples are extracted from the ASTs, and their contents are altered to produce negative samples. The samples are then tokenized using a fine-tuned tokenizer and used to train a classification model that identifies potential bugs. The paper describes detectors for three bug types: swapped function arguments, wrong binary operators, and wrong operator precedence, with F1 scores between 83% and 95%. Detectors for new bug types can be developed by following the same steps used for the current detectors. The resulting system automatically detects specific types of bugs in source code and serves as a tool that helps software developers improve code quality.
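
The negative-sample generation described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the paper extracts samples from LibClang ASTs, whereas this example assumes the relevant facts (callee name, argument texts, operator and operands) have already been extracted into the hypothetical records `CallSample` and `BinOpSample`. The operator pool `BINARY_OPERATORS`, the helper names, and the label convention (0 for correct, 1 for buggy) are likewise assumptions made for illustration. Only two of the three bug types are covered.

```python
import random
from dataclasses import dataclass


# Hypothetical records standing in for facts extracted from an AST.
@dataclass(frozen=True)
class CallSample:
    callee: str
    args: tuple  # argument expressions as source text


@dataclass(frozen=True)
class BinOpSample:
    left: str
    op: str
    right: str


# Assumed pool of C binary operators used to pick a wrong replacement.
BINARY_OPERATORS = ("+", "-", "*", "/", "%", "<", "<=", ">", ">=",
                    "==", "!=", "&&", "||")


def swap_arguments(sample, i=0, j=1):
    """Return a copy with arguments i and j exchanged.

    Returns None when the call has too few arguments or when the swap
    would leave the call unchanged.
    """
    args = list(sample.args)
    if max(i, j) >= len(args) or args[i] == args[j]:
        return None
    args[i], args[j] = args[j], args[i]
    return CallSample(sample.callee, tuple(args))


def replace_operator(sample, rng):
    """Return a copy whose operator is replaced by a different one."""
    candidates = [op for op in BINARY_OPERATORS if op != sample.op]
    return BinOpSample(sample.left, rng.choice(candidates), sample.right)


def render(sample):
    """Turn a record back into source text for a tokenizer."""
    if isinstance(sample, CallSample):
        return f"{sample.callee}({', '.join(sample.args)})"
    return f"{sample.left} {sample.op} {sample.right}"


def build_labeled_samples(positives, rng):
    """Pair each positive sample (label 0) with a mutated negative (label 1)."""
    labeled = []
    for sample in positives:
        if isinstance(sample, CallSample):
            negative = swap_arguments(sample)
        else:
            negative = replace_operator(sample, rng)
        labeled.append((render(sample), 0))
        if negative is not None:
            labeled.append((render(negative), 1))
    return labeled


if __name__ == "__main__":
    positives = [
        CallSample("memcpy", ("dst", "src", "n")),
        BinOpSample("i", "<", "n"),
    ]
    for text, label in build_labeled_samples(positives, random.Random(0)):
        print(label, text)
```

Mutating records rather than raw text keeps each negative sample syntactically valid, which matches the idea of deriving buggy variants from the AST. In a real pipeline, the rendered strings would be passed to the tokenizer and classifier.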

Index Terms

Computer Science

Information Sciences

Keywords

Abstract Syntax Tree; C Language; Machine Learning; Hyperparameter; Embedding Vector; Bidirectional Encoder Representations from Transformers (BERT)