Synthetic Data Generation For Automated JavaScript Vulnerability Detection Using Fine-Tuned CodeBERT

Harun Hadzagic; Zerina Altoka

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 187 - Issue 90

Published: March 2026

DOI: 10.5120/ijca2026926575

Harun Hadzagic, Zerina Altoka. Synthetic Data Generation For Automated JavaScript Vulnerability Detection Using Fine-Tuned CodeBERT. International Journal of Computer Applications. 187, 90 (March 2026), 16-22. DOI=10.5120/ijca2026926575

@article{ 10.5120/ijca2026926575,
  author    = { Harun Hadzagic and Zerina Altoka },
  title     = { Synthetic Data Generation For Automated JavaScript Vulnerability Detection Using Fine-Tuned CodeBERT },
  journal   = { International Journal of Computer Applications },
  year      = { 2026 },
  volume    = { 187 },
  number    = { 90 },
  pages     = { 16-22 },
  doi       = { 10.5120/ijca2026926575 },
  publisher = { Foundation of Computer Science (FCS), NY, USA }
}

%0 Journal Article
%D 2026
%A Harun Hadzagic
%A Zerina Altoka
%T Synthetic Data Generation For Automated JavaScript Vulnerability Detection Using Fine-Tuned CodeBERT
%J International Journal of Computer Applications
%V 187
%N 90
%P 16-22
%R 10.5120/ijca2026926575
%I Foundation of Computer Science (FCS), NY, USA

Abstract

The dynamic and flexible nature of JavaScript, the foundational language of modern web development, makes it highly susceptible to vulnerabilities such as Cross-Site Scripting (XSS), SQL Injection, and Hardcoded Secrets. Traditional security analysis tools and manual code review struggle to maintain accuracy and scalability in complex codebases, especially as AI is increasingly used to produce code. To address this, this paper presents a CodeBERT transformer model fine-tuned for automated binary sequence classification. A balanced dataset was constructed of 71 vulnerabilities with 60 JavaScript code snippets (30 pairs of secure and insecure versions), generated through advanced LLMs. A rigorous Pair-ID splitting methodology ensured that the model was evaluated on truly unseen vulnerability patterns, preventing data leakage and overfitting. On the held-out test set, the fine-tuned CodeBERT model achieved an F1-Score of 0.9413. Crucially, the model attained a Recall of 0.9468 for the 'Insecure' class, confirming its ability to minimize missed vulnerabilities, the most critical error in security screening. Furthermore, a generalization check using an alternating dataset validated the model's robustness, maintaining a high F1-Score. The findings demonstrate the viability of specialized Code LLMs for reliable vulnerability detection, paving the way for low-latency integration into continuous integration pipelines to enforce secure coding practices in real time.
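
The abstract does not describe how the Pair-ID split or the metrics are implemented. The following is a minimal Python sketch of the general idea only, not the authors' code: samples that share a pair identifier are kept in the same subset, so the secure and insecure versions of one snippet never straddle the train/test boundary. The field names (`pair_id`, `label`, `code`), the label convention (1 = insecure, 0 = secure), the 20% test fraction, and both function names are assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_pair_id(samples, test_fraction=0.2, seed=0):
    """Split samples so that every pair lands entirely in train or in test.

    `pair_id` is a hypothetical field name; the test fraction is an assumption.
    """
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["pair_id"]].append(sample)

    # Shuffle pair identifiers, not individual samples.
    pair_ids = sorted(groups)
    random.Random(seed).shuffle(pair_ids)

    n_test = max(1, round(len(pair_ids) * test_fraction))
    test = [s for pid in pair_ids[:n_test] for s in groups[pid]]
    train = [s for pid in pair_ids[n_test:] for s in groups[pid]]
    return train, test


def insecure_recall_and_f1(y_true, y_pred, positive=1):
    """Recall and F1 for the positive class (assumed here to be 'Insecure')."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, f1


# Toy data: 30 pairs, each with one secure (0) and one insecure (1) snippet.
samples = [
    {"pair_id": pid, "label": label, "code": f"// snippet {pid}-{label}"}
    for pid in range(30)
    for label in (0, 1)
]
train, test = split_by_pair_id(samples)
```

Splitting on the pair identifier rather than on individual snippets is what prevents leakage: a model that has seen the secure version of a snippet during training cannot be tested on its near-identical insecure twin. Recall on the insecure class is reported separately because a false negative there corresponds to a missed vulnerability.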

Index Terms

Computer Science

Information Sciences

Keywords

Synthetic Data Generation, JavaScript Vulnerability Detection, CodeBERT, Static Code Analysis, Secure JavaScript, Transformer Models