Research Article

Synthetic Data Generation For Automated JavaScript Vulnerability Detection Using Fine-Tuned CodeBERT

by Harun Hadzagic, Zerina Altoka
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 90
Published: March 2026
Authors: Harun Hadzagic, Zerina Altoka
10.5120/ijca2026926575

Harun Hadzagic, Zerina Altoka. Synthetic Data Generation For Automated JavaScript Vulnerability Detection Using Fine-Tuned CodeBERT. International Journal of Computer Applications. 187, 90 (March 2026), 16-22. DOI=10.5120/ijca2026926575

@article{ 10.5120/ijca2026926575,
author  = { Harun Hadzagic and Zerina Altoka },
title   = { Synthetic Data Generation For Automated JavaScript Vulnerability Detection Using Fine-Tuned CodeBERT },
journal = { International Journal of Computer Applications },
year    = { 2026 },
volume  = { 187 },
number  = { 90 },
pages   = { 16-22 },
doi     = { 10.5120/ijca2026926575 },
publisher = { Foundation of Computer Science (FCS), NY, USA }
}

%0 Journal Article
%D 2026
%A Harun Hadzagic
%A Zerina Altoka
%T Synthetic Data Generation For Automated JavaScript Vulnerability Detection Using Fine-Tuned CodeBERT
%J International Journal of Computer Applications
%V 187
%N 90
%P 16-22
%R 10.5120/ijca2026926575
%I Foundation of Computer Science (FCS), NY, USA

Abstract

The dynamic and flexible nature of JavaScript, the foundational language of modern web development, makes it highly susceptible to vulnerabilities such as Cross-Site Scripting (XSS), SQL Injection, and Hardcoded Secrets. Traditional security analysis tools and manual code review struggle to maintain accuracy and scalability in complex codebases, especially as AI is increasingly used to produce code. To address this, this paper presents a solution based on a CodeBERT transformer model fine-tuned for automated binary sequence classification. A balanced dataset was constructed covering 71 vulnerabilities with 60 JavaScript code snippets (30 pairs of secure and insecure versions), generated through advanced LLMs. A rigorous Pair-ID splitting methodology ensured that the model was evaluated on truly unseen vulnerability patterns, preventing data leakage and guarding against overfitting. The fine-tuned CodeBERT model achieved an F1-Score of 0.9413 on the held-out test set. Crucially, the model attained a Recall of 0.9468 for the 'Insecure' class, confirming its ability to minimize missed vulnerabilities, the most critical error in security screening. Furthermore, a generalization check using an alternating dataset validated the model's robustness, maintaining a high F1-Score. The findings demonstrate the viability of specialized Code LLMs for reliable vulnerability detection, paving the way for low-latency integration into continuous integration pipelines to enforce secure coding practices in real time.
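
The abstract does not spell out how the Pair-ID split or the reported metrics are computed, so the following Python sketch is only an illustration under stated assumptions. It assumes that "Pair-ID splitting" means the secure and insecure versions of a snippet share one identifier and are always assigned to the same side of the split, and that 'Insecure' is treated as the positive class. The field names (`pair_id`, `label`, `code`), the 20% test fraction, and the helper names are hypothetical and not taken from the paper.

```python
import random
from collections import defaultdict


def pair_id_split(samples, test_fraction=0.2, seed=0):
    """Split at the pair level so both members of a pair land on the same side.

    Assumption: each sample is a dict with a hypothetical "pair_id" key that is
    shared by the secure and insecure version of one snippet.
    """
    by_pair = defaultdict(list)
    for sample in samples:
        by_pair[sample["pair_id"]].append(sample)

    pair_ids = sorted(by_pair)
    random.Random(seed).shuffle(pair_ids)  # deterministic shuffle of pair IDs
    n_test = max(1, round(len(pair_ids) * test_fraction))

    test = [s for pid in pair_ids[:n_test] for s in by_pair[pid]]
    train = [s for pid in pair_ids[n_test:] for s in by_pair[pid]]
    return train, test


def recall_and_f1(y_true, y_pred, positive=1):
    """Recall and F1 for the positive class (assumed here: 1 = 'Insecure')."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return recall, f1


# Toy data: 30 pairs, each with a secure (0) and an insecure (1) version.
samples = [
    {"pair_id": pid, "label": label, "code": f"// snippet {pid}, label {label}"}
    for pid in range(30)
    for label in (0, 1)
]
train, test = pair_id_split(samples)

# No pair ID appears on both sides, so the test set holds only unseen pairs.
shared = {s["pair_id"] for s in train} & {s["pair_id"] for s in test}
print(len(train), len(test), len(shared))
```

Splitting on the pair identifier rather than on individual snippets is what keeps a near-duplicate of a test snippet out of the training data; a false negative on the 'Insecure' class lowers the recall value returned by `recall_and_f1`, which is why that metric is emphasized for security screening.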

Index Terms
Computer Science
Information Sciences
Keywords

Synthetic Data Generation, JavaScript Vulnerability Detection, CodeBERT, Static Code Analysis, Secure JavaScript, Transformer Models
