A Hybrid Structural and TF-IDF-Based Machine Learning Framework for Large-Scale Phishing URL Detection

Handayani; Ety Sutanty; Esti Setiyaningsih

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 187 - Issue 90

Published: March 2026

10.5120/ijca2026926600

Handayani, Ety Sutanty, Esti Setiyaningsih. A Hybrid Structural and TF-IDF-Based Machine Learning Framework for Large-Scale Phishing URL Detection. International Journal of Computer Applications. 187, 90 (March 2026), 52-59. DOI=10.5120/ijca2026926600

@article{10.5120/ijca2026926600,
  author    = {Handayani and Ety Sutanty and Esti Setiyaningsih},
  title     = {A Hybrid Structural and TF-IDF-Based Machine Learning Framework for Large-Scale Phishing URL Detection},
  journal   = {International Journal of Computer Applications},
  year      = {2026},
  volume    = {187},
  number    = {90},
  pages     = {52--59},
  doi       = {10.5120/ijca2026926600},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}

%0 Journal Article
%D 2026
%A Handayani
%A Ety Sutanty
%A Esti Setiyaningsih
%T A Hybrid Structural and TF-IDF-Based Machine Learning Framework for Large-Scale Phishing URL Detection
%J International Journal of Computer Applications
%V 187
%N 90
%P 52-59
%R 10.5120/ijca2026926600
%I Foundation of Computer Science (FCS), NY, USA

Abstract

Phishing attacks continue to pose significant cybersecurity risks by using deceptive URLs to obtain sensitive user information, which makes accurate and scalable automated detection necessary. This study proposes a machine learning-based approach to phishing URL classification that combines structural URL feature extraction with a Natural Language Processing (NLP) representation based on Term Frequency-Inverse Document Frequency (TF-IDF). The dataset comprises 822,010 labeled URLs, of which 52% are legitimate and 48% are phishing, and was validated beforehand to confirm that it contains no missing values. Feature engineering followed two complementary strategies: handcrafted structural features (URL length, domain length, number of digits, special characters, suspicious keywords, HTTPS usage, and number of subdomains) and a TF-IDF-based textual representation using unigram, bigram, and trigram tokenization. The combined feature set was used to train a Random Forest classifier with optimized hyperparameters, and the model was evaluated with Stratified 5-Fold Cross Validation to preserve the class distribution across training and testing subsets. Performance was assessed using the confusion matrix, precision, recall, and F1-score to give a comprehensive view of detection capability. The experimental findings indicate that integrating structural and textual features significantly improves classification effectiveness and yields robust, balanced detection of both phishing and legitimate URLs, which supports the practical applicability of the proposed method for large-scale real-world deployment.
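
The two feature strategies described in the abstract can be illustrated with a minimal Python sketch that uses only the standard library. It is not the authors' implementation. The keyword list `SUSPICIOUS_KEYWORDS`, the word-level tokenization, the subdomain heuristic, and the smoothed idf formula are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical keyword list; the abstract does not enumerate the paper's keywords.
SUSPICIOUS_KEYWORDS = ("login", "verify", "secure", "account", "update")


def structural_features(url: str) -> dict:
    """Handcrafted features named in the abstract, computed from the raw URL string."""
    # Prepend a scheme only for parsing, so that scheme-less URLs still yield a hostname.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    lowered = url.lower()
    return {
        "url_length": len(url),
        "domain_length": len(host),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_special_chars": sum(not ch.isalnum() for ch in url),
        "num_suspicious_keywords": sum(kw in lowered for kw in SUSPICIOUS_KEYWORDS),
        "uses_https": int(parsed.scheme == "https"),
        # Naive assumption: every label beyond the last two is a subdomain
        # (this ignores multi-part suffixes such as co.uk).
        "num_subdomains": max(len(host.split(".")) - 2, 0) if host else 0,
    }


def ngram_tokens(url: str, max_n: int = 3) -> list:
    """Split the URL on non-alphanumeric characters, then emit word 1- to max_n-grams."""
    words = [w for w in re.split(r"[^a-z0-9]+", url.lower()) if w]
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    ]


def tfidf(corpus: list) -> list:
    """Per-URL TF-IDF weights with relative term frequency and
    an assumed smoothed idf of ln((1 + N) / (1 + df)) + 1."""
    docs = [Counter(ngram_tokens(url)) for url in corpus]
    # Iterating a Counter yields each distinct term once, so this counts documents.
    df = Counter(term for doc in docs for term in doc)
    n_docs = len(docs)
    weights = []
    for doc in docs:
        total = sum(doc.values())
        weights.append({
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in doc.items()
        })
    return weights


if __name__ == "__main__":
    sample = "https://secure-login.example.com/verify?id=123"
    print(structural_features(sample))
    print(ngram_tokens("http://a.b/c"))
```

In a full pipeline, the structural feature values and the TF-IDF weights would be concatenated into a single feature matrix before classification. One plausible way to continue, again an assumption rather than something the abstract confirms, is to pass that matrix to scikit-learn's `RandomForestClassifier` and evaluate it with `StratifiedKFold(n_splits=5)`.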

Index Terms

Computer Science

Information Sciences

Keywords

Phishing URL Detection; Random Forest; TF-IDF; URL Feature Extraction; Stratified K-Fold Cross Validation; Ensemble Learning