Research Article

A Comprehensive Review of Object Detection: From Handcrafted Features to Deep Convolutional and Transformer-Based Architectures

by Aditya P. Bakshi
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 72
Published: January 2026
DOI: 10.5120/ijca2026926209

Aditya P. Bakshi. A Comprehensive Review of Object Detection: From Handcrafted Features to Deep Convolutional and Transformer-Based Architectures. International Journal of Computer Applications 187, 72 (January 2026), 41-49. DOI=10.5120/ijca2026926209

@article{10.5120/ijca2026926209,
  author    = {Aditya P. Bakshi},
  title     = {A Comprehensive Review of Object Detection: From Handcrafted Features to Deep Convolutional and Transformer-Based Architectures},
  journal   = {International Journal of Computer Applications},
  year      = {2026},
  volume    = {187},
  number    = {72},
  pages     = {41--49},
  doi       = {10.5120/ijca2026926209},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}
%0 Journal Article
%D 2026
%A Aditya P. Bakshi
%T A Comprehensive Review of Object Detection: From Handcrafted Features to Deep Convolutional and Transformer-Based Architectures
%J International Journal of Computer Applications
%V 187
%N 72
%P 41-49
%R 10.5120/ijca2026926209
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Object detection has experienced a substantial evolution over the past two decades, transitioning from handcrafted feature-based pipelines to highly expressive deep learning and transformer-driven architectures. Early detection systems relied on manually designed descriptors such as Histograms of Oriented Gradients (HOG) and Deformable Part Models (DPM), coupled with exhaustive sliding-window or part-based search strategies. While effective in constrained scenarios, these approaches were limited by weak semantic representation, sensitivity to scale and illumination variations, and poor generalization to complex real-world environments. The advent of deep convolutional neural networks (CNNs) fundamentally reshaped object detection by enabling end-to-end hierarchical feature learning from large-scale annotated datasets. This shift led to the development of region-proposal-based two-stage detectors, single-stage dense regression models, and, more recently, transformer-based architectures that reformulate detection as a global set prediction problem. This paper presents a comprehensive and in-depth review of modern object detection frameworks, systematically covering two-stage detectors, one-stage detectors, and transformer-driven models. The review emphasizes the theoretical foundations underlying these paradigms, including multi-scale feature learning, anchor-based and anchor-free localization strategies, attention mechanisms, loss function design, and hierarchical feature aggregation. Key innovations such as Feature Pyramid Networks, focal loss, deformable convolutions, and encoder–decoder transformers are critically analyzed to understand their impact on detection accuracy, convergence behavior, robustness, and computational efficiency. 
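Among the loss-design innovations surveyed here, focal loss is compact enough to state directly. The following is a minimal NumPy sketch of the binary focal loss of Lin et al. (the `alpha = 0.25`, `gamma = 2.0` defaults follow the original paper; the function name and array conventions are ours, and real detectors apply this per anchor or per query inside a larger training loop):

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss: cross-entropy scaled by (1 - p_t)^gamma,
    which down-weights easy, well-classified examples.

    p : predicted foreground probabilities in (0, 1)
    y : binary ground-truth labels (1 = object, 0 = background)
    """
    p = np.clip(p, 1e-7, 1 - 1e-7)              # numerical safety
    p_t = np.where(y == 1, p, 1 - p)            # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)
```

With `gamma = 0` this reduces to (alpha-weighted) cross-entropy; with `gamma = 2` a confident correct prediction (`p_t = 0.9`) contributes roughly 100 times less loss than the plain cross-entropy term, which is what lets one-stage detectors train on the extreme foreground/background imbalance of dense anchors.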
In addition, the survey examines benchmark datasets, evaluation protocols, training strategies, and deployment challenges, highlighting persistent issues such as small-object detection, long-tail class distributions, data efficiency, and inference latency. Finally, emerging research directions are discussed, including lightweight and efficient transformer architectures, multimodal and open-vocabulary object detection, self-supervised and semi-supervised pretraining, and unified perception models that integrate detection with segmentation and tracking. By synthesizing both theoretical insights and empirical trends, this review aims to provide a cohesive foundation for advancing robust, efficient, and scalable object detection systems.
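The evaluation protocols discussed above (PASCAL VOC mAP, COCO AP) all rest on the Intersection-over-Union overlap criterion for matching predictions to ground truth. A minimal sketch, assuming axis-aligned boxes in `(x1, y1, x2, y2)` corner format (the function name and box convention are our choices):

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle (empty if boxes are disjoint).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A detection counts as a true positive only if its IoU with an unmatched ground-truth box exceeds a threshold; VOC fixes this at 0.5, while COCO averages AP over thresholds from 0.5 to 0.95, which is one reason COCO is the harder localization benchmark.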

References
  • N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005, pp. 886–893.
  • D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
  • P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2001, pp. I–511–I–518.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
  • R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
  • R. Girshick, “Fast R-CNN,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
  • S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
  • J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
  • W. Liu et al., “SSD: Single shot multibox detector,” in Proceedings of the European Conference on Computer Vision, 2016, pp. 21–35.
  • T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
  • A. Dosovitskiy et al., “An image is worth 16×16 words: Transformers for image recognition at scale,” in Proceedings of the International Conference on Learning Representations, 2021.
  • N. Carion et al., “End-to-end object detection with transformers,” in Proceedings of the European Conference on Computer Vision, 2020, pp. 213–229.
  • X. Zhu et al., “Deformable DETR: Deformable transformers for end-to-end object detection,” in Proceedings of the International Conference on Learning Representations, 2021.
  • Z. Liu et al., “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
  • Y. Li et al., “Exploring plain vision transformer backbones for object detection,” arXiv preprint, arXiv:2203.16527, 2022.
  • P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.
  • N. Dalal, “Finding people in images and videos,” Ph.D. dissertation, Institut National Polytechnique de Grenoble, 2006.
  • P. Viola and M. Jones, “Robust real-time face detection,” International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.
  • T.-Y. Lin et al., “Feature pyramid networks for object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.
  • Z. Cai and N. Vasconcelos, “Cascade R-CNN: Delving into high quality object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6154–6162.
  • J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, “Deformable convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 764–773.
  • J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” arXiv preprint, arXiv:1804.02767, 2018.
  • A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “YOLOv4: Optimal speed and accuracy of object detection,” arXiv preprint, arXiv:2004.10934, 2020.
  • C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, “YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors,” arXiv preprint, arXiv:2207.02696, 2022.
  • Z. Tian, C. Shen, H. Chen, and T. He, “FCOS: Fully convolutional one-stage object detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 9627–9636.
  • M. Tan, R. Pang, and Q. V. Le, “EfficientDet: Scalable and efficient object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10778–10787.
  • S. Xie et al., “Aggregated residual transformations for deep neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5987–5995.
  • M. Everingham et al., “The PASCAL Visual Object Classes (VOC) challenge,” International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
  • T.-Y. Lin et al., “Microsoft COCO: Common objects in context,” in Proceedings of the European Conference on Computer Vision, 2014, pp. 740–755.
  • A. Kuznetsova et al., “The Open Images Dataset V4,” International Journal of Computer Vision, vol. 128, pp. 1956–1981, 2020.
  • A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 3354–3361.
  • P. Sun et al., “Scalability in perception for autonomous driving: Waymo open dataset,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2446–2454.
  • D. H. Lee and H. J. Kim, “Object detection for autonomous driving using deep learning: A review,” IEEE Transactions on Intelligent Vehicles, vol. 6, no. 1, pp. 84–100, 2021.
  • G. Litjens et al., “A survey on deep learning in medical image analysis,” Medical Image Analysis, vol. 42, pp. 60–88, 2017.
  • A. Kamilaris and F. X. Prenafeta-Boldú, “Deep learning in agriculture: A survey,” Computers and Electronics in Agriculture, vol. 147, pp. 70–90, 2018.
Index Terms
Computer Science
Information Sciences
Keywords

Object Detection; Image Classification; Single-Stage Regression; Transformer-Based Architectures; Multi-Scale Reasoning; Proposal Generation; Multimodal Fusion; Self-Supervised Pretraining; Open-Vocabulary Detection.
