Enhancing Representation Learning with Spatial Transformation and Early Convolution for Reinforcement Learning-based Small Object Detection

Title:
Enhancing Representation Learning with Spatial Transformation and Early Convolution for Reinforcement Learning-based Small Object Detection
Journal Title:
IEEE Transactions on Circuits and Systems for Video Technology
Publication Date:
09 June 2023
Citation:
Fang, F., Liang, W., Cheng, Y., Xu, Q., & Lim, J.-H. (2023). Enhancing Representation Learning with Spatial Transformation and Early Convolution for Reinforcement Learning-based Small Object Detection. IEEE Transactions on Circuits and Systems for Video Technology, 1–1. https://doi.org/10.1109/tcsvt.2023.3284453
Abstract:
Although object detection has achieved significant progress over the past decade, detecting small objects remains far from satisfactory due to the high variability of object scales and complex backgrounds. A common way to enhance small object detection is to use high-resolution (HR) images. However, this incurs a huge computational cost that grows quadratically with image resolution. To achieve both accuracy and efficiency, we propose a novel reinforcement learning (RL) framework that employs an efficient policy network consisting of a Spatial Transformation Network to enhance state representation learning and a Transformer model with early convolution to improve feature extraction. Our method has two main steps: (1) coarse location query (CLQ), where an RL agent is trained to predict the locations of small objects on low-resolution (LR) images (down-sampled versions of the HR images); and (2) context-sensitive object detection, where HR image patches are used to detect objects at the selected coarse locations and LR image patches are used on background areas (those containing no small objects). In this way, we obtain high detection performance on small objects while avoiding unnecessary computation on background areas. The proposed method has been tested and benchmarked on various datasets. On the Caltech Pedestrian Detection and Web Pedestrians datasets, it improves detection accuracy by 2% while reducing the number of processed pixels. On the Vision meets Drone object detection dataset and the Oil and Gas Storage Tank dataset, it outperforms state-of-the-art (SotA) methods. On the MS COCO mini-val set, it outperforms SotA methods on small object detection while achieving comparable performance on medium and large objects.
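To make the two-step pipeline described in the abstract concrete, below is a minimal PyTorch sketch, not the authors' implementation: a policy network combining a Spatial Transformation Network warp with an early-convolution stem and a small Transformer encoder scores grid cells of the down-sampled image (step 1, CLQ), and detection then runs on HR patches only at the selected cells and on cheap LR patches elsewhere (step 2). All layer sizes, the top-k selection rule, and the detector placeholder are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    """Policy network sketch: a Spatial Transformation Network (STN) warp of
    the LR state, an early-convolution stem, and a small Transformer encoder
    that scores each grid cell for the presence of small objects.
    All layer sizes are illustrative, not taken from the paper."""

    def __init__(self, dim=128):
        super().__init__()
        # STN localisation branch: predicts a 2x3 affine warp of the input.
        self.loc = nn.Sequential(
            nn.Conv2d(3, 16, 7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(16 * 4 * 4, 6),
        )
        self.loc[-1].weight.data.zero_()            # start at the identity warp
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))
        # Early convolutions instead of a single large-stride patchify layer.
        self.stem = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1),
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 1)  # per-cell logit: contains small objects?

    def forward(self, lr_img):
        theta = self.loc(lr_img).view(-1, 2, 3)
        warp = F.affine_grid(theta, lr_img.size(), align_corners=False)
        x = F.grid_sample(lr_img, warp, align_corners=False)  # STN-warped state
        tok = self.stem(x).flatten(2).transpose(1, 2)         # B x (G*G) x dim
        logits = self.head(self.encoder(tok)).squeeze(-1)     # B x (G*G)
        g = int(logits.shape[1] ** 0.5)
        return logits.view(-1, g, g)                          # B x G x G

def coarse_location_query(policy, hr_img, down=4, keep=0.25):
    """Step 1: score grid cells on a down-sampled copy of the HR image.
    At training time the logits would parametrise the RL action distribution,
    updated with a policy gradient; here we keep the top-k cells at inference."""
    lr = F.interpolate(hr_img, scale_factor=1 / down,
                       mode="bilinear", align_corners=False)
    logits = policy(lr).flatten(1)
    k = max(1, int(keep * logits.shape[1]))
    mask = torch.zeros_like(logits, dtype=torch.bool)
    mask.scatter_(1, logits.topk(k, dim=1).indices, True)
    g = int(logits.shape[1] ** 0.5)
    return mask.view(-1, g, g)                                # True = use HR

def context_sensitive_detect(hr_img, mask, detector, down=4):
    """Step 2: HR patches on selected cells, cheap LR patches elsewhere."""
    B, _, H, W = hr_img.shape
    G = mask.shape[-1]
    ph, pw = H // G, W // G
    detections = []
    for b in range(B):
        for i in range(G):
            for j in range(G):
                patch = hr_img[b:b + 1, :, i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
                if not mask[b, i, j]:  # background cell: down-sample first
                    patch = F.interpolate(patch, scale_factor=1 / down,
                                          mode="bilinear", align_corners=False)
                detections.append(detector(patch))
    return detections

policy = PolicyNet()
hr = torch.randn(1, 3, 512, 512)                    # dummy HR image
sel = coarse_location_query(policy, hr)
dets = context_sensitive_detect(hr, sel, detector=lambda p: [])  # plug in any patch detector

The early-convolution stem stands in for a single large-stride patch-embedding layer, which is the design choice the abstract credits with improved feature extraction; the reward that trades detection accuracy against the number of HR pixels processed is part of training and is not shown in this inference-only sketch.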
License type:
Publisher Copyright
Funding Info:
This research/project is supported by the Agency for Science, Technology and Research (A*STAR) under the AME Programmatic Funding Scheme.
Grant Reference No.: A18A2b0046
Description:
© 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
ISSN:
1051-8215 (print)
1558-2205 (electronic)
Files uploaded:
final-version-double-column-amend.pdf (5.45 MB, PDF; available on request)