TDAM: Top-Down Attention Module for Contextually Guided Feature Selection in CNNs

Title:
TDAM: Top-Down Attention Module for Contextually Guided Feature Selection in CNNs
Journal Title:
Computer Vision – ECCV 2022
Publication Date:
19 October 2022
Citation:
Jaiswal, S., Fernando, B., & Tan, C. (2022). TDAM: Top-Down Attention Module for Contextually Guided Feature Selection in CNNs. Computer Vision – ECCV 2022, 259–276. https://doi.org/10.1007/978-3-031-19806-9_15
Abstract:
Attention modules for Convolutional Neural Networks (CNNs) are an effective method to enhance performance on multiple computer-vision tasks. While existing methods appropriately model channel-, spatial- and self-attention, they primarily operate in a feedforward bottom-up manner. Consequently, the attention mechanism strongly depends on the local information of a single input feature map and does not incorporate relatively semantically-richer contextual information available at higher layers that can specify “what and where to look” in lower-level feature maps through top-down information flow. Accordingly, in this work, we propose a lightweight top-down attention module (TDAM) that iteratively generates a “visual searchlight” to perform channel and spatial modulation of its inputs and outputs more contextually-relevant feature maps at each computation step. Our experiments indicate that TDAM enhances the performance of CNNs across multiple object-recognition benchmarks and outperforms prominent attention modules while being more parameter and memory efficient. Further, TDAM-based models learn to “shift attention” by localizing individual objects or features at each computation step without any explicit supervision, resulting in a 5% improvement for ResNet50 on weakly-supervised object localization.
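As a rough illustration of the mechanism the abstract describes (iterative channel and spatial modulation of a feature map by a recomputed "searchlight"), the following PyTorch sketch shows one way such a loop could look. It is not the paper's TDAM implementation; the class name, gating layers, reduction ratio, and iteration count are assumptions made for this example only.

# Minimal, illustrative sketch of iterative channel + spatial modulation
# in PyTorch. NOT the authors' TDAM implementation: the class name,
# gating layers, reduction ratio, and step count are assumptions.
import torch
import torch.nn as nn


class TopDownAttentionSketch(nn.Module):
    def __init__(self, channels: int, reduction: int = 16, steps: int = 2):
        super().__init__()
        self.steps = steps
        # Channel "searchlight": squeeze global context, re-weight channels.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial "searchlight": highlight locations to attend to.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = x
        # Each step recomputes the searchlight from the current, already
        # modulated (hence more contextual) feature map and applies it again.
        for _ in range(self.steps):
            out = out * self.channel_gate(out)   # channel modulation
            out = out * self.spatial_gate(out)   # spatial modulation
        return out


if __name__ == "__main__":
    feats = torch.randn(2, 64, 32, 32)           # dummy feature map
    print(TopDownAttentionSketch(channels=64)(feats).shape)  # torch.Size([2, 64, 32, 32])

In the actual module, the searchlight is driven by contextual information from higher layers rather than the current feature map alone; the sketch only conveys the iterative modulate-and-recompute idea.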
License type:
Publisher Copyright
Funding Info:
This research / project is supported by the National Research Foundation - AI Singapore Program
Grant Reference no.: AISG-RP-2019-010

This research is supported by core funding from: SERC Central Research Fund
Grant Reference no.: NA

Centre for Frontier AI Research (CFAR), A*STAR
Description:
This version of the article has been accepted for publication, after peer review and is subject to Springer Nature’s AM terms of use, but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections. The Version of Record is available online at: http://dx.doi.org/10.1007/978-3-031-19806-9_15
ISBN (online):
9783031198069
ISBN (print):
9783031198052
Files uploaded:

5443-camera-ready.pdf (4.93 MB, PDF)