Wu, Y.-H., Liu, Y., Zhan, X., & Cheng, M.-M. (2022). P2T: Pyramid Pooling Transformer for Scene Understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1–12. https://doi.org/10.1109/tpami.2022.3202765
Abstract:
Recently, vision transformers have achieved great success by pushing the state of the art on various vision tasks. One of the most challenging problems in vision transformers is that the large sequence length of image tokens leads to high computational cost (quadratic complexity). A popular solution is to use a single pooling operation to reduce the sequence length. This paper considers how to improve existing vision transformers, in which the pooled feature extracted by a single pooling operation seems less powerful. To this end, we note that pyramid pooling has been demonstrated to be effective in various vision tasks owing to its powerful ability in context abstraction. However, pyramid pooling has not been explored in backbone network design. To bridge this gap, we propose to adapt pyramid pooling to Multi-Head Self-Attention (MHSA) in the vision transformer, simultaneously reducing the sequence length and capturing powerful contextual features. Equipped with our pooling-based MHSA, we build a universal vision transformer backbone, dubbed Pyramid Pooling Transformer (P2T). Extensive experiments demonstrate that, when P2T is applied as the backbone network, it shows substantial superiority over previous CNN- and transformer-based networks in various vision tasks such as image classification, semantic segmentation, object detection, and instance segmentation. The code will be released at https://github.com/yuhuan-wu/P2T.
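The core idea in the abstract — pooling the key/value tokens at several pyramid scales so that attention cost drops from quadratic in the sequence length to linear in a much shorter pooled sequence — can be sketched in a few lines of NumPy. This is only an illustrative single-head sketch under assumed details (average pooling, pooling ratios of 2/4/8, no learned projections or depthwise convolutions); `pyramid_pool_attention` and `avg_pool2d` are hypothetical names, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def avg_pool2d(x, ratio):
    # x: (H, W, C); non-overlapping average pooling with stride = ratio.
    H, W, C = x.shape
    Hp, Wp = H // ratio, W // ratio
    x = x[:Hp * ratio, :Wp * ratio].reshape(Hp, ratio, Wp, ratio, C)
    return x.mean(axis=(1, 3))

def pyramid_pool_attention(x, H, W, ratios=(2, 4, 8)):
    # x: (N, C) token sequence with N = H * W.
    # Queries come from the full sequence; keys/values come from the
    # concatenation of several pooled (much shorter) sequences — the pyramid.
    N, C = x.shape
    feat = x.reshape(H, W, C)
    pooled = [avg_pool2d(feat, r).reshape(-1, C) for r in ratios]
    kv = np.concatenate(pooled, axis=0)       # (M, C) with M << N
    attn = softmax(x @ kv.T / np.sqrt(C))     # (N, M) instead of (N, N)
    return attn @ kv                          # (N, C), same length as input
```

For an 8×8 token map (N = 64), the pooled key/value sequence has only 16 + 4 + 1 = 21 tokens, so the attention matrix is 64×21 rather than 64×64, while the multi-scale pyramid still exposes context at several granularities.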
License type:
Publisher Copyright
Funding Info:
This research / project is supported by the Agency for Science, Technology and Research (A*STAR) - AME Programmatic Funds
Grant Reference no. : A1892b0026
This research / project is supported by the Agency for Science, Technology and Research (A*STAR) - AME Programmatic Funds
Grant Reference no. : A19E3b0099
This work is supported in part by Major Project for New Generation of AI under Grant No. 2018AAA0100400, in part by NSFC under Grant No. 61922046, in part by Alibaba Innovative Research (AIR) Program.