P2T: Pyramid Pooling Transformer for Scene Understanding

Title:
P2T: Pyramid Pooling Transformer for Scene Understanding
Journal Title:
IEEE Transactions on Pattern Analysis and Machine Intelligence
Publication Date:
30 August 2022
Citation:
Wu, Y.-H., Liu, Y., Zhan, X., & Cheng, M.-M. (2022). P2T: Pyramid Pooling Transformer for Scene Understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1–12. https://doi.org/10.1109/tpami.2022.3202765
Abstract:
Recently, the vision transformer has achieved great success by pushing the state of the art on various vision tasks. One of the most challenging problems in the vision transformer is that the large sequence length of image tokens leads to high computational cost (quadratic complexity). A popular solution to this problem is to use a single pooling operation to reduce the sequence length. This paper considers how to improve existing vision transformers, where the pooled feature extracted by a single pooling operation seems less powerful. To this end, we note that pyramid pooling has been demonstrated to be effective in various vision tasks owing to its powerful ability in context abstraction. However, pyramid pooling has not been explored in backbone network design. To bridge this gap, we propose to adapt pyramid pooling to Multi-Head Self-Attention (MHSA) in the vision transformer, simultaneously reducing the sequence length and capturing powerful contextual features. Equipped with our pooling-based MHSA, we build a universal vision transformer backbone, dubbed Pyramid Pooling Transformer (P2T). Extensive experiments demonstrate that, when P2T is applied as the backbone network, it shows substantial superiority in various vision tasks such as image classification, semantic segmentation, object detection, and instance segmentation, compared to previous CNN- and transformer-based networks. The code will be released at https://github.com/yuhuan-wu/P2T.
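The core idea in the abstract, replacing the single pooling operation on keys/values with a pyramid of poolings at several ratios, can be illustrated with a minimal single-head NumPy sketch. The function names, pooling ratios, and feature sizes below are illustrative assumptions, not the paper's exact implementation (which the authors release at the linked repository):

```python
import numpy as np

def avg_pool(x, ratio):
    # x: (H, W, C) feature map; average-pool with kernel = stride = ratio.
    # Assumes H and W are divisible by ratio (true for the example below).
    H, W, C = x.shape
    return x.reshape(H // ratio, ratio, W // ratio, ratio, C).mean(axis=(1, 3))

def pyramid_pool_attention(x, pool_ratios=(2, 4, 8)):
    # Single-head sketch of pooling-based self-attention:
    # queries keep full resolution, while keys/values come from a
    # concatenation of multi-scale pooled tokens (a much shorter sequence).
    H, W, C = x.shape
    q = x.reshape(H * W, C)
    pooled = [avg_pool(x, r).reshape(-1, C) for r in pool_ratios]
    kv = np.concatenate(pooled, axis=0)          # (sum_r (H/r)*(W/r), C)
    attn = q @ kv.T / np.sqrt(C)                 # (H*W, N_kv) scaled scores
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)      # softmax over pooled tokens
    return attn @ kv                             # (H*W, C) attended features

x = np.random.randn(32, 32, 64)
out = pyramid_pool_attention(x)
```

With a 32x32 feature map and ratios (2, 4, 8), the key/value sequence has 16*16 + 8*8 + 4*4 = 336 tokens instead of 1024, so the attention matrix shrinks accordingly while the pooled tokens still summarize context at multiple scales.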
License type:
Publisher Copyright
Funding Info:
This research / project is supported by the Agency for Science, Technology and Research (A*STAR) - AME Programmatic Funds
Grant Reference no. : A1892b0026

This research / project is supported by the Agency for Science, Technology and Research (A*STAR) - AME Programmatic Funds
Grant Reference no. : A19E3b0099

This work is supported in part by Major Project for New Generation of AI under Grant No. 2018AAA0100400, in part by NSFC under Grant No. 61922046, in part by Alibaba Innovative Research (AIR) Program.
Description:
© 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
ISSN:
2160-9292
1939-3539
0162-8828
Files uploaded:

File: 21pami-p2t.pdf (691.92 KB, PDF)