End-to-End 3D Dense Captioning with Vote2Cap-DETR

Title:
End-to-End 3D Dense Captioning with Vote2Cap-DETR
Journal Title:
2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Keywords:
Publication Date:
22 August 2023
Citation:
Chen, S., Zhu, H., Chen, X., Lei, Y., Yu, G., & Chen, T. (2023, June). End-to-End 3D Dense Captioning with Vote2Cap-DETR. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). https://doi.org/10.1109/cvpr52729.2023.01070
Abstract:
3D dense captioning aims to generate multiple captions, each localized to its associated object region. Existing methods follow a sophisticated “detect-then-describe” pipeline equipped with numerous hand-crafted components. However, these hand-crafted components yield sub-optimal performance because object spatial and class distributions are cluttered and vary across scenes. In this paper, we propose Vote2Cap-DETR, a simple yet effective transformer framework based on the recently popular DEtection TRansformer (DETR). Compared with prior art, our framework has several appealing advantages: 1) Without resorting to numerous hand-crafted components, our method is based on a full transformer encoder-decoder architecture with a learnable vote-query-driven object decoder and a caption decoder that produces the dense captions in a set-prediction manner. 2) In contrast to the two-stage scheme, our method performs detection and captioning in one stage. 3) Without bells and whistles, extensive experiments on two commonly used datasets, ScanRefer and Nr3D, demonstrate that Vote2Cap-DETR surpasses the current state of the art by 11.13% and 7.11% in CIDEr@0.5IoU, respectively. Code will be released soon.
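Note on the CIDEr@0.5IoU metric: the abstract does not define it, so the following is only a minimal sketch under a commonly used convention for 3D dense captioning, in which a caption's score counts only when its predicted box overlaps the matched ground-truth box with IoU at or above a threshold k, and the thresholded scores are averaged over all evaluated objects. The box layout, the function names `box_iou_3d` and `score_at_iou`, and the use of axis-aligned boxes are assumptions made for illustration, not details taken from the paper.

```python
from typing import Sequence

# Hypothetical box layout: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Sequence[float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned 3D box."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def box_iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            # No overlap along this axis, so the boxes do not intersect.
            return 0.0
        intersection *= hi - lo
    union = box_volume(a) + box_volume(b) - intersection
    return intersection / union


def score_at_iou(
    caption_scores: Sequence[float],
    ious: Sequence[float],
    threshold: float = 0.5,
) -> float:
    """Average per-object caption scores, zeroing those below the IoU threshold.

    caption_scores[i] is any caption metric (for example CIDEr) for object i,
    and ious[i] is the IoU between its predicted and ground-truth boxes.
    """
    if len(caption_scores) != len(ious):
        raise ValueError("caption_scores and ious must have the same length")
    if not caption_scores:
        return 0.0
    kept = sum(s for s, iou in zip(caption_scores, ious) if iou >= threshold)
    return kept / len(caption_scores)
```

Under this convention a well-worded caption attached to a poorly localized box contributes nothing, which is why the metric rewards detection and captioning jointly.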
License type:
Publisher Copyright
Funding Info:
This research / project is supported by the A*STAR - MTC Programmatic
Grant Reference no. : A18A2b0046

This research / project is supported by the A*STAR - RobotHTPO
Grant Reference no. : C211518008

This research / project is supported by the Singapore Economic Development Board (EDB) - Space Technology Development Grant (STDP)
Grant Reference no. : S22-19016-STDP

This work is supported by the National Natural Science Foundation of China (Nos. U1909207, 62071127, and 62276176), the Shanghai Natural Science Foundation (No. 23ZR1402900), and the Zhejiang Lab Project (No. 2021KH0AB05).
Description:
© 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
ISSN:
2575-7075
Files uploaded:

File: chen-end-to-end-3d-dense-captioning-with-vote2cap-detr-cvpr-2023-paper.pdf
Size: 1.39 MB
Format: PDF