Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning

Title:
Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning
Journal Title:
IEEE Transactions on Pattern Analysis and Machine Intelligence
Publication Date:
12 April 2024
Citation:
Chen, S., Zhu, H., Li, M., Chen, X., Guo, P., Lei, Y., Yu, G., Li, T., & Chen, T. (2024). Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1–17. https://doi.org/10.1109/tpami.2024.3387838
Abstract:
3D dense captioning requires a model to translate its understanding of an input 3D scene into several captions associated with different object regions. Existing methods adopt a sophisticated "detect-then-describe" pipeline, which builds explicit relation modules upon a 3D detector with numerous hand-crafted components. While these methods have achieved initial success, the cascade pipeline tends to accumulate errors because of duplicated and inaccurate box estimations and cluttered 3D scenes. In this paper, we first propose Vote2Cap-DETR, a simple-yet-effective transformer framework that decouples the decoding process of caption generation and object localization through parallel decoding. Moreover, we argue that object localization and description generation require different levels of scene understanding, which could be challenging for a shared set of queries to capture. To this end, we propose an advanced version, Vote2Cap-DETR++, which decouples the queries into localization and caption queries to capture task-specific features. Additionally, we introduce an iterative spatial refinement strategy for vote queries, which brings faster convergence and better localization performance. We also insert additional spatial information into the caption head for more accurate descriptions. Without bells and whistles, extensive experiments on two commonly used datasets, ScanRefer and Nr3D, demonstrate that Vote2Cap-DETR and Vote2Cap-DETR++ surpass conventional "detect-then-describe" methods by a large margin. We have made the code available at https://github.com/ch3cook-fdu/Vote2Cap-DETR.
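
The decoupled-query idea described in the abstract can be sketched in a few lines of PyTorch. The module below only illustrates parallel decoding with separate localization and caption query sets; every name, dimension, and the toy word-logit head are assumptions of this sketch, not the authors' implementation (the released code is at https://github.com/ch3cook-fdu/Vote2Cap-DETR).

# Minimal sketch of parallel decoding with decoupled queries.
# All module names, shapes, and heads are illustrative assumptions.
import torch
import torch.nn as nn

class DecoupledQueryDecoder(nn.Module):
    def __init__(self, d_model=256, num_queries=256, vocab_size=1000):
        super().__init__()
        # Separate learnable query sets, so localization and captioning
        # can each attend to task-specific scene features.
        self.loc_queries = nn.Embedding(num_queries, d_model)
        self.cap_queries = nn.Embedding(num_queries, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        # Parallel heads: boxes from localization queries, word logits
        # from caption queries; no "detect-then-describe" cascade.
        self.box_head = nn.Linear(d_model, 6)         # center (3) + size (3)
        self.cap_head = nn.Linear(d_model, vocab_size)  # toy captioner

    def forward(self, scene_tokens):
        # scene_tokens: (batch, num_encoded_points, d_model)
        B = scene_tokens.size(0)
        loc_q = self.loc_queries.weight.unsqueeze(0).expand(B, -1, -1)
        cap_q = self.cap_queries.weight.unsqueeze(0).expand(B, -1, -1)
        # Both query sets decode against the same encoded scene in parallel.
        loc_feat = self.decoder(loc_q, scene_tokens)
        cap_feat = self.decoder(cap_q, scene_tokens)
        boxes = self.box_head(loc_feat)       # (B, num_queries, 6)
        word_logits = self.cap_head(cap_feat) # (B, num_queries, vocab_size)
        return boxes, word_logits

scene = torch.randn(2, 1024, 256)  # stand-in for encoded point features
boxes, logits = DecoupledQueryDecoder()(scene)

In the sketch, the same decoder weights serve both query sets for brevity; the point is only that box prediction and caption generation run side by side rather than as a cascade.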
License type:
Publisher Copyright
Funding Info:
This research / project is supported by the A*STAR - MTC Programmatic Fund
Grant Reference no.: A18A2b0046

This research / project is supported by the National Natural Science Foundation of China
Grant Reference no.: 62071127, 62101137

This research / project is supported by the National Key Research and Development Program of China
Grant Reference no.: 2022ZD0160100

This research / project is supported by the Shanghai Natural Science Foundation
Grant Reference no.: 23ZR1402900

This research / project is supported by the RobotHTPO Seed Fund
Grant Reference no.: C211518008

This research / project is supported by the EDB - Space Technology Development Grant
Grant Reference no.: S22-19016-STDP
Description:
© 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
ISSN:
0162-8828
Files uploaded:
t-pami-revision-vote2cap-detr-3.pdf (355.27 KB, PDF)