Paper Title

X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning

Paper Authors

Zhihao Yuan, Xu Yan, Yinghong Liao, Yao Guo, Guanbin Li, Zhen Li, Shuguang Cui

Paper Abstract

3D dense captioning aims to describe individual objects in 3D scenes with natural language, where the scenes are usually represented as RGB-D scans or point clouds. However, by exploiting only single-modal information, e.g., point clouds, previous approaches fail to produce faithful descriptions. Aggregating 2D features into point clouds can help, but it introduces an extra computational burden, especially at inference time. In this study, we investigate cross-modal knowledge transfer using a Transformer for 3D dense captioning, X-Trans2Cap, which effectively boosts single-modal 3D captioning through knowledge distillation in a teacher-student framework. In practice, during training the teacher network exploits the auxiliary 2D modality and guides the student network, which takes only point clouds as input, through feature consistency constraints. Owing to the well-designed cross-modal feature fusion module and the feature alignment in the training phase, X-Trans2Cap readily acquires the rich appearance information embedded in 2D images. Thus, more faithful captions can be generated using only point clouds at inference. Qualitative and quantitative results confirm that X-Trans2Cap outperforms the previous state-of-the-art by a large margin, i.e., about +21 and +16 absolute CIDEr points on the ScanRefer and Nr3D datasets, respectively.
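To make the teacher-student setup concrete, below is a minimal PyTorch sketch of training with a feature consistency constraint: a teacher that fuses 2D and 3D object features and a student that sees only 3D features, with the student aligned to the teacher's features while the teacher's 2D input is dropped at inference. This is not the authors' implementation; the module names (CaptionBranch, Teacher, Student), feature dimensions, the concatenation-plus-linear fusion, and the choice of MSE as the consistency loss are illustrative assumptions, and the paper's actual cross-modal fusion module and loss formulation may differ.

```python
# Minimal sketch (not the authors' code): teacher-student distillation with a
# feature consistency constraint. All names, dimensions, and losses are
# illustrative assumptions.
import torch
import torch.nn as nn

class CaptionBranch(nn.Module):
    """Toy stand-in for a Transformer captioning head over per-object features."""
    def __init__(self, feat_dim=256, vocab_size=1000):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.word_head = nn.Linear(feat_dim, vocab_size)

    def forward(self, obj_feats):                 # (B, N_obj, feat_dim)
        fused = self.encoder(obj_feats)           # contextualized object features
        return fused, self.word_head(fused)       # features + per-object word logits

class Teacher(nn.Module):
    """Teacher sees both 3D point-cloud features and auxiliary 2D image features."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.fuse_2d = nn.Linear(2 * feat_dim, feat_dim)   # simplistic cross-modal fusion
        self.captioner = CaptionBranch(feat_dim)

    def forward(self, feats_3d, feats_2d):
        fused = self.fuse_2d(torch.cat([feats_3d, feats_2d], dim=-1))
        return self.captioner(fused)

class Student(nn.Module):
    """Student takes only 3D features; no 2D input is needed at inference."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.captioner = CaptionBranch(feat_dim)

    def forward(self, feats_3d):
        return self.captioner(feats_3d)

if __name__ == "__main__":
    B, N, D, V = 2, 8, 256, 1000
    feats_3d = torch.randn(B, N, D)               # per-object 3D proposal features
    feats_2d = torch.randn(B, N, D)               # per-object 2D features (training only)
    targets = torch.randint(0, V, (B, N))         # toy word targets per object

    teacher, student = Teacher(D), Student(D)
    t_feats, _ = teacher(feats_3d, feats_2d)
    s_feats, s_logits = student(feats_3d)

    caption_loss = nn.CrossEntropyLoss()(s_logits.reshape(-1, V), targets.reshape(-1))
    consistency_loss = nn.MSELoss()(s_feats, t_feats.detach())   # align student to teacher
    loss = caption_loss + consistency_loss
    loss.backward()
    print(f"caption={caption_loss.item():.3f} consistency={consistency_loss.item():.3f}")
```

The key point the sketch illustrates is that the 2D-dependent teacher only shapes the student's features through the consistency term during training, so at inference the student runs on point clouds alone with no extra 2D processing cost.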
