Paper Title

ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding

Paper Authors

Le Xue, Mingfei Gao, Chen Xing, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, Silvio Savarese

Paper Abstract

The recognition capabilities of current state-of-the-art 3D models are limited by datasets with a small amount of annotated data and a pre-defined set of categories. Recent advances in the 2D counterpart have shown that similar problems can be significantly alleviated by employing knowledge from other modalities, such as language. Inspired by this, leveraging multimodal information for the 3D modality could be promising for improving 3D understanding under the restricted data regime, but this line of research is not well studied. Therefore, we introduce ULIP to learn a unified representation of images, texts, and 3D point clouds by pre-training with object triplets from the three modalities. To overcome the shortage of training triplets, ULIP leverages a pre-trained vision-language model that has already learned a common visual and textual space by training with massive image-text pairs. Then, ULIP learns a 3D representation space aligned with the common image-text space, using a small number of automatically synthesized triplets. ULIP is agnostic to 3D backbone networks and can easily be integrated into any 3D architecture. Experiments show that ULIP effectively improves the performance of multiple recent 3D backbones by simply pre-training them on ShapeNet55 using our framework, achieving state-of-the-art performance in both standard 3D classification and zero-shot 3D classification on ModelNet40 and ScanObjectNN. ULIP also improves the performance of PointMLP by around 3% in 3D classification on ScanObjectNN, and outperforms PointCLIP by 28.8% in top-1 accuracy for zero-shot 3D classification on ModelNet40. Our code and pre-trained models are released at https://github.com/salesforce/ULIP.
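
To make the alignment idea concrete, below is a minimal PyTorch sketch of the kind of training objective the abstract describes: a frozen CLIP-like model supplies the shared image-text embedding space, and a trainable 3D encoder is pulled into that space with symmetric contrastive losses over (image, text, point cloud) triplets. All names here (`ULIPStyleAligner`, `contrastive_loss`, `training_step`) are hypothetical illustrations, not the authors' API; the official implementation is in the linked repository.

```python
# Hypothetical sketch of ULIP-style cross-modal alignment (not the official code;
# see https://github.com/salesforce/ULIP). A frozen CLIP-like model is assumed to
# provide the image/text embeddings; only the 3D encoder and its projection train.
import torch
import torch.nn as nn
import torch.nn.functional as F


def contrastive_loss(a: torch.Tensor, b: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss between two batches of embeddings; matched
    (a_i, b_i) pairs sit on the diagonal of the similarity matrix."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                    # (B, B) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)  # positives on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2


class ULIPStyleAligner(nn.Module):
    """Wraps an arbitrary 3D backbone (e.g. PointMLP) and projects its features
    into the frozen image-text embedding space."""

    def __init__(self, point_backbone: nn.Module, feat_dim: int, embed_dim: int = 512):
        super().__init__()
        self.point_backbone = point_backbone        # any encoder: points -> (B, feat_dim)
        self.proj = nn.Linear(feat_dim, embed_dim)  # map 3D features into CLIP space

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        return self.proj(self.point_backbone(points))


def training_step(model: ULIPStyleAligner,
                  image_emb: torch.Tensor,  # from the frozen CLIP image encoder
                  text_emb: torch.Tensor,   # from the frozen CLIP text encoder
                  points: torch.Tensor) -> torch.Tensor:
    """Loss for one batch of (image, text, point cloud) triplets: align the 3D
    embeddings with both sides of the pre-trained image-text space."""
    pc_emb = model(points)
    return contrastive_loss(pc_emb, image_emb) + contrastive_loss(pc_emb, text_emb)
```

Because gradients flow only through the point-cloud backbone and its projection head while the image-text space stays fixed, any 3D architecture can be dropped in, which is what makes the approach backbone-agnostic.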
