Paper Title
Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training
Paper Authors
Paper Abstract
Large-scale multi-modal contrastive pre-training has demonstrated great utility in learning transferable features for a range of downstream tasks by mapping multiple modalities into a shared embedding space. Typically, this has employed separate encoders for each modality. However, recent work suggests that transformers can support learning across multiple modalities and allow knowledge sharing. Inspired by this, we investigate a variety of Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks. More specifically, we question how many parameters of a transformer model can be shared across modalities during contrastive pre-training, and rigorously examine architectural design choices that position the proportion of shared parameters along a spectrum. Under the studied conditions, we observe that a mostly unified encoder for vision and language signals outperforms all other variations that separate more parameters. Additionally, we find that lightweight modality-specific parallel modules further improve performance. Experimental results show that the proposed MS-CLIP approach outperforms vanilla CLIP by up to 13\% relative in zero-shot ImageNet classification (pre-trained on YFCC-100M), while simultaneously supporting a reduction in parameters. In addition, our approach outperforms vanilla CLIP by 1.6 points in linear probing on a collection of 24 downstream vision tasks. Furthermore, we discover that sharing parameters leads to semantic concepts from different modalities being encoded more closely in the embedding space, facilitating the transfer of common semantic structure (e.g., attention patterns) from language to vision. Code is available at \href{https://github.com/Hxyou/MSCLIP}{URL}.
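To make the abstract's core idea concrete, below is a minimal, illustrative PyTorch sketch, not the authors' implementation (see the linked GitHub repository for the official code). It shows a transformer whose blocks are shared between the image and text branches, lightweight modality-specific parallel adapters, and a CLIP-style symmetric contrastive loss. All class names, layer sizes, and hyperparameters here are assumptions for illustration only.

```python
# Illustrative sketch of a modality-shared contrastive language-image model.
# Not the official MS-CLIP code; shapes and module choices are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedTransformer(nn.Module):
    """Transformer blocks whose weights are reused by both modalities."""
    def __init__(self, dim=512, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        # Hypothetical lightweight modality-specific parallel adapters.
        self.img_adapter = nn.Sequential(nn.Linear(dim, dim // 4), nn.GELU(), nn.Linear(dim // 4, dim))
        self.txt_adapter = nn.Sequential(nn.Linear(dim, dim // 4), nn.GELU(), nn.Linear(dim // 4, dim))

    def forward(self, x, modality):
        shared = self.blocks(x)  # modality-shared computation
        adapter = self.img_adapter if modality == "image" else self.txt_adapter
        return shared + adapter(x)  # add the modality-specific parallel branch


class MSCLIPSketch(nn.Module):
    def __init__(self, dim=512, vocab_size=49408):
        super().__init__()
        # Modality-specific input stems: patch projection vs. token embedding.
        self.patch_embed = nn.Linear(3 * 16 * 16, dim)  # flattened 16x16 RGB patches
        self.token_embed = nn.Embedding(vocab_size, dim)
        self.encoder = SharedTransformer(dim=dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~log(1/0.07), as in CLIP

    def encode_image(self, patches):
        h = self.encoder(self.patch_embed(patches), modality="image")
        return F.normalize(h.mean(dim=1), dim=-1)  # pooled, unit-norm embedding

    def encode_text(self, tokens):
        h = self.encoder(self.token_embed(tokens), modality="text")
        return F.normalize(h.mean(dim=1), dim=-1)

    def forward(self, patches, tokens):
        img, txt = self.encode_image(patches), self.encode_text(tokens)
        logits = self.logit_scale.exp() * img @ txt.t()  # pairwise similarities
        labels = torch.arange(img.size(0))
        # Symmetric InfoNCE: match images to their paired captions and vice versa.
        return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


# Toy usage with a batch of 8 image-text pairs of random data.
model = MSCLIPSketch()
patches = torch.randn(8, 49, 3 * 16 * 16)      # 8 images, 49 flattened patches each
tokens = torch.randint(0, 49408, (8, 16))      # 8 captions, 16 token ids each
loss = model(patches, tokens)
loss.backward()
```

The sketch places the spectrum of design choices studied in the paper in one spot: sharing the transformer blocks corresponds to the "mostly unified encoder," while the small parallel adapters stand in for the lightweight modality-specific modules reported to further improve performance.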