Paper Title

Multimodal Masked Autoencoders Learn Transferable Representations

Paper Authors

Xinyang Geng, Hao Liu, Lisa Lee, Dale Schuurmans, Sergey Levine, Pieter Abbeel

Paper Abstract

Building scalable models to learn from diverse, multimodal data remains an open challenge. For vision-language data, the dominant approaches are based on contrastive learning objectives that train a separate encoder for each modality. While effective, contrastive learning approaches introduce sampling bias depending on the data augmentations used, which can degrade performance on downstream tasks. Moreover, these methods are limited to paired image-text data, and cannot leverage widely-available unpaired data. In this paper, we investigate whether a large multimodal model trained purely via masked token prediction, without using modality-specific encoders or contrastive learning, can learn transferable representations for downstream tasks. We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE), which learns a unified encoder for both vision and language data via masked token prediction. We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks. Surprisingly, we find that M3AE benefits from a higher text mask ratio (50-90%), in contrast to BERT whose standard masking ratio is 15%, due to the joint training of two data modalities. We also provide qualitative analysis showing that the learned representation incorporates meaningful information from both image and language. Lastly, we demonstrate the scalability of M3AE with larger model size and training time, and its flexibility to train on both paired image-text data as well as unpaired data.
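To make the masked token prediction objective described in the abstract concrete, below is a minimal, illustrative PyTorch sketch, not the authors' released implementation. Image patches and text tokens are embedded, concatenated into a single sequence, a high fraction of each modality is masked, and one shared Transformer reconstructs the masked patches and predicts the masked words. All names (`SimpleM3AE`), the 75% mask ratios, and the layer sizes are assumptions for illustration; for brevity, masked tokens are replaced by a learnable mask token here rather than dropped before the encoder as in MAE-style designs.

```python
# Minimal sketch of joint masked token prediction over image + text (assumed details).
import torch
import torch.nn as nn


class SimpleM3AE(nn.Module):
    def __init__(self, patch_dim=768, vocab_size=30522, dim=512,
                 img_mask_ratio=0.75, txt_mask_ratio=0.75):
        super().__init__()
        self.img_proj = nn.Linear(patch_dim, dim)          # embed image patches
        self.txt_emb = nn.Embedding(vocab_size, dim)       # embed text tokens
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)   # unified encoder
        dec_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=2)
        self.img_head = nn.Linear(dim, patch_dim)          # reconstruct masked patches
        self.txt_head = nn.Linear(dim, vocab_size)         # predict masked words
        self.img_mask_ratio = img_mask_ratio
        self.txt_mask_ratio = txt_mask_ratio

    def forward(self, patches, text_ids):
        # Embed both modalities and concatenate into one token sequence.
        img_tok = self.img_proj(patches)                   # (B, Ni, dim)
        txt_tok = self.txt_emb(text_ids)                   # (B, Nt, dim)
        tokens = torch.cat([img_tok, txt_tok], dim=1)      # (B, Ni + Nt, dim)

        # Randomly mask a high fraction of tokens in each modality (True = masked).
        B, Ni, Nt = patches.size(0), img_tok.size(1), txt_tok.size(1)
        mask = torch.cat([
            torch.rand(B, Ni, device=tokens.device) < self.img_mask_ratio,
            torch.rand(B, Nt, device=tokens.device) < self.txt_mask_ratio,
        ], dim=1)
        tokens = torch.where(mask.unsqueeze(-1),
                             self.mask_token.expand_as(tokens), tokens)

        # One shared encoder/decoder over the joint image-text sequence.
        hidden = self.decoder(self.encoder(tokens))

        # Reconstruction losses computed on masked positions only.
        img_loss = ((self.img_head(hidden[:, :Ni]) - patches) ** 2)[mask[:, :Ni]].mean()
        txt_logits = self.txt_head(hidden[:, Ni:])
        txt_loss = nn.functional.cross_entropy(
            txt_logits[mask[:, Ni:]], text_ids[mask[:, Ni:]])
        return img_loss + txt_loss


if __name__ == "__main__":
    model = SimpleM3AE()
    patches = torch.randn(2, 196, 768)              # e.g. 14x14 ViT patches
    text_ids = torch.randint(0, 30522, (2, 32))     # tokenized caption
    loss = model(patches, text_ids)
    loss.backward()
    print(loss.item())
```

The key design choice mirrored here is that there are no modality-specific encoders and no contrastive objective: both modalities pass through the same Transformer, and the text mask ratio can be set much higher than BERT's 15% (the abstract reports 50-90% working well).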
