Paper Title

WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models

Paper Authors

Sha Yuan, Shuai Zhao, Jiahong Leng, Zhao Xue, Hanyu Zhao, Peiyu Liu, Zheng Gong, Wayne Xin Zhao, Junyi Li, Jie Tang

Paper Abstract

Compared with domain-specific models, vision-language pre-training models (VLPMs) have shown superior performance on downstream tasks with a fast fine-tuning process. For example, ERNIE-ViL, Oscar and UNIMO trained VLPMs with a uniform transformer stack architecture and large amounts of image-text paired data, achieving remarkable results on downstream tasks such as image-text retrieval (IR and TR), visual question answering (VQA) and image captioning (IC). During the training phase, VLPMs are always fed with a combination of multiple public datasets to meet the demand for large-scale training data. However, due to the uneven data distribution in terms of size, task type and quality, using a mixture of multiple datasets for model training can be problematic. In this work, we introduce a large-scale multi-modal corpus named WuDaoMM, containing more than 650M image-text pairs in total. Specifically, about 600 million pairs were collected from webpages in which the image and caption are only weakly correlated, and the other 50 million strongly correlated image-text pairs were collected from high-quality graphic websites. We also release a base version of WuDaoMM with 5 million strongly correlated image-text pairs, which is sufficient to support common cross-modal pre-training. In addition, we trained both an understanding and a generation vision-language (VL) model to test the effectiveness of the dataset. The results show that WuDaoMM can serve as an efficient dataset for VLPMs, especially for models on the text-to-image generation task. The data is released at https://data.wudaoai.cn.
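
Since the released data is, at its core, image files paired with captions, a typical first step is to wrap the pairs in a data loader for cross-modal pre-training. The code below is only a minimal sketch in PyTorch: the annotation file name (`wudaomm_base.jsonl`), its JSON-lines layout with `image_path` and `caption` fields, and the `images/` directory are all assumptions made for illustration, not the actual release format documented at https://data.wudaoai.cn.

```python
# Minimal sketch of an image-text pair loader for cross-modal pre-training.
# The on-disk layout (a JSON-lines file with "image_path" and "caption"
# fields) is a hypothetical assumption; consult the WuDaoMM release at
# https://data.wudaoai.cn for the real format.
import json
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms


class ImageTextPairDataset(Dataset):
    """Yields (image_tensor, caption) pairs from a JSON-lines annotation file."""

    def __init__(self, annotation_file: str, image_root: str):
        self.image_root = Path(image_root)
        with open(annotation_file, "r", encoding="utf-8") as f:
            # One record per line: {"image_path": "...", "caption": "..."}
            self.records = [json.loads(line) for line in f]
        self.transform = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225]),
        ])

    def __len__(self) -> int:
        return len(self.records)

    def __getitem__(self, idx: int):
        record = self.records[idx]
        image = Image.open(self.image_root / record["image_path"]).convert("RGB")
        return self.transform(image), record["caption"]


if __name__ == "__main__":
    # Hypothetical paths; substitute the files from the downloaded release.
    dataset = ImageTextPairDataset("wudaomm_base.jsonl", "images/")
    loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)
    for images, captions in loader:
        pass  # feed each batch to a vision-language pre-training model
```

The 224x224 crop and ImageNet normalization are common defaults for vision backbones; a real pre-training pipeline would additionally tokenize the captions for the text encoder.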
