Title

Will Large-scale Generative Models Corrupt Future Datasets?

Authors

Ryuichiro Hataya, Han Bao, Hiromi Arai

Abstract

Recently proposed large-scale text-to-image generative models such as DALL$\cdot$E 2, Midjourney, and StableDiffusion can generate high-quality, realistic images from users' prompts. Not limited to the research community, ordinary Internet users also enjoy these generative models, and consequently a tremendous number of generated images have been shared on the Internet. Meanwhile, today's success of deep learning in the computer vision field owes much to images collected from the Internet. These trends lead us to a research question: "\textbf{will such generated images impact the quality of future datasets and the performance of computer vision models positively or negatively?}" This paper answers this question empirically by simulating contamination. Namely, we generate ImageNet-scale and COCO-scale datasets using a state-of-the-art generative model and evaluate models trained on "contaminated" datasets on various tasks, including image classification and image generation. Throughout the experiments, we conclude that generated images negatively affect downstream performance, while the significance depends on the task and the amount of generated images. The generated datasets and the source code for the experiments are publicly available at \url{https://github.com/moskomule/dataset-contamination}.
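The contamination setup described above (training on a real dataset in which some fraction of images has been replaced by generated ones) can be illustrated with a minimal sketch. The `contaminate` helper below is hypothetical, not the authors' code; it only shows the size-preserving mixing at a given contamination ratio, with item lists standing in for image datasets.

```python
import random


def contaminate(real_items, generated_items, ratio, seed=0):
    """Replace a fraction `ratio` of `real_items` with samples from
    `generated_items`, keeping the total dataset size constant.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    assert 0.0 <= ratio <= 1.0
    rng = random.Random(seed)
    n_generated = int(len(real_items) * ratio)
    # Keep a random subset of real items and inject generated ones.
    kept_real = rng.sample(real_items, len(real_items) - n_generated)
    injected = rng.sample(generated_items, n_generated)
    mixed = kept_real + injected
    rng.shuffle(mixed)
    return mixed


real = [("real", i) for i in range(100)]
generated = [("gen", i) for i in range(100)]
mixed = contaminate(real, generated, ratio=0.2)
```

Sweeping `ratio` from 0 to 1 and retraining at each setting is what lets the paper quantify how downstream performance degrades with the amount of generated data.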
