Paper Title

Cap2Aug: Caption guided Image to Image data Augmentation

Authors

Aniket Roy, Anshul Shah, Ketul Shah, Anirban Roy, Rama Chellappa

Abstract

Visual recognition in a low-data regime is challenging and often prone to overfitting. To mitigate this issue, several data augmentation strategies have been proposed. However, standard transformations such as rotation, cropping, and flipping provide limited semantic variation. To this end, we propose Cap2Aug, an image-to-image diffusion model-based data augmentation strategy that uses image captions as text prompts. We generate captions from the limited training images and use these captions to edit the training images with an image-to-image stable diffusion model, producing semantically meaningful augmentations. This strategy generates augmented versions that stay close to the training images yet provide semantic diversity across samples. We show that variations within a class can be captured by the captions and then translated into diverse samples by the caption-guided image-to-image diffusion model. However, naive training on synthetic images is inadequate due to the domain gap between real and synthetic images. Thus, we employ a maximum mean discrepancy (MMD) loss to align the synthetic images with the real images and minimize this gap. We evaluate our method on few-shot and long-tail classification tasks and obtain improvements over the state of the art, especially in low-data regimes.
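As a concrete illustration of the caption-then-edit pipeline the abstract describes, the following sketch chains an off-the-shelf captioning model with an image-to-image Stable Diffusion pipeline. The specific checkpoints (`Salesforce/blip-image-captioning-base`, `runwayml/stable-diffusion-v1-5`), the file names, and the `strength` setting are illustrative assumptions, not the paper's reported configuration.

```python
# Minimal Cap2Aug-style sketch: caption a training image, then use that
# caption as the prompt for image-to-image editing. Model choices and
# hyperparameters are assumptions for illustration only.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from diffusers import StableDiffusionImg2ImgPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Caption the training image (assumed captioner: BLIP).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to(device)

image = Image.open("train_sample.jpg").convert("RGB")
inputs = processor(image, return_tensors="pt").to(device)
caption = processor.decode(
    captioner.generate(**inputs, max_new_tokens=30)[0], skip_special_tokens=True
)

# 2) Edit the image with an image-to-image diffusion model guided by the caption.
dtype = torch.float16 if device == "cuda" else torch.float32
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=dtype
).to(device)

# A moderate strength keeps the augmentation close to the source image
# while the caption injects semantic variation (the value is an assumption).
augmented = pipe(prompt=caption, image=image, strength=0.5).images[0]
augmented.save("augmented_sample.jpg")
```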
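The MMD alignment term is a standard quantity and can be sketched independently of the rest of the method. Below is a minimal PyTorch implementation of the biased squared-MMD estimator with a Gaussian kernel; the bandwidth `sigma` and the choice of penultimate-layer features are assumptions, since the abstract does not specify them.

```python
import torch

def gaussian_kernel(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Gaussian (RBF) kernel matrix between rows of x and rows of y."""
    sq_dist = torch.cdist(x, y) ** 2  # pairwise squared Euclidean distances
    return torch.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd_loss(real_feats: torch.Tensor, syn_feats: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased estimator of squared MMD between real and synthetic feature batches.

    MMD^2 = E[k(r, r')] + E[k(s, s')] - 2 E[k(r, s)]
    """
    k_rr = gaussian_kernel(real_feats, real_feats, sigma).mean()
    k_ss = gaussian_kernel(syn_feats, syn_feats, sigma).mean()
    k_rs = gaussian_kernel(real_feats, syn_feats, sigma).mean()
    return k_rr + k_ss - 2.0 * k_rs

# Usage: align features of synthetic (diffusion-generated) images with real ones.
real = torch.randn(16, 512)  # e.g., penultimate-layer features of real images
syn = torch.randn(16, 512)   # features of the caption-guided augmentations
loss = mmd_loss(real, syn, sigma=1.0)
```

Minimizing this loss alongside the classification objective pulls the synthetic feature distribution toward the real one, which is the stated purpose of the MMD term.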
