Paper Title
Learning Multimodal Data Augmentation in Feature Space
Paper Authors
Paper Abstract
The ability to jointly learn from multiple modalities, such as text, audio, and visual data, is a defining feature of intelligent systems. While there have been promising advances in designing neural networks to harness multimodal data, the enormous success of data augmentation currently remains limited to single-modality tasks like image classification. Indeed, it is particularly difficult to augment each modality while preserving the overall semantic structure of the data; for example, a caption may no longer be a good description of an image after standard augmentations have been applied, such as translation. Moreover, it is challenging to specify reasonable transformations that are not tailored to a particular modality. In this paper, we introduce LeMDA, Learning Multimodal Data Augmentation, an easy-to-use method that automatically learns to jointly augment multimodal data in feature space, with no constraints on the identities of the modalities or the relationship between modalities. We show that LeMDA can (1) profoundly improve the performance of multimodal deep learning architectures, (2) apply to combinations of modalities that have not been previously considered, and (3) achieve state-of-the-art results on a wide range of applications comprising image, text, and tabular data.
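To make the core idea concrete, below is a minimal conceptual sketch of feature-space augmentation for two modalities. This is not the authors' LeMDA implementation; the module and variable names (FeatureAugmenter, image_encoder, text_encoder, fusion_head) and the training objective shown are illustrative assumptions only.

```python
# Conceptual sketch: jointly augmenting multimodal data in feature space.
# NOT the authors' LeMDA code; all names and architectural choices here are hypothetical.
import torch
import torch.nn as nn

class FeatureAugmenter(nn.Module):
    """Small MLP that proposes perturbed versions of concatenated modality features."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, feats):
        # Additive perturbation keeps augmented features close to the originals.
        return feats + self.net(feats)

# Hypothetical per-modality encoders mapping raw inputs to a shared feature size.
image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))
text_encoder = nn.Linear(300, 64)
fusion_head = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
augmenter = FeatureAugmenter(dim=128)

images = torch.randn(8, 3, 32, 32)   # dummy image batch
text = torch.randn(8, 300)           # dummy text-embedding batch
labels = torch.randint(0, 10, (8,))

# Encode each modality, then jointly augment the concatenated features:
# the augmentation is modality-agnostic because it acts only on feature vectors.
feats = torch.cat([image_encoder(images), text_encoder(text)], dim=-1)
aug_feats = augmenter(feats)

# Train the task network on both original and augmented features.
# In the paper the augmentation network is itself learned jointly with the task
# network; that outer learning loop is omitted here for brevity.
loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(fusion_head(feats), labels) + loss_fn(fusion_head(aug_feats), labels)
loss.backward()
print(float(loss))
```

Operating on feature vectors rather than raw inputs sidesteps the problem described above, where an input-space transformation of one modality (e.g., editing an image) can break its relationship to another (e.g., its caption).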