Paper Title

Deep Multimodal Neural Architecture Search

Paper Authors

Zhou Yu, Yuhao Cui, Jun Yu, Meng Wang, Dacheng Tao, Qi Tian

Abstract


Designing effective neural networks is fundamentally important in deep multimodal learning. Most existing works focus on a single task and design neural architectures manually, which are highly task-specific and hard to generalize to different tasks. In this paper, we devise a generalized deep multimodal neural architecture search (MMnas) framework for various multimodal learning tasks. Given multimodal input, we first define a set of primitive operations, and then construct a deep encoder-decoder based unified backbone, where each encoder or decoder block corresponds to an operation searched from a predefined operation pool. On top of the unified backbone, we attach task-specific heads to tackle different multimodal learning tasks. By using a gradient-based NAS algorithm, the optimal architectures for different tasks are learned efficiently. Extensive ablation studies, comprehensive analysis, and comparative experimental results show that the obtained MMnasNet significantly outperforms existing state-of-the-art approaches across three multimodal learning tasks (over five datasets), including visual question answering, image-text matching, and visual grounding.
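The search described above follows the standard gradient-based NAS recipe: each block's discrete choice from the operation pool is relaxed into a softmax-weighted mixture over all candidates, so the architecture parameters can be learned by gradient descent alongside the network weights, and the final architecture keeps the highest-weighted operation per block. A minimal sketch of this relaxation, with an illustrative (hypothetical) operation pool rather than the paper's actual primitives:

```python
import numpy as np

# Hypothetical operation pool for one searchable block; the stand-in lambdas
# are placeholders, not the paper's actual primitive operations.
OPS = {
    "self_attention": lambda x: 1.1 * x,
    "feed_forward":   lambda x: x + 0.5,
    "skip":           lambda x: x,
}

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def mixed_op(x, alpha):
    """Continuous relaxation: the block output is a softmax-weighted sum over
    all candidate operations, making the choice differentiable w.r.t. alpha."""
    w = softmax(alpha)
    return sum(wi * op(x) for wi, op in zip(w, OPS.values()))

def derive_block(alpha):
    """After search converges, keep only the highest-weighted operation."""
    return list(OPS)[int(np.argmax(alpha))]

alpha = np.array([2.0, 0.1, -1.0])   # learned architecture parameters
y = mixed_op(np.ones(4), alpha)      # soft output used during search
print(derive_block(alpha))           # -> self_attention
```

During search, gradients flow into `alpha` through `mixed_op`; at the end, `derive_block` discretizes each block, yielding a task-specific architecture from the shared backbone.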
