Paper Title

Efficient Neural Architecture for Text-to-Image Synthesis

Paper Authors

Souza, Douglas M., Wehrmann, Jônatas, Ruiz, Duncan D.

Paper Abstract

Text-to-image synthesis is the task of generating images from text descriptions. Image generation, by itself, is a challenging task. When we combine image generation and text, we bring the complexity to a new level: we need to combine data from two different modalities. Most recent works in text-to-image synthesis follow a similar approach when it comes to neural architectures. Due to the aforementioned difficulties, plus the inherent difficulty of training GANs at high resolutions, most methods have adopted a multi-stage training strategy. In this paper, we shift the architectural paradigm currently used in text-to-image methods and show that an effective neural architecture can achieve state-of-the-art performance with single-stage training using a single generator and a single discriminator. We do so by applying deep residual networks along with a novel sentence interpolation strategy that enables learning a smooth conditional space. Finally, our work points to a new direction for text-to-image research, which has not experimented with novel neural architectures recently.
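The core conditioning idea mentioned in the abstract is to feed the generator an interpolation between sentence embeddings so that the learned conditional space stays smooth. The sketch below is a minimal illustration of that idea under stated assumptions, not the authors' exact implementation: the `text_encoder` and `generator` calls are hypothetical placeholders, and the uniform sampling of the interpolation weight is an assumption.

```python
import torch

def interpolate_sentence_embeddings(emb_a, emb_b, alpha=None):
    """Linearly interpolate between two sentence embeddings.

    emb_a, emb_b: tensors of shape (batch, embed_dim), e.g. embeddings of
    two different captions describing the same image.
    alpha: interpolation weight in [0, 1]; sampled uniformly per example
    if not provided (an assumption, not necessarily the paper's choice).
    """
    if alpha is None:
        alpha = torch.rand(emb_a.size(0), 1, device=emb_a.device)
    return alpha * emb_a + (1.0 - alpha) * emb_b

# Hypothetical usage: condition a single-stage generator on the
# interpolated embedding together with a noise vector.
# emb_a = text_encoder(caption_a)          # placeholder text encoder
# emb_b = text_encoder(caption_b)
# z = torch.randn(emb_a.size(0), 128)      # assumed noise dimension
# fake_images = generator(z, interpolate_sentence_embeddings(emb_a, emb_b))
```

Interpolating between captions in this way exposes the generator to conditioning vectors that lie between training sentences, which is one plausible reading of how a "smooth conditional space" can be encouraged.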
