Paper Title

ARMANI: Part-level Garment-Text Alignment for Unified Cross-Modal Fashion Design

Authors

Xujie Zhang, Yu Sha, Michael C. Kampffmeyer, Zhenyu Xie, Zequn Jie, Chengwen Huang, Jianqing Peng, Xiaodan Liang

Abstract

Cross-modal fashion image synthesis has emerged as one of the most promising directions in the generation domain due to the vast untapped potential of incorporating multiple modalities and the wide range of fashion image applications. To facilitate accurate generation, cross-modal synthesis methods typically rely on Contrastive Language-Image Pre-training (CLIP) to align textual and garment information. In this work, we argue that simply aligning texture and garment information is not sufficient to capture the semantics of the visual information and therefore propose MaskCLIP. MaskCLIP decomposes the garments into semantic parts, ensuring fine-grained and semantically accurate alignment between the visual and text information. Building on MaskCLIP, we propose ARMANI, a unified cross-modal fashion designer with part-level garment-text alignment. ARMANI discretizes an image into uniform tokens based on a learned cross-modal codebook in its first stage and uses a Transformer to model the distribution of image tokens for a real image given the tokens of the control signals in its second stage. Contrary to prior approaches that also rely on two-stage paradigms, ARMANI introduces textual tokens into the codebook, making it possible for the model to utilize fine-grained semantic information to generate more realistic images. Further, by introducing a cross-modal Transformer, ARMANI is versatile and can accomplish image synthesis from various control signals, such as pure text, sketch images, and partial images. Extensive experiments conducted on our newly collected cross-modal fashion dataset demonstrate that ARMANI generates photo-realistic images in diverse synthesis tasks and outperforms existing state-of-the-art cross-modal image synthesis approaches. Our code is available at https://github.com/Harvey594/ARMANI.
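The abstract's two-stage paradigm can be sketched in miniature: the first stage maps image features to discrete codebook indices, and textual tokens are placed in the same token space so a second-stage Transformer can consume one unified control-plus-image sequence. This is a minimal illustrative sketch based only on the abstract; all sizes, names, and the nearest-neighbor quantizer are assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

CODEBOOK_SIZE = 16   # number of visual codes (assumed for illustration)
EMBED_DIM = 8        # code embedding dimension (assumed)
codebook = rng.normal(size=(CODEBOOK_SIZE, EMBED_DIM))

def tokenize_image(patches: np.ndarray) -> np.ndarray:
    """Stage 1 sketch: map each patch embedding to its nearest codebook index."""
    # squared Euclidean distance between every patch and every codebook entry
    dists = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

# six fake patch embeddings standing in for image-encoder features
patches = rng.normal(size=(6, EMBED_DIM))
image_tokens = tokenize_image(patches)

# Textual tokens get a disjoint id range so the stage-2 Transformer sees a
# single unified sequence: control-signal tokens first, then image tokens.
text_tokens = np.array([CODEBOOK_SIZE + 1, CODEBOOK_SIZE + 4])  # hypothetical ids
sequence = np.concatenate([text_tokens, image_tokens])

print(sequence.shape)  # one cross-modal token sequence for the Transformer
```

In the real system the quantizer and codebook are learned jointly with the text alignment, and the Transformer models the conditional distribution of image tokens given the control tokens; the sketch only shows the shared-token-space idea.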
