Paper Title

Cross-Modal Retrieval and Synthesis (X-MRS): Closing the Modality Gap in Shared Representation Learning

Paper Authors

Ricardo Guerrero, Hai Xuan Pham, Vladimir Pavlovic

Paper Abstract

Computational food analysis (CFA) naturally requires multi-modal evidence of a particular food, e.g., images and recipe text. A key to making CFA possible is multi-modal shared representation learning, which aims to create a joint representation of the multiple views (text and image) of the data. In this work, we propose a method for food-domain cross-modal shared representation learning that preserves the vast semantic richness present in the food data. Our proposed method employs an effective transformer-based multilingual recipe encoder coupled with a traditional image embedding architecture. Here, we propose the use of imperfect multilingual translations to effectively regularize the model while at the same time adding functionality across multiple languages and alphabets. Experimental analysis on the public Recipe1M dataset shows that the representation learned via the proposed method significantly outperforms the current state of the art (SOTA) on retrieval tasks. Furthermore, the representational power of the learned representation is demonstrated through a generative food image synthesis model conditioned on recipe embeddings. Synthesized images can effectively reproduce the visual appearance of paired samples, indicating that the learned representation captures the joint semantics of both the textual recipe and its visual content, thus narrowing the modality gap.
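The abstract outlines a dual-encoder design: a transformer-based recipe (text) encoder and a conventional CNN image encoder projected into one shared embedding space, with imperfect machine translations acting as text-side regularization. The sketch below illustrates that structure in PyTorch; the layer sizes, mean pooling, ResNet-50 backbone, triplet loss, language list, and the hypothetical `translate` / `augment_recipe` helpers are all illustrative assumptions for exposition, not the paper's exact configuration.

```python
# Minimal sketch of a cross-modal dual encoder, assuming PyTorch/torchvision.
# All hyperparameters and module choices are illustrative assumptions.
import random

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50


class RecipeEncoder(nn.Module):
    """Transformer-based recipe encoder -> shared embedding space."""

    def __init__(self, vocab_size=30000, d_model=512, embed_dim=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.proj = nn.Linear(d_model, embed_dim)

    def forward(self, token_ids):
        h = self.encoder(self.tok(token_ids))   # (B, T, d_model)
        pooled = h.mean(dim=1)                  # simple mean pooling
        return F.normalize(self.proj(pooled), dim=-1)


class ImageEncoder(nn.Module):
    """Conventional CNN image encoder -> same shared embedding space."""

    def __init__(self, embed_dim=1024):
        super().__init__()
        backbone = resnet50(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
        self.backbone = backbone

    def forward(self, images):
        return F.normalize(self.backbone(images), dim=-1)


def augment_recipe(recipe_text, translate, p=0.5,
                   langs=("de", "fr", "ru", "ko")):
    """With probability p, swap the recipe for an (imperfect) translation.

    `translate` is a hypothetical MT callable, e.g. translate(text, lang);
    training on such noisy alternate views regularizes the text encoder.
    """
    if random.random() < p:
        return translate(recipe_text, random.choice(langs))
    return recipe_text


if __name__ == "__main__":
    recipes = torch.randint(0, 30000, (4, 64))  # 4 tokenized recipes
    images = torch.randn(4, 3, 224, 224)        # 4 paired food images
    r = RecipeEncoder()(recipes)                # (4, 1024)
    v = ImageEncoder()(images)                  # (4, 1024)
    # Pull paired embeddings together; use a shuffled batch as negatives.
    loss = nn.TripletMarginLoss(margin=0.3)(r, v, v.roll(1, dims=0))
    print(loss.item())
```

Once both encoders map into the same normalized space, cross-modal retrieval reduces to nearest-neighbor search over embeddings, and a generative model conditioned on the recipe embedding can be used to synthesize the paired image, as the abstract describes.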
