Paper Title
Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval
Paper Authors
Paper Abstract
Cross-modal image-recipe retrieval has gained significant attention in recent years. Most work focuses on improving cross-modal embeddings using unimodal encoders, which allow efficient retrieval in large-scale databases, leaving aside cross-attention between modalities, which is more computationally expensive. We propose a new retrieval framework, T-Food (Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval), that exploits the interaction between modalities in a novel regularization scheme, while using only unimodal encoders at test time for efficient retrieval. We also capture the intra-dependencies between recipe entities with a dedicated recipe encoder, and propose new variants of triplet losses with dynamic margins that adapt to the difficulty of the task. Finally, we leverage the power of recent Vision and Language Pretraining (VLP) models such as CLIP for the image encoder. Our approach outperforms existing approaches by a large margin on the Recipe1M dataset. Specifically, we achieve absolute improvements of +8.1% (72.6 R@1) and +10.9% (44.6 R@1) on the 1k and 10k test sets, respectively. The code is available here: https://github.com/mshukor/TFood
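The abstract mentions triplet losses with dynamic margins that adapt to task difficulty. As a rough illustration only, the sketch below shows one plausible way a triplet margin could be made to depend on how hard the in-batch negatives are; the function name, hyperparameters, and the specific adaptation rule are assumptions made here for illustration and are not taken from the paper (see the linked repository for the actual formulation).

```python
# Hypothetical sketch of a triplet loss with a dynamic margin.
# Assumption: the margin grows when the hardest in-batch negative is
# close to the positive, i.e. harder batches enforce a larger separation.
# This is NOT the exact rule used by T-Food.
import torch
import torch.nn.functional as F


def dynamic_margin_triplet_loss(img_emb, rec_emb, base_margin=0.3, scale=0.2):
    """img_emb, rec_emb: (B, D) L2-normalized embeddings of matching image-recipe pairs."""
    sims = img_emb @ rec_emb.t()                       # (B, B) cosine similarities
    pos = sims.diag().unsqueeze(1)                     # similarity of each positive pair
    mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    neg = sims.masked_fill(mask, float("-inf"))        # mask out the positives
    hardest_neg, _ = neg.max(dim=1, keepdim=True)      # hardest in-batch negative per anchor
    # Margin adapts to difficulty: larger when the hardest negative scores high.
    margin = base_margin + scale * hardest_neg.clamp(min=0.0).detach()
    loss = F.relu(margin + hardest_neg - pos).mean()
    return loss
```

At test time, only the unimodal image and recipe encoders would be run, so retrieval reduces to a dot product between precomputed embeddings, which is what makes large-scale search efficient despite the cross-modal interaction used during training.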