Paper Title
SAC: Semantic Attention Composition for Text-Conditioned Image Retrieval
Paper Authors
Paper Abstract
The ability to efficiently search for images is essential for improving the user experience across various products. Incorporating user feedback, via multi-modal inputs, to navigate visual search can help tailor retrieved results to specific user queries. We focus on the task of text-conditioned image retrieval, which utilizes supporting text feedback alongside a reference image to retrieve images that concurrently satisfy constraints imposed by both inputs. The task is challenging since it requires learning composite image-text features by incorporating multiple cross-granular semantic edits from text feedback and then applying them to visual features. To address this, we propose a novel framework, SAC, which resolves the above in two major steps: "where to see" (Semantic Feature Attention) and "how to change" (Semantic Feature Modification). We systematically show how our architecture streamlines the generation of text-aware image features by removing the need for various modules required by other state-of-the-art techniques. We present extensive quantitative and qualitative analyses, and ablation studies, to show that our architecture SAC outperforms existing techniques by achieving state-of-the-art performance on 3 benchmark datasets: FashionIQ, Shoes, and Birds-to-Words, while supporting natural language feedback of varying lengths.
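The two-step composition described above can be illustrated with a toy sketch: "where to see" as text-conditioned attention over visual region features, and "how to change" as a blending of the attended visual feature with the text feature. This is a minimal, hypothetical illustration of the general idea, not the paper's actual SAC architecture; the function names, the convex-combination blend, and the `alpha` mixing weight are all assumptions for exposition (in the paper these mappings are learned).

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def compose(image_regions, text_feat, alpha=0.5):
    """Toy two-step text-conditioned composition.

    1) "where to see": attend over image region features, using the
       text feature as the query (scaled dot-product attention).
    2) "how to change": blend the attended visual feature with the
       text feature via a convex combination. `alpha` is a
       hypothetical fixed mixing weight; a real model learns this
       modification as a parameterized transform.
    """
    d = len(text_feat)
    scores = [dot(region, text_feat) / math.sqrt(d) for region in image_regions]
    weights = softmax(scores)
    # Weighted sum of region features -> text-aware visual feature.
    attended = [
        sum(w * region[i] for w, region in zip(weights, image_regions))
        for i in range(d)
    ]
    return [alpha * a + (1 - alpha) * t for a, t in zip(attended, text_feat)]
```

In a retrieval setting, the composed feature would then be compared (e.g. by cosine similarity) against candidate image features; regions aligned with the text receive higher attention weight and dominate the composed query.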