Paper Title
IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval
Paper Authors
Paper Abstract
Enabling bi-directional retrieval of images and texts is important for understanding the correspondence between vision and language. Existing methods leverage the attention mechanism to explore such correspondence in a fine-grained manner. However, most of them consider all semantics equally and thus align them uniformly, regardless of their diverse complexities. In fact, semantics are diverse (i.e., involving different kinds of semantic concepts), and humans usually follow a latent structure to combine them into understandable language. It is therefore difficult for existing methods to optimally capture such sophisticated correspondences. In this paper, to address this deficiency, we propose an Iterative Matching with Recurrent Attention Memory (IMRAM) method, in which the correspondence between images and texts is captured through multiple steps of alignment. Specifically, we introduce an iterative matching scheme to explore such fine-grained correspondence progressively, and a memory distillation unit to refine alignment knowledge from early steps to later ones. Experimental results on three benchmark datasets, i.e., Flickr8K, Flickr30K, and MS COCO, show that our IMRAM achieves state-of-the-art performance, well demonstrating its effectiveness. Experiments on a practical business advertisement dataset, named \Ads{}, further validate the applicability of our method in real-world scenarios.
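To make the described pipeline more concrete, the sketch below shows, in PyTorch-style Python, what one matching step (cross-modal attention followed by a gated memory-distillation update of the query features) and the iterative accumulation of alignment scores could look like. The layer choices, the gating form, and the names RecurrentAttentionMemory and iterative_matching are illustrative assumptions for exposition only, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RecurrentAttentionMemory(nn.Module):
    # One matching step: attention from query features (e.g. words) to context
    # features (e.g. image regions), followed by a gated "memory distillation"
    # update that refines the query for the next step. Illustrative only.
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)    # decides how much new evidence to keep
        self.update = nn.Linear(2 * dim, dim)  # proposes the refined query features

    def forward(self, query, context):
        # query: (B, Nq, D), context: (B, Nc, D)
        attn = torch.softmax(query @ context.transpose(1, 2) / query.size(-1) ** 0.5, dim=-1)
        attended = attn @ context                          # attended context per query item
        fused = torch.cat([query, attended], dim=-1)
        g = torch.sigmoid(self.gate(fused))
        refined = g * torch.tanh(self.update(fused)) + (1 - g) * query
        return attended, refined


def iterative_matching(words, regions, ram, num_steps=3):
    # Accumulate an alignment score over several matching steps, refining the
    # query between steps (the "iterative matching" idea in the abstract).
    query, score = words, 0.0
    for _ in range(num_steps):
        attended, refined = ram(query, regions)
        score = score + F.cosine_similarity(query, attended, dim=-1).mean(dim=-1)
        query = refined
    return score  # (B,) similarity per image-text pair


# Example: score a batch of 2 image-text pairs with 36 regions, 12 words, dim 256.
ram = RecurrentAttentionMemory(dim=256)
scores = iterative_matching(torch.randn(2, 12, 256), torch.randn(2, 36, 256), ram)
```

In this sketch the alignment score is summed over steps, so later steps only add evidence on top of what earlier steps already captured, which mirrors the abstract's claim that alignment knowledge is refined progressively rather than computed in a single uniform pass.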