Paper Title
LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval
Paper Authors
Paper Abstract
Dual encoders and cross encoders have been widely used for image-text retrieval. Between the two, the dual encoder encodes the image and text independently followed by a dot product, while the cross encoder jointly feeds image and text as the input and performs dense multi-modal fusion. These two architectures are typically modeled separately without interaction. In this work, we propose LoopITR, which combines them in the same network for joint learning. Specifically, we let the dual encoder provide hard negatives to the cross encoder, and use the more discriminative cross encoder to distill its predictions back to the dual encoder. Both steps are efficiently performed together in the same model. Our work centers on empirical analyses of this combined architecture, putting the main focus on the design of the distillation objective. Our experimental results highlight the benefits of training the two encoders in the same network, and demonstrate that distillation can be quite effective with just a few hard negative examples. Experiments on two standard datasets (Flickr30K and COCO) show our approach achieves state-of-the-art dual encoder performance when compared with approaches using a similar amount of data.
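The loop described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The helper names (`dual_scores`, `mine_hard_negatives`, `distillation_loss`), the toy embeddings, and the fixed `teacher` scores standing in for a cross encoder are all hypothetical. The KL-divergence objective is only one plausible choice, since the abstract does not state which distillation objective the paper adopts.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dual_scores(image_emb, text_embs):
    """Dual encoder scoring: embeddings are computed independently
    and compared with a dot product."""
    return [dot(image_emb, t) for t in text_embs]


def mine_hard_negatives(scores, positive_idx, k):
    """Indices of the k highest-scoring texts that are not the match."""
    negatives = [i for i in range(len(scores)) if i != positive_idx]
    negatives.sort(key=lambda i: scores[i], reverse=True)
    return negatives[:k]


def distillation_loss(teacher_scores, student_scores):
    """KL(teacher || student) over softmax-normalized candidate scores.
    Assumed objective, chosen for illustration only."""
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy data: one image embedding and four text embeddings (made up).
image_emb = [1.0, 0.0]
text_embs = [[0.9, 0.1], [0.8, 0.3], [0.0, 1.0], [-1.0, 0.0]]
positive_idx = 0  # text 0 is the matching caption

# Step 1: the dual encoder scores every text and supplies hard negatives.
scores = dual_scores(image_emb, text_embs)
hard = mine_hard_negatives(scores, positive_idx, k=2)

# Step 2: a cross encoder would rescore the positive plus hard negatives.
# These teacher scores are placeholders, not real model output.
candidates = [positive_idx] + hard
student = [scores[i] for i in candidates]
teacher = [2.0, 0.5, -1.0]

# Step 3: distill the cross encoder's distribution into the dual encoder.
loss = distillation_loss(teacher, student)
print(hard, loss)
```

In this sketch only the positive and two mined negatives reach the teacher, which mirrors the abstract's observation that a few hard negatives can suffice. In a real system both score lists would come from trained networks and the loss would be backpropagated into the dual encoder.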