Paper Title
Improving Visual-Semantic Embedding with Adaptive Pooling and Optimization Objective
Paper Authors
Paper Abstract
Visual-Semantic Embedding (VSE) aims to learn an embedding space in which related visual and semantic instances are close to each other. Recent VSE models tend to design complex structures to pool visual and semantic features into fixed-length vectors and use a hard triplet loss for optimization. However, we find that: (1) combining simple pooling methods is no worse than these sophisticated methods; and (2) considering only the hardest-to-distinguish negative sample leads to slow convergence and poor Recall@K improvement. To this end, we propose an adaptive pooling strategy that allows the model to learn how to aggregate features through a combination of simple pooling methods. We also introduce a strategy that dynamically selects a group of negative samples, making the optimization converge faster and perform better. Experimental results on Flickr30K and MS-COCO demonstrate that a standard VSE equipped with our pooling and optimization strategies outperforms current state-of-the-art systems by at least 1.0% on recall metrics for both image-to-text and text-to-image retrieval. The source code for our experiments is available at https://github.com/96-Zachary/vse_2ad.
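
To make the first idea concrete, below is a minimal sketch of pooling a variable-length set of features through a learned combination of simple pooling operators. The operator set (mean and max), the softmax weighting, and the `AdaptivePooling` name are illustrative assumptions, not the paper's exact formulation; see the linked repository for the actual model.

```python
# Sketch of the adaptive-pooling idea: combine simple pooling operators
# (here mean and max; an assumption) with learnable convex weights.
import torch
import torch.nn as nn

class AdaptivePooling(nn.Module):
    def __init__(self, num_ops: int = 2):
        super().__init__()
        # One learnable logit per simple pooling operator.
        self.logits = nn.Parameter(torch.zeros(num_ops))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, seq_len, dim) region or word features.
        pooled = torch.stack(
            [feats.mean(dim=1), feats.max(dim=1).values], dim=0
        )  # (num_ops, batch, dim)
        weights = torch.softmax(self.logits, dim=0)  # convex combination
        return torch.einsum("o,obd->bd", weights, pooled)

# Usage: pool 36 region features of dim 1024 into one fixed-length vector.
pool = AdaptivePooling()
vec = pool(torch.randn(8, 36, 1024))  # -> (8, 1024)
```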
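The second idea can likewise be sketched as a triplet-style loss that averages the hinge term over the k hardest negatives per query instead of keeping only the single hardest one. The top-k selection rule and the `group_hard_triplet_loss` helper are assumptions for illustration; the paper's dynamic selection strategy may differ.

```python
import torch

def group_hard_triplet_loss(sim: torch.Tensor, margin: float = 0.2,
                            k: int = 5) -> torch.Tensor:
    """Hinge loss averaged over the k hardest negatives per query
    (a sketch of the abstract's "group of negative samples" idea).
    sim: (batch, batch) image-text similarities, diagonal = positive pairs.
    Assumes k < batch size.
    """
    pos = sim.diag().unsqueeze(1)  # (batch, 1) positive-pair similarities
    # Mask the diagonal so positives are never selected as negatives.
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(eye, float("-inf"))
    # Image-to-text: k hardest caption negatives per image (rows).
    hard_i2t = neg.topk(k, dim=1).values               # (batch, k)
    loss_i2t = (margin + hard_i2t - pos).clamp(min=0).mean()
    # Text-to-image: k hardest image negatives per caption (columns).
    hard_t2i = neg.topk(k, dim=0).values               # (k, batch)
    loss_t2i = (margin + hard_t2i - pos.t()).clamp(min=0).mean()
    return loss_i2t + loss_t2i
```

Averaging over a group of hard negatives yields denser gradients per update than a single-hardest-negative loss, which is consistent with the faster convergence the abstract reports.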