Paper Title


Rethinking the Reference-based Distinctive Image Captioning

Authors

Yangjun Mao, Long Chen, Zhihong Jiang, Dong Zhang, Zhimeng Zhang, Jian Shao, Jun Xiao

Abstract

Distinctive Image Captioning (DIC) -- generating distinctive captions that describe the unique details of a target image -- has received considerable attention over the last few years. A recent DIC work proposes to generate distinctive captions by comparing the target image with a set of semantically similar reference images, i.e., reference-based DIC (Ref-DIC). It aims to ensure that the generated captions can tell the target and reference images apart. Unfortunately, the reference images used by existing Ref-DIC works are easy to distinguish: these reference images only resemble the target image at the scene level and have few common objects, such that a Ref-DIC model can trivially generate distinctive captions even without considering the reference images. To ensure that Ref-DIC models truly perceive the unique objects (or attributes) in target images, we first propose two new Ref-DIC benchmarks. Specifically, we design a two-stage matching mechanism, which strictly controls the similarity between the target and reference images at the object/attribute level (vs. the scene level). Second, to generate distinctive captions, we develop a strong Transformer-based Ref-DIC baseline, dubbed TransDIC. It not only extracts visual features from the target image, but also encodes the differences between objects in the target and reference images. Finally, for more trustworthy benchmarking, we propose a new evaluation metric for Ref-DIC named DisCIDEr, which evaluates both the accuracy and the distinctiveness of the generated captions. Experimental results demonstrate that our TransDIC can generate distinctive captions. Moreover, it outperforms several state-of-the-art models on the two new benchmarks across different metrics.
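The abstract's central idea -- scoring a caption on both its accuracy and how well it separates the target from the reference images -- can be illustrated with a toy sketch. This is NOT the paper's actual DisCIDEr formula; `toy_distinctive_score`, its inputs, and the multiplicative combination are all hypothetical, assuming similarity scores in [0, 1] from some caption-image matcher.

```python
# Illustrative sketch only -- not the paper's DisCIDEr metric.
# `accuracy` stands in for a CIDEr-style fidelity score; the
# similarity scores are assumed to come from some caption-image matcher.

def toy_distinctive_score(accuracy, target_sim, reference_sims):
    """Reward captions that match the target image more strongly
    than any of the semantically similar reference images."""
    # Distinctiveness: margin between the target match and the
    # best-matching reference image (clamped at zero).
    distinctiveness = max(0.0, target_sim - max(reference_sims))
    # Boost the accuracy score by the distinctiveness margin.
    return accuracy * (1.0 + distinctiveness)
```

A caption that matches the target (0.9) far better than any reference (best 0.5) gets a boosted score, while a caption that describes the references just as well earns no bonus -- capturing, in miniature, why a metric must weigh both dimensions.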
