Paper Title


A Frustratingly Simple Approach for End-to-End Image Captioning

Paper Authors

Ziyang Luo, Yadong Xi, Rongsheng Zhang, Jing Ma

Paper Abstract


Image captioning is a fundamental task joining vision and language, concerning cross-modal understanding and text generation. Recent years have witnessed emerging attention on image captioning. Most existing works follow a traditional two-stage training paradigm: before the captioning model is trained, an extra object detector is first used to recognize the objects in the image. However, these approaches require sizeable datasets with fine-grained object annotations to train the object detector, which is a daunting task. In addition, errors of the object detector easily propagate to the downstream captioning model, degrading its performance. To alleviate such defects, we propose a frustratingly simple but highly effective end-to-end image captioning framework, Visual Conditioned GPT (VC-GPT), which connects a pre-trained visual encoder (CLIP-ViT) with a language decoder (GPT2). Unlike the vanilla connection method that directly inserts cross-attention modules into GPT2, we come up with a self-ensemble cross-modal fusion mechanism that comprehensively considers both single- and cross-modal knowledge. As a result, we do not need an extra object detector for model training. Experimental results on three popular image captioning benchmarks (MSCOCO, Flickr30k and NoCaps) demonstrate that VC-GPT achieves either the best or the second-best performance across all evaluation metrics compared with extensive baseline systems.
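The abstract only names the self-ensemble cross-modal fusion mechanism without spelling out its form. As a rough illustration of the idea (not the paper's exact implementation), the sketch below combines the single-modal text representation with a cross-modal one obtained by attending over visual features; the mixing weight `alpha`, the tensor shapes, and the use of a fixed scalar gate are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention: (T, d) x (S, d) -> (T, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def self_ensemble_fusion(text_hidden, visual_feats, alpha=0.5):
    """Sketch of a self-ensemble cross-modal fusion: ensemble the
    single-modal (text-only) hidden state with a cross-modal state
    produced by cross-attention over visual features. In practice
    the fusion weight would be learned, not a fixed scalar."""
    cross = attention(text_hidden, visual_feats, visual_feats)
    return alpha * text_hidden + (1.0 - alpha) * cross

rng = np.random.default_rng(0)
text_hidden = rng.normal(size=(5, 64))    # 5 text tokens (stand-in for a GPT2 layer output)
visual_feats = rng.normal(size=(49, 64))  # 49 visual patches (stand-in for CLIP-ViT output)
fused = self_ensemble_fusion(text_hidden, visual_feats)
print(fused.shape)  # (5, 64)
```

The point of the ensemble, as the abstract describes it, is that the decoder's single-modal knowledge is preserved alongside the injected cross-modal signal, rather than being overwritten by cross-attention layers inserted directly into GPT2.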
