Paper Title

CLIP4IDC: CLIP for Image Difference Captioning

Paper Authors

Zixin Guo, Tzu-Jui Julius Wang, Jorma Laaksonen

Paper Abstract

Image Difference Captioning (IDC) aims at generating sentences to describe differences between two similar-looking images. Conventional approaches learn an IDC model with a pre-trained and usually frozen visual feature extractor. Accordingly, two major issues may arise: (1) a large domain gap usually exists between the pre-training datasets used for training such a visual encoder and that of the downstream IDC task, and (2) the visual feature extractor, when separately encoding two images, often does not effectively encode the visual changes between two images. Due to the excellent zero-shot performance of the recently proposed CLIP, we thus propose CLIP4IDC to transfer a CLIP model for the IDC task to address those issues. Different from directly fine-tuning CLIP to generate sentences, we introduce an adaptation training process to adapt CLIP's visual encoder to capture and align differences in image pairs based on the textual descriptions. Experiments on three IDC benchmark datasets, CLEVR-Change, Spot-the-Diff, and Image-Editing-Request, demonstrate the effectiveness of CLIP4IDC.
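The adaptation stage described in the abstract can be read as a retrieval-style contrastive objective: the embedding of an image pair is pulled toward the embedding of its difference caption. Below is a minimal sketch of that idea, assuming OpenAI's `clip` package. The `PairFusion` head, the `adaptation_loss` function, and the fixed 0.07 temperature are illustrative assumptions, not the paper's implementation; in particular, CLIP4IDC adapts the vision encoder on image pairs directly, whereas this sketch fuses two separately encoded images for simplicity.

```python
# Minimal sketch of a pair-to-caption contrastive adaptation, assuming
# OpenAI's CLIP package (pip install git+https://github.com/openai/CLIP.git).
# PairFusion, adaptation_loss, and the 0.07 temperature are hypothetical
# simplifications of the paper's adaptation training.
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip


class PairFusion(nn.Module):
    """Hypothetical head fusing two image embeddings into one pair embedding."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, feat_before: torch.Tensor, feat_after: torch.Tensor) -> torch.Tensor:
        return self.proj(torch.cat([feat_before, feat_after], dim=-1))


def adaptation_loss(model, fusion, img_before, img_after, captions, device):
    """Symmetric InfoNCE between image-pair embeddings and difference captions."""
    v_before = model.encode_image(img_before).float()
    v_after = model.encode_image(img_after).float()
    pair = F.normalize(fusion(v_before, v_after), dim=-1)

    tokens = clip.tokenize(captions).to(device)
    text = F.normalize(model.encode_text(tokens).float(), dim=-1)

    logits = pair @ text.t() / 0.07  # fixed temperature is an assumption
    targets = torch.arange(len(captions), device=device)
    # Pull each pair embedding toward its caption, and each caption
    # toward its pair, as in CLIP's original contrastive loss.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2


device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
fusion = PairFusion(model.text_projection.shape[1]).to(device)

# Toy usage with random tensors standing in for preprocessed image batches.
img_before = torch.randn(4, 3, 224, 224, device=device)
img_after = torch.randn(4, 3, 224, 224, device=device)
captions = [
    "the red cube was removed",
    "a sphere changed its color",
    "the small cylinder moved left",
    "no change was made",
]
loss = adaptation_loss(model, fusion, img_before, img_after, captions, device)
loss.backward()
```

Per the abstract, this alignment stage precedes caption generation: the adapted encoder is subsequently fine-tuned with a decoder to produce the difference sentences, a stage not sketched here.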
