Paper Title

SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation

Authors

Rita Ramos, Bruno Martins, Desmond Elliott, Yova Kementchedjhieva

Abstract

Recent advances in image captioning have focused on scaling the data and model size, substantially increasing the cost of pre-training and finetuning. As an alternative to large models, we present SmallCap, which generates a caption conditioned on an input image and related captions retrieved from a datastore. Our model is lightweight and fast to train, as the only learned parameters are in newly introduced cross-attention layers between a pre-trained CLIP encoder and GPT-2 decoder. SmallCap can transfer to new domains without additional finetuning and can exploit large-scale data in a training-free fashion since the contents of the datastore can be readily replaced. Our experiments show that SmallCap, trained only on COCO, has competitive performance on this benchmark, and also transfers to other domains without retraining, solely through retrieval from target-domain data. Further improvement is achieved through the training-free exploitation of diverse human-labeled and web data, which proves to be effective for a range of domains, including the nocaps benchmark, designed to test generalization to unseen visual concepts.
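The abstract describes two key ideas: captions are retrieved from a datastore using CLIP image-text similarity, and the retrieved captions are placed in a prompt that conditions the GPT-2 decoder alongside the CLIP image features. The sketch below illustrates only the retrieval-and-prompting step under illustrative assumptions; the datastore contents, the prompt wording, the value of k, and the file name `example.jpg` are placeholders, not the authors' exact configuration, and the learned cross-attention between encoder and decoder is not shown.

```python
# Minimal sketch of retrieval-augmented prompting as described in the abstract:
# use CLIP to find the captions in a datastore closest to the input image,
# then format them into a prompt for the decoder. Assumed details are marked.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical datastore: any collection of captions (e.g. human-labeled or web data);
# swapping these entries is what enables training-free domain transfer.
datastore = [
    "a dog runs across a grassy field",
    "a plate of pasta with tomato sauce",
    "a man riding a surfboard on a wave",
]

image = Image.open("example.jpg")  # placeholder input image

with torch.no_grad():
    image_inputs = processor(images=image, return_tensors="pt")
    image_emb = model.get_image_features(**image_inputs)
    text_inputs = processor(text=datastore, return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)

# Cosine similarity between the image and every caption in the datastore.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
scores = (image_emb @ text_emb.T).squeeze(0)
top_k = scores.topk(k=2).indices.tolist()  # k is an illustrative choice

# The retrieved captions form the decoder prompt; generation is then conditioned
# on this prompt and on the CLIP image features via the learned cross-attention.
retrieved = [datastore[i] for i in top_k]
prompt = "Similar images show " + ". ".join(retrieved) + ". This image shows"
print(prompt)
```

Because retrieval is a non-parametric lookup, replacing or enlarging the datastore changes the prompts without touching the trained cross-attention weights, which is how the paper exploits large-scale data in a training-free fashion.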
