Paper Title
Contrastive Language-Image Pre-Training with Knowledge Graphs
Paper Authors
Paper Abstract
Recent years have witnessed the fast development of large-scale pre-training frameworks that can extract multi-modal representations in a unified form and achieve promising performance when transferred to downstream tasks. Nevertheless, existing approaches mainly focus on pre-training with simple image-text pairs, while neglecting the semantic connections between concepts from different modalities. In this paper, we propose a knowledge-based pre-training framework, dubbed Knowledge-CLIP, which injects semantic information into the widely used CLIP model. By introducing knowledge-based objectives in the pre-training process and utilizing different types of knowledge graphs as training data, our model can semantically align the representations in vision and language with higher quality, and enhance the reasoning ability across scenarios and modalities. Extensive experiments on various vision-language downstream tasks demonstrate the effectiveness of Knowledge-CLIP compared with the original CLIP and competitive baselines.
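The contrastive objective that Knowledge-CLIP builds on can be sketched as follows. This is an illustrative reimplementation of the standard CLIP symmetric InfoNCE loss over a batch of paired image/text embeddings, not code from the paper; the function name, NumPy usage, and temperature value are assumptions for the sketch, and the paper's additional knowledge-based objectives are not shown.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss in the style of CLIP.

    img_emb, txt_emb: (N, D) arrays; row i of each is a matched pair.
    """
    # L2-normalize so that dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) similarity matrix

    def cross_entropy_diag(l):
        # The correct "class" for row i is column i: pair i matches pair i.
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

With perfectly aligned pairs the loss approaches zero, while mismatched pairs drive it up; knowledge-graph-based objectives would add further terms that constrain how related concepts are placed in this shared embedding space.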