Paper Title


CREPE: Can Vision-Language Foundation Models Reason Compositionally?

Paper Authors

Zixian Ma, Jerry Hong, Mustafa Omer Gul, Mona Gandhi, Irena Gao, Ranjay Krishna

Paper Abstract


A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large vision and language pretraining, we find that, across 7 architectures trained with 4 algorithms on massive datasets, these models struggle with compositionality. To arrive at this conclusion, we introduce a new compositionality evaluation benchmark, CREPE, which measures two important aspects of compositionality identified by the cognitive science literature: systematicity and productivity. To measure systematicity, CREPE consists of a test dataset containing over $370K$ image-text pairs and three different seen-unseen splits. The three splits are designed to test models trained on three popular training datasets: CC-12M, YFCC-15M, and LAION-400M. We also generate $325K$, $316K$, and $309K$ hard negative captions for a subset of the pairs. To test productivity, CREPE contains $17K$ image-text pairs spanning nine levels of complexity, plus $183K$ hard negative captions with atomic, swapping, and negation foils. The datasets are generated by repurposing the Visual Genome scene graphs and region descriptions and applying handcrafted templates and GPT-3. For systematicity, we find that model performance decreases consistently when novel compositions dominate the retrieval set, with Recall@1 dropping by up to $12\%$. For productivity, models' retrieval success decays as complexity increases, frequently nearing random chance at high complexity. These results hold regardless of model and training dataset size.
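The Recall@1 metric reported above can be illustrated with a minimal sketch: for each image, the model scores a set of candidate captions (the ground truth plus hard negatives), and the query counts as a hit only if the ground-truth caption is ranked first. The function below is a hypothetical illustration of that computation, not code from the CREPE benchmark; the score matrix here is toy data.

```python
def recall_at_1(scores, correct_idx):
    """Recall@1 for image-to-text retrieval: the fraction of queries whose
    top-scoring candidate caption is the ground-truth caption.

    scores:      one list of caption scores per image query
    correct_idx: index of the ground-truth caption for each query
    """
    hits = 0
    for row, gt in zip(scores, correct_idx):
        top1 = max(range(len(row)), key=lambda j: row[j])  # highest-scoring caption
        hits += int(top1 == gt)
    return hits / len(scores)

# Toy example: 3 image queries, each scored against 4 candidate captions
# (1 ground truth at index 0 + 3 hard negatives).
scores = [
    [0.9, 0.2, 0.1, 0.3],  # ground truth ranked first -> hit
    [0.4, 0.8, 0.3, 0.2],  # a hard negative ranked first -> miss
    [0.7, 0.1, 0.6, 0.2],  # ground truth ranked first -> hit
]
print(recall_at_1(scores, correct_idx=[0, 0, 0]))  # 2 of 3 correct -> ~0.667
```

Hard negatives make this metric strict: a model that captures the right atoms but not their composition will often rank a swapped or negated foil above the true caption, driving Recall@1 toward chance (here, 25% with four candidates).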
