Paper Title
Perceptual Grouping in Contrastive Vision-Language Models
Paper Authors
Paper Abstract
Recent advances in zero-shot image recognition suggest that vision-language models learn generic visual representations with a high degree of semantic information that may be arbitrarily probed with natural language phrases. Understanding an image, however, is not just about understanding what content resides within an image, but importantly, where that content resides. In this work we examine how well vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery. We demonstrate how contemporary vision and language representation learning models based on contrastive losses and large web-based data capture limited object localization information. We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information. We measure this performance in terms of zero-shot image recognition, unsupervised bottom-up and top-down semantic segmentations, as well as robustness analyses. We find that the resulting model achieves state-of-the-art results in terms of unsupervised segmentation, and demonstrate that the learned representations are uniquely robust to spurious correlations in datasets designed to probe the causal behavior of vision models.
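For reference, the contrastive objective the abstract alludes to is typically a symmetric InfoNCE loss over matched image-text pairs, as popularized by CLIP. Below is a minimal sketch in PyTorch; the function name, temperature value, and tensor shapes are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-text pairs.

    image_emb, text_emb: [batch, dim] embeddings from the two encoders.
    (Hypothetical sketch of a CLIP-style objective, not the paper's code.)
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise similarity logits between every image and every caption.
    logits = image_emb @ text_emb.t() / temperature  # [batch, batch]
    # Matched pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image
    return (loss_i2t + loss_t2i) / 2
```

The kind of spatial probing the abstract describes, asking where content resides rather than only what is present, can likewise be illustrated by scoring per-patch image features against text embeddings of class names. The sketch below is a generic illustration of that idea under assumed tensor shapes, not the authors' method.

```python
def patch_level_zero_shot_labels(patch_emb, class_text_emb):
    """Assign each image patch the label of its most similar text embedding.

    patch_emb:      [num_patches, dim] per-patch features from the image tower.
    class_text_emb: [num_classes, dim] embeddings of class-name prompts.
    Returns a [num_patches] tensor of class indices (a coarse segmentation).
    """
    patch_emb = F.normalize(patch_emb, dim=-1)
    class_text_emb = F.normalize(class_text_emb, dim=-1)
    similarity = patch_emb @ class_text_emb.t()  # [num_patches, num_classes]
    return similarity.argmax(dim=-1)
```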