Paper Title
VL-Taboo: An Analysis of Attribute-based Zero-shot Capabilities of Vision-Language Models
Paper Authors
Paper Abstract
Vision-language models trained on large, randomly collected data have had a significant impact in many areas since they appeared. But while they show strong performance on various tasks, such as image-text retrieval, their inner workings are still not fully understood. The current work analyses the true zero-shot capabilities of those models. We start with an analysis of the training corpus, assessing to what extent (and which of) the test classes are really zero-shot, and how this correlates with individual class performance. We follow up with an analysis of the attribute-based zero-shot learning capabilities of these models, evaluating how well this classical zero-shot notion emerges from large-scale webly supervision. We leverage the recently released LAION400M data corpus as well as the publicly available pretrained models of CLIP, OpenCLIP, and FLAVA, evaluating the attribute-based zero-shot capabilities on the CUB and AWA2 benchmarks. Our analysis shows that: (i) most of the classes in popular zero-shot benchmarks are observed (a lot) during pre-training; (ii) zero-shot performance mainly stems from the models' ability to recognize class labels whenever they are present in the text, and a significantly lower-performing capability of attribute-based zero-shot learning is only observed when class labels are not used; (iii) the number of attributes used can have a significant effect on performance, and can easily cause a large performance drop.
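To make the label-based vs. attribute-based comparison described in the abstract concrete, the following is a minimal sketch of both evaluation modes using a pretrained CLIP model via Hugging Face's `transformers` API. This is not the authors' exact protocol; the class names, attribute lists, prompt templates, and image path are illustrative placeholders standing in for the CUB/AWA2 annotations used in the paper.

```python
# Minimal sketch (assumed setup, not the paper's exact pipeline):
# compare label-based vs. attribute-based zero-shot classification
# with a pretrained CLIP model.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical classes, each with a few attributes; the real benchmarks
# supply much richer per-class attribute annotations.
classes = {
    "cardinal": ["red plumage", "a crested head", "a cone-shaped bill"],
    "blue jay": ["blue plumage", "a white chest", "a black collar"],
}

def classify(image: Image.Image, use_labels: bool) -> str:
    """Score the image against one prompt per class and return the argmax."""
    if use_labels:
        # Label-based prompts: the class name appears in the text.
        prompts = [f"a photo of a {name}" for name in classes]
    else:
        # Attribute-only prompts: the class label is deliberately withheld,
        # the regime where the paper observes much lower performance.
        prompts = [
            "a photo of a bird with " + ", ".join(attrs)
            for attrs in classes.values()
        ]
    inputs = processor(text=prompts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, num_classes)
    return list(classes)[logits.argmax(dim=-1).item()]

image = Image.open("bird.jpg")  # placeholder path
print("label-based:", classify(image, use_labels=True))
print("attribute-based:", classify(image, use_labels=False))
```

Varying how many attributes are joined into the prompt in the attribute-only branch is one simple way to probe finding (iii), the sensitivity of performance to the number of attributes used.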