Paper Title

Caption supervision enables robust learners

Paper Authors

Benjamin Feuer, Ameya Joshi, Chinmay Hegde

Paper Abstract

Vision language (VL) models like CLIP are robust to natural distribution shifts, in part because CLIP learns on unstructured data using a technique called caption supervision; the model interprets image-linked texts as ground-truth labels. In a carefully controlled comparison study, we show that caption-supervised CNNs trained on a standard cross-entropy loss (with image labels assigned by scanning captions for class names) can exhibit greater distributional robustness than VL models trained on the same data. To facilitate future experiments with high-accuracy caption-supervised models, we introduce CaptionNet (https://github.com/penfever/CaptionNet/), which includes a class-balanced, fully supervised dataset with over 50,000 new human-labeled ImageNet-compliant samples, together with web-scraped captions. In a series of experiments on CaptionNet, we show how the choice of loss function, data filtration, and supervision strategy can enable robust computer vision. We also provide the codebase necessary to reproduce our experiments at VL Hub (https://github.com/penfever/vlhub/).
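
To make the labeling step in the abstract concrete, below is a minimal Python sketch of assigning an image label by scanning its caption for class names. The class vocabulary, captions, and word-boundary matching rule are illustrative assumptions, not the authors' exact CaptionNet pipeline.

# Minimal sketch of caption-based label assignment (illustrative only).
# An image inherits the index of the first class name found in its
# web-scraped caption; samples with no match are filtered out.
import re
from typing import Optional

# Hypothetical class vocabulary; in practice this would be the full set
# of ImageNet class names (and possibly their synonyms).
CLASS_NAMES = {
    "golden retriever": 207,
    "tabby cat": 281,
    "school bus": 779,
}

def label_from_caption(caption: str) -> Optional[int]:
    """Return a class index if any class name appears in the caption."""
    text = caption.lower()
    for name, idx in CLASS_NAMES.items():
        # Word-boundary match to avoid partial-word hits.
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            return idx
    return None  # No class name matched: sample is discarded.

if __name__ == "__main__":
    captions = [
        "A golden retriever playing fetch in the park",
        "Vintage school bus parked outside a diner",
        "Sunset over the mountains",  # no class match -> filtered
    ]
    for cap in captions:
        print(cap, "->", label_from_caption(cap))

Under this scheme, training proceeds with a standard cross-entropy loss on the derived labels, rather than a contrastive image-text loss as in CLIP; the comparison between the two is the subject of the paper's controlled study.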
