Paper Title

VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment

Paper Authors

Shraman Pramanick, Li Jing, Sayan Nag, Jiachen Zhu, Hardik Shah, Yann LeCun, Rama Chellappa

Paper Abstract

Vision-language pre-training (VLP) has recently proven highly effective for various uni- and multi-modal downstream applications. However, most existing end-to-end VLP methods use high-resolution image-text box data to perform well on fine-grained region-level tasks, such as object detection, segmentation, and referring expression comprehension. Unfortunately, such high-resolution images with accurate bounding box annotations are expensive to collect and use for supervision at scale. In this work, we propose VoLTA (Vision-Language Transformer with weakly-supervised local-feature Alignment), a new VLP paradigm that only utilizes image-caption data but achieves fine-grained region-level image understanding, eliminating the use of expensive box annotations. VoLTA adopts graph optimal transport-based weakly-supervised alignment on local image patches and text tokens to germinate an explicit, self-normalized, and interpretable low-level matching criterion. In addition, VoLTA pushes multi-modal fusion deep into the uni-modal backbones during pre-training and removes fusion-specific transformer layers, further reducing memory requirements. Extensive experiments on a wide range of vision- and vision-language downstream tasks demonstrate the effectiveness of VoLTA on fine-grained applications without compromising the coarse-grained downstream performance, often outperforming methods using significantly more caption and box annotations.
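
The abstract's central technical idea, weakly-supervised optimal-transport alignment between local image patches and text tokens, can be illustrated with a minimal sketch. This is not the authors' implementation: the Sinkhorn iteration, uniform marginals, feature dimensions, and function names below are illustrative assumptions.

```python
# Minimal sketch (assumed, not VoLTA's actual code) of optimal-transport
# alignment between image patch features and text token features.
import torch

def sinkhorn(cost, n_iters=50, eps=0.1):
    """Entropy-regularized OT: returns a soft transport plan for `cost`."""
    K = torch.exp(-cost / eps)                    # (n_patches, n_tokens)
    u = torch.ones(cost.size(0)) / cost.size(0)   # uniform patch marginal (assumed)
    v = torch.ones(cost.size(1)) / cost.size(1)   # uniform token marginal (assumed)
    a, b = torch.ones_like(u), torch.ones_like(v)
    for _ in range(n_iters):                      # alternating scaling updates
        a = u / (K @ b)
        b = v / (K.t() @ a)
    return a.unsqueeze(1) * K * b.unsqueeze(0)    # transport plan T

def alignment_loss(patch_feats, token_feats):
    """OT matching cost between L2-normalized patches and tokens."""
    patch_feats = torch.nn.functional.normalize(patch_feats, dim=-1)
    token_feats = torch.nn.functional.normalize(token_feats, dim=-1)
    cost = 1.0 - patch_feats @ token_feats.t()    # cosine distance matrix
    T = sinkhorn(cost)
    return (T * cost).sum()                       # soft patch-token matching loss

# Example: 196 ViT patches vs. 32 caption tokens, 256-d embeddings (assumed sizes).
loss = alignment_loss(torch.randn(196, 256), torch.randn(32, 256))
```

Because the transport plan is a joint probability matrix over patch-token pairs, the resulting matches are self-normalized and directly interpretable, which is the property the abstract highlights.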
