Paper Title
TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer
Paper Authors
Paper Abstract
In this work, we explore neat yet effective Transformer-based frameworks for visual grounding. Previous methods generally address the core problem of visual grounding, i.e., multi-modal fusion and reasoning, with manually designed mechanisms. Such heuristic designs are not only complicated but also make models easily overfit to specific data distributions. To avoid this, we first propose TransVG, which establishes multi-modal correspondences with Transformers and localizes referred regions by directly regressing box coordinates. We empirically show that complicated fusion modules can be replaced by a simple stack of Transformer encoder layers with higher performance. However, the core fusion Transformer in TransVG is stand-alone from the uni-modal encoders and thus must be trained from scratch on limited visual grounding data, which makes it hard to optimize and leads to sub-optimal performance. To this end, we further introduce TransVG++ with two-fold improvements. On one hand, we upgrade our framework to a purely Transformer-based one by leveraging Vision Transformer (ViT) for vision feature encoding. On the other hand, we devise a Language Conditioned Vision Transformer that removes external fusion modules and reuses the uni-modal ViT for vision-language fusion in its intermediate layers. We conduct extensive experiments on five prevalent datasets and report a series of state-of-the-art records.
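To make the two ideas in the abstract concrete, below is a minimal, self-contained PyTorch sketch: language tokens are fused with visual tokens inside intermediate Transformer encoder layers (rather than in an external fusion module), and the box is obtained by directly regressing four coordinates from a learnable [REG] token. This is an illustrative assumption of how such a design could look, not the authors' implementation; all class names, the concatenation-based fusion, and the token/embedding sizes are hypothetical.

```python
# Illustrative sketch (not the paper's code) of language-conditioned fusion inside
# Transformer layers plus direct box-coordinate regression from a [REG] token.
import torch
import torch.nn as nn


class LanguageConditionedBlock(nn.Module):
    """A ViT-style encoder block whose self-attention also attends to language
    tokens appended to the visual sequence (an assumed fusion scheme)."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, vis_tokens, lang_tokens):
        # Concatenate language tokens so visual tokens can attend to them in-place,
        # i.e., fusion happens inside the encoder layer rather than in a separate module.
        x = torch.cat([vis_tokens, lang_tokens], dim=1)
        y = self.norm1(x)
        x = x + self.attn(y, y, y)[0]
        x = x + self.mlp(self.norm2(x))
        # Return only the (now language-conditioned) visual tokens.
        return x[:, : vis_tokens.size(1)]


class ToyGroundingHead(nn.Module):
    """Prepends a learnable [REG] token to the visual tokens and regresses
    normalized (cx, cy, w, h) coordinates, mirroring the direct-regression idea."""

    def __init__(self, dim=256, depth=2):
        super().__init__()
        self.reg_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.blocks = nn.ModuleList([LanguageConditionedBlock(dim) for _ in range(depth)])
        self.box_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, vis_tokens, lang_tokens):
        reg = self.reg_token.expand(vis_tokens.size(0), -1, -1)
        x = torch.cat([reg, vis_tokens], dim=1)
        for blk in self.blocks:
            x = blk(x, lang_tokens)
        # Box coordinates are regressed directly from the [REG] token.
        return self.box_head(x[:, 0]).sigmoid()


if __name__ == "__main__":
    vis = torch.randn(2, 196, 256)   # e.g., 14x14 patch tokens from a ViT backbone
    lang = torch.randn(2, 20, 256)   # e.g., projected language-model token embeddings
    print(ToyGroundingHead()(vis, lang).shape)  # torch.Size([2, 4])
```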