Paper Title

Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

Paper Authors

Hao Li, Jinguo Zhu, Xiaohu Jiang, Xizhou Zhu, Hongsheng Li, Chun Yuan, Xiaohua Wang, Yu Qiao, Xiaogang Wang, Wenhai Wang, Jifeng Dai

Paper Abstract

Despite the remarkable success of foundation models, their task-specific fine-tuning paradigm makes them inconsistent with the goal of general perception modeling. The key to eliminating this inconsistency is to use generalist models for general task modeling. However, existing attempts at generalist models are inadequate in both versatility and performance. In this paper, we propose Uni-Perceiver v2, which is the first generalist model capable of handling major large-scale vision and vision-language tasks with competitive performance. Specifically, images are encoded as general region proposals, while texts are encoded via a Transformer-based language model. The encoded representations are transformed by a task-agnostic decoder. Different tasks are formulated as a unified maximum likelihood estimation problem. We further propose an improved optimizer to ensure stable multi-task learning with an unmixed sampling strategy, which is helpful for tasks requiring large batch-size training. After being jointly trained on various tasks, Uni-Perceiver v2 is capable of directly handling downstream tasks without any task-specific adaptation. Results show that Uni-Perceiver v2 outperforms all existing generalist models in both versatility and performance. Meanwhile, compared with the commonly-recognized strong baselines that require task-specific fine-tuning, Uni-Perceiver v2 achieves competitive performance on a broad range of vision and vision-language tasks.
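
The abstract states that "different tasks are formulated as a unified maximum likelihood estimation problem." As a minimal sketch of what such a formulation typically looks like in the Uni-Perceiver line of work, the display below writes prediction as selecting the maximum-likelihood target from a task-dependent candidate set, with the likelihood parameterized by representation similarity; the symbols f, g, \mathcal{Y}, and \tau are illustrative assumptions, not the paper's exact notation.

% Sketch of a unified maximum-likelihood formulation (illustrative notation):
% f encodes the input (image regions and/or text tokens after the task-agnostic
% decoder), g encodes candidate targets, \mathcal{Y} is the task-dependent
% candidate set, and \tau is a temperature.
\[
  \hat{y} = \arg\max_{y \in \mathcal{Y}} P(y \mid x),
  \qquad
  P(y \mid x) =
  \frac{\exp\big(\cos(f(x),\, g(y)) / \tau\big)}
       {\sum_{y' \in \mathcal{Y}} \exp\big(\cos(f(x),\, g(y')) / \tau\big)}
\]

Under this view, training minimizes the negative log-likelihood $-\log P(y^{\star} \mid x)$ of the ground-truth target $y^{\star}$ jointly across the various vision and vision-language tasks, which is what allows a single task-agnostic decoder to serve all of them without task-specific adaptation.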
