Paper Title

Visual Concepts Tokenization

Paper Authors

Tao Yang, Yuwang Wang, Yan Lu, Nanning Zheng

Paper Abstract

Obtaining the human-like perception ability of abstracting visual concepts from concrete pixels has always been a fundamental and important target in machine learning research fields such as disentangled representation learning and scene decomposition. Towards this goal, we propose an unsupervised transformer-based Visual Concepts Tokenization framework, dubbed VCT, to perceive an image as a set of disentangled visual concept tokens, with each concept token corresponding to one type of independent visual concept. In particular, to obtain these concept tokens, we only use cross-attention to extract visual information from the image tokens layer by layer, without self-attention between concept tokens, preventing information leakage across concept tokens. We further propose a Concept Disentangling Loss to encourage different concept tokens to represent independent visual concepts. The cross-attention and the disentangling loss play the roles of induction and mutual exclusion for the concept tokens, respectively. Extensive experiments on several popular datasets verify the effectiveness of VCT on the tasks of disentangled representation learning and scene decomposition. VCT achieves state-of-the-art results by a large margin.
