Paper Title

Multi-modal Alignment using Representation Codebook

Authors

Jiali Duan, Liqun Chen, Son Tran, Jinyu Yang, Yi Xu, Belinda Zeng, Trishul Chilimbi

Abstract

Aligning signals from different modalities is an important step in vision-language representation learning, as it affects the performance of later stages such as cross-modality fusion. Since image and text typically reside in different regions of the feature space, directly aligning them at the instance level is challenging, especially when features are still evolving during training. In this paper, we propose to align at a higher and more stable level using cluster representations. Specifically, we treat image and text as two "views" of the same entity, and encode them into a joint vision-language coding space spanned by a dictionary of cluster centers (codebook). We contrast positive and negative samples via their cluster assignments while simultaneously optimizing the cluster centers. To further smooth out the learning process, we adopt a teacher-student distillation paradigm, where the momentum teacher of one view guides the student learning of the other. We evaluate our approach on common vision-language benchmarks and obtain new SoTA results on zero-shot cross-modality retrieval while remaining competitive on various other transfer tasks.
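
The abstract describes aligning the two modalities through cluster assignments over a shared codebook, with a momentum teacher of one view guiding the student of the other. Below is a minimal, illustrative PyTorch sketch of that general idea (swapped prediction over a learnable codebook with EMA teachers). The encoders, codebook size, temperature, EMA rate, and loss form are assumptions made for illustration only; they are not the paper's actual implementation.

```python
# Hedged sketch: codebook-based cross-modal alignment with momentum teachers.
# All names and hyperparameters here are illustrative assumptions.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class CodebookAligner(nn.Module):
    def __init__(self, feat_dim=256, num_codes=1024, temp=0.1, ema=0.995):
        super().__init__()
        # Toy linear encoders standing in for the real vision/language backbones.
        self.img_enc = nn.Linear(512, feat_dim)
        self.txt_enc = nn.Linear(512, feat_dim)
        # Shared dictionary of cluster centers spanning the joint coding space.
        self.codebook = nn.Linear(feat_dim, num_codes, bias=False)
        # Momentum ("teacher") copies of the encoders, updated only by EMA.
        self.img_teacher = copy.deepcopy(self.img_enc)
        self.txt_teacher = copy.deepcopy(self.txt_enc)
        for p in list(self.img_teacher.parameters()) + list(self.txt_teacher.parameters()):
            p.requires_grad = False
        self.temp, self.ema = temp, ema

    @torch.no_grad()
    def _ema_update(self):
        # Exponential moving average of student weights into the teachers.
        for student, teacher in ((self.img_enc, self.img_teacher),
                                 (self.txt_enc, self.txt_teacher)):
            for ps, pt in zip(student.parameters(), teacher.parameters()):
                pt.data.mul_(self.ema).add_(ps.data, alpha=1 - self.ema)

    def _assign(self, feats):
        # Soft cluster assignments: softmax over similarities to codebook entries.
        feats = F.normalize(feats, dim=-1)
        return F.softmax(self.codebook(feats) / self.temp, dim=-1)

    def forward(self, img, txt):
        # The teacher of one view produces target codes; the student of the
        # other view is trained to predict them (swapped prediction).
        with torch.no_grad():
            tgt_img = self._assign(self.img_teacher(img))
            tgt_txt = self._assign(self.txt_teacher(txt))
        pred_img = self._assign(self.img_enc(img))
        pred_txt = self._assign(self.txt_enc(txt))
        loss = -(tgt_txt * torch.log(pred_img + 1e-6)).sum(-1).mean() \
               - (tgt_img * torch.log(pred_txt + 1e-6)).sum(-1).mean()
        self._ema_update()
        return loss


# Usage with random stand-in features for a batch of image-text pairs.
model = CodebookAligner()
loss = model(torch.randn(8, 512), torch.randn(8, 512))
loss.backward()
```

Contrasting views through cluster assignments, rather than raw instance features, is what lets the codebook act as the "higher and more stable level" of alignment described above; the EMA teachers provide slowly evolving targets that smooth training while the cluster centers are optimized jointly.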
