Paper Title
Multi-Granularity Cross-modal Alignment for Generalized Medical Visual Representation Learning
Paper Authors
Paper Abstract
Learning medical visual representations directly from paired radiology reports has become an emerging topic in representation learning. However, existing medical image-text joint learning methods are limited to instance-level or local supervision, ignoring disease-level semantic correspondences. In this paper, we present a novel Multi-Granularity Cross-modal Alignment (MGCA) framework for generalized medical visual representation learning by harnessing the naturally exhibited semantic correspondences between medical images and radiology reports at three different levels, i.e., pathological region-level, instance-level, and disease-level. Specifically, we first incorporate an instance-wise alignment module by maximizing the agreement between image-report pairs. Further, for token-wise alignment, we introduce a bidirectional cross-attention strategy to explicitly learn the matching between fine-grained visual tokens and text tokens, followed by contrastive learning to align them. More importantly, to leverage the high-level inter-subject relationship semantic (e.g., disease) correspondences, we design a novel cross-modal disease-level alignment paradigm to enforce cross-modal cluster assignment consistency. Extensive experimental results on seven downstream medical image datasets covering image classification, object detection, and semantic segmentation tasks demonstrate the stable and superior performance of our framework.
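As a rough illustration of the three objectives named in the abstract, the sketch below implements an instance-wise contrastive alignment, a one-directional token-level cross-attention alignment, and a simplified cross-modal prototype (cluster) consistency loss in PyTorch. The class name, embedding dimensions, temperature, the symmetric InfoNCE formulation, and the softmax-based cluster assignments are assumptions made for illustration only; this is not the authors' released MGCA implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiGranularityAlignment(nn.Module):
    """Minimal sketch of the three alignment objectives described in the abstract.
    Encoders, projection dims, and loss details are assumptions, not the paper's code."""

    def __init__(self, dim=128, num_prototypes=500, temperature=0.07):
        super().__init__()
        # Learnable prototypes used for the disease-level (cluster) alignment.
        self.prototypes = nn.Linear(dim, num_prototypes, bias=False)
        self.temperature = temperature

    def instance_alignment(self, img_emb, txt_emb):
        # Instance-wise alignment: symmetric InfoNCE over paired image/report embeddings.
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        logits = img_emb @ txt_emb.t() / self.temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    def token_alignment(self, vis_tokens, txt_tokens):
        # Token-wise alignment: cross-attention matches each text token to visual tokens,
        # then a contrastive loss pulls each token toward its attended counterpart.
        # The paper describes a bidirectional strategy; only text-to-visual is sketched here.
        attn = torch.softmax(
            txt_tokens @ vis_tokens.transpose(-1, -2) / self.temperature, dim=-1
        )
        attended = attn @ vis_tokens                       # (B, T_txt, D) visual context per text token
        txt = F.normalize(txt_tokens, dim=-1)
        ctx = F.normalize(attended, dim=-1)
        sim = txt @ ctx.transpose(-1, -2) / self.temperature   # (B, T_txt, T_txt), in-report negatives
        targets = torch.arange(sim.size(1), device=sim.device).expand(sim.size(0), -1)
        return F.cross_entropy(sim.reshape(-1, sim.size(-1)), targets.reshape(-1))

    def disease_alignment(self, img_emb, txt_emb):
        # Disease-level alignment: the soft cluster assignment of one modality supervises
        # the prototype scores of the other (cross-modal assignment consistency).
        # Simplification: softmax targets with detach; methods of this kind typically use
        # a balanced assignment step (e.g., Sinkhorn-Knopp) to avoid collapse.
        img_scores = self.prototypes(F.normalize(img_emb, dim=-1))
        txt_scores = self.prototypes(F.normalize(txt_emb, dim=-1))
        img_assign = torch.softmax(img_scores / self.temperature, dim=-1).detach()
        txt_assign = torch.softmax(txt_scores / self.temperature, dim=-1).detach()
        loss_i = -(txt_assign * F.log_softmax(img_scores / self.temperature, dim=-1)).sum(-1).mean()
        loss_t = -(img_assign * F.log_softmax(txt_scores / self.temperature, dim=-1)).sum(-1).mean()
        return 0.5 * (loss_i + loss_t)

# Hypothetical usage: in practice the embeddings and tokens would come from separate
# image and text encoders trained jointly with these losses.
model = MultiGranularityAlignment(dim=128)
img_emb, txt_emb = torch.randn(8, 128), torch.randn(8, 128)
vis_tokens, txt_tokens = torch.randn(8, 49, 128), torch.randn(8, 32, 128)
loss = (model.instance_alignment(img_emb, txt_emb)
        + model.token_alignment(vis_tokens, txt_tokens)
        + model.disease_alignment(img_emb, txt_emb))
```

The three losses operate at increasing semantic granularity: pairs of samples, tokens within a pair, and clusters over the whole dataset; summing them reflects the multi-granularity design the abstract describes, though the actual weighting and assignment procedure are not specified here.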