Paper Title
TFormer: A throughout fusion transformer for multi-modal skin lesion diagnosis
Paper Authors
Paper Abstract
Multi-modal skin lesion diagnosis (MSLD) has achieved remarkable success with modern computer-aided diagnosis (CAD) technology based on deep convolutions. However, information aggregation across modalities in MSLD remains challenging due to severely unaligned spatial resolutions (e.g., dermoscopic images vs. clinical images) and heterogeneous data (e.g., dermoscopic images vs. patients' meta-data). Limited by the intrinsically local receptive field of convolutions, most recent MSLD pipelines built on pure convolutions struggle to capture representative features in shallow layers, so fusion across modalities is usually performed at the end of the pipeline, often only at the last layer, leading to insufficient information aggregation. To tackle this issue, we introduce a pure transformer-based method, which we refer to as the ``Throughout Fusion Transformer (TFormer)'', for sufficient information integration in MSLD. Unlike existing convolutional approaches, the proposed network leverages a transformer as the feature-extraction backbone, yielding more representative shallow features. We then carefully design a stack of dual-branch hierarchical multi-modal transformer (HMT) blocks to fuse information across different image modalities in a stage-by-stage manner. With the aggregated information of the image modalities, a multi-modal transformer post-fusion (MTP) block is designed to integrate features across image and non-image data. Such a strategy, in which information from the image modalities is fused first and the heterogeneous data afterwards, allows us to better divide and conquer the two major challenges while ensuring that inter-modality dynamics are effectively modeled.
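The abstract describes the architecture only at a high level. Below is a minimal, hypothetical PyTorch sketch of how the described components could be wired together; the module names (HMTBlock, MTPBlock, TFormerSketch), the cross-attention layout, and all tensor shapes are illustrative assumptions made for this sketch, not the authors' implementation.

# Hypothetical sketch of TFormer-style fusion as described in the abstract:
# a transformer token stream per image modality, dual-branch HMT blocks that
# exchange information between dermoscopic and clinical features stage by
# stage, and a final MTP block that folds in the non-image meta-data.
import torch
import torch.nn as nn


class HMTBlock(nn.Module):
    """Dual-branch hierarchical multi-modal transformer block (sketch).

    Each branch attends to its own tokens (self-attention) and queries the
    other modality's tokens (cross-attention), so information is exchanged
    at every stage rather than only at the end of the pipeline.
    """

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.self_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_b = nn.LayerNorm(dim)

    def forward(self, a, b):
        # Intra-modality self-attention with residual connections.
        a = a + self.self_a(a, a, a)[0]
        b = b + self.self_b(b, b, b)[0]
        # Inter-modality cross-attention: each branch queries the other.
        a = self.norm_a(a + self.cross_a(a, b, b)[0])
        b = self.norm_b(b + self.cross_b(b, a, a)[0])
        return a, b


class MTPBlock(nn.Module):
    """Multi-modal transformer post-fusion block (sketch): fused image
    tokens attend over meta-data projected into the same token space."""

    def __init__(self, dim: int, meta_dim: int, heads: int = 4):
        super().__init__()
        self.meta_proj = nn.Linear(meta_dim, dim)  # lift meta-data to token space
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, meta):
        meta_tok = self.meta_proj(meta).unsqueeze(1)  # (B, 1, dim)
        return self.norm(img_tokens + self.cross(img_tokens, meta_tok, meta_tok)[0])


class TFormerSketch(nn.Module):
    def __init__(self, dim: int = 64, meta_dim: int = 16, stages: int = 3,
                 num_classes: int = 8):
        super().__init__()
        self.hmt = nn.ModuleList([HMTBlock(dim) for _ in range(stages)])
        self.mtp = MTPBlock(dim, meta_dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, derm_tokens, clin_tokens, meta):
        # Stage-by-stage fusion of the two image modalities.
        for blk in self.hmt:
            derm_tokens, clin_tokens = blk(derm_tokens, clin_tokens)
        fused = torch.cat([derm_tokens, clin_tokens], dim=1)
        # Post-fusion with heterogeneous (non-image) meta-data.
        fused = self.mtp(fused, meta)
        return self.head(fused.mean(dim=1))  # pool tokens and classify


if __name__ == "__main__":
    model = TFormerSketch()
    derm = torch.randn(2, 49, 64)   # tokenized dermoscopic features
    clin = torch.randn(2, 49, 64)   # tokenized clinical-image features
    meta = torch.randn(2, 16)       # patient meta-data vector
    print(model(derm, clin, meta).shape)  # torch.Size([2, 8])

The sketch mirrors the two-step strategy stated in the abstract: the spatially unaligned image modalities are reconciled first via repeated cross-attention (HMT stages), and only then is the heterogeneous meta-data integrated (MTP), keeping the two fusion problems separate.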