Paper Title
Weakly supervised one-stage vision and language disease detection using large scale pneumonia and pneumothorax studies
Paper Authors
Paper Abstract
Detecting clinically relevant objects in medical images is a challenge despite large datasets, due to the lack of detailed labels. To address the label issue, we utilize scene-level labels with a detection architecture that incorporates natural language information. We present a challenging new set of radiologist-paired bounding box and natural language annotations on the publicly available MIMIC-CXR dataset, especially focused on pneumonia and pneumothorax. Along with the dataset, we present a joint vision language weakly supervised transformer layer-selected one-stage dual head detection architecture (LITERATI) alongside strong baseline comparisons with class activation mapping (CAM), gradient CAM, and relevant implementations on the NIH ChestXray-14 and MIMIC-CXR datasets. Borrowing from advances in vision language architectures, the LITERATI method demonstrates joint image and referring expression (objects localized in the image using natural language) input for detection that scales in a purely weakly supervised fashion. The architectural modifications address three obstacles: implementing a supervised vision and language detection method in a weakly supervised fashion, incorporating clinical referring expression natural language information, and generating high fidelity detections with map probabilities. Nevertheless, the challenging clinical nature of the radiologist annotations, including subtle references, multi-instance specifications, and relatively verbose underlying medical reports, ensures that the vision language detection task at scale remains stimulating for future investigation.
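To make the CAM baseline and the map-to-box step concrete, below is a minimal sketch in PyTorch. The backbone choice (resnet18), the layer names (layer4, fc), and the 0.5 threshold are illustrative assumptions, not the paper's implementation; the LITERATI architecture itself is a transformer-based vision language model and is not reproduced here.

```python
# Minimal sketch of a CAM-style weakly supervised localization baseline:
# weight the final conv feature maps by the classifier weights for one
# class, upsample to image resolution, then derive a box from the map.
# Backbone, layer names, and threshold are assumptions for illustration.
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet18(weights=None)  # stand-in for a CXR-finetuned backbone
model.eval()

def class_activation_map(image, class_idx):
    """Compute a normalized CAM for one class over a (1, 3, H, W) image."""
    feats = {}
    def hook(_, __, output):
        feats["maps"] = output  # (1, C, h, w) final conv feature maps
    handle = model.layer4.register_forward_hook(hook)
    with torch.no_grad():
        model(image)
    handle.remove()
    fmap = feats["maps"].squeeze(0)       # (C, h, w)
    weights = model.fc.weight[class_idx]  # (C,) classifier weights
    cam = torch.einsum("c,chw->hw", weights, fmap)
    cam = F.relu(cam)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return F.interpolate(cam[None, None], size=image.shape[-2:],
                         mode="bilinear", align_corners=False)[0, 0]

def map_to_box(cam, threshold=0.5):
    """Threshold the normalized map and return the tight bounding box
    (x1, y1, x2, y2) around all above-threshold pixels, or None."""
    ys, xs = torch.nonzero(cam >= threshold, as_tuple=True)
    if len(xs) == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

image = torch.randn(1, 3, 224, 224)  # stand-in for a chest X-ray tensor
cam = class_activation_map(image, class_idx=0)
print(map_to_box(cam))
```

Note that a single tight box around all above-threshold pixels conflates multiple findings; a multi-instance variant would label connected components in the thresholded map (e.g., with scipy.ndimage.label) and emit one box per component, which matters for the multi-instance specifications the abstract highlights.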