Making the Most of Text Semantics to Improve Biomedical Vision--Language Processing

Paper Title

Making the Most of Text Semantics to Improve Biomedical Vision--Language Processing

Authors

Benedikt Boecking, Naoto Usuyama, Shruthi Bannur, Daniel C. Castro, Anton Schwaighofer, Stephanie Hyland, Maria Wetscherek, Tristan Naumann, Aditya Nori, Javier Alvarez-Valle, Hoifung Poon, Ozan Oktay

Abstract

Multi-modal data abounds in biomedicine, such as radiology images and reports. Interpreting this data at scale is essential for improving clinical care and accelerating clinical research. Biomedical text with its complex semantics poses additional challenges in vision--language modelling compared to the general domain, and previous work has used insufficiently adapted models that lack domain-specific language understanding. In this paper, we show that principled textual semantic modelling can substantially improve contrastive learning in self-supervised vision--language processing. We release a language model that achieves state-of-the-art results in radiology natural language inference through its improved vocabulary and novel language pretraining objective leveraging semantics and discourse characteristics in radiology reports. Further, we propose a self-supervised joint vision--language approach with a focus on better text modelling. It establishes new state of the art results on a wide range of publicly available benchmarks, in part by leveraging our new domain-specific language model. We release a new dataset with locally-aligned phrase grounding annotations by radiologists to facilitate the study of complex semantic modelling in biomedical vision--language processing. A broad evaluation, including on this new dataset, shows that our contrastive learning approach, aided by textual-semantic modelling, outperforms prior methods in segmentation tasks, despite only using a global-alignment objective.
