Paper Title
A Unified Framework of Medical Information Annotation and Extraction for Chinese Clinical Text
Paper Authors
Paper Abstract
Medical information extraction consists of a group of natural language processing (NLP) tasks that collaboratively convert clinical text into pre-defined structured formats. Current state-of-the-art (SOTA) NLP models are highly integrated with deep learning techniques and thus require massive annotated linguistic data. This study presents an engineering framework for medical entity recognition, relation extraction and attribute extraction, unified in annotation, modeling and evaluation. Specifically, the annotation scheme is comprehensive and compatible across tasks, especially for medical relations. The resulting annotated corpus includes 1,200 full medical records (or 18,039 broken-down documents) and achieves inter-annotator agreements (IAAs) of 94.53%, 73.73% and 91.98% F1 score for the three tasks. Three task-specific neural network models are developed within a shared structure and enhanced by SOTA NLP techniques, i.e., pre-trained language models. Experimental results show that the system can retrieve medical entities, relations and attributes with F1 scores of 93.47%, 67.14% and 90.89%, respectively. This study, together with our publicly released annotation scheme and code, provides solid and practical engineering experience for developing an integrated medical information extraction system.
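To make the shared-structure design described above concrete, the following is a minimal sketch (not the authors' released code) of one common way to pair a single pre-trained language model encoder with three task-specific heads for entity recognition, relation extraction and attribute extraction. The class name, label counts, encoder checkpoint and pairing scheme for relations are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch: one shared pre-trained encoder, three task-specific heads.
# All names, label counts and the "bert-base-chinese" checkpoint are assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel


class SharedExtractionModel(nn.Module):
    def __init__(self, encoder_name="bert-base-chinese",
                 num_entity_labels=9, num_relation_labels=12,
                 num_attribute_labels=5):
        super().__init__()
        # Shared pre-trained language model encoder.
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # Task-specific heads built on the shared token representations.
        self.entity_head = nn.Linear(hidden, num_entity_labels)        # token-level NER tags
        self.attribute_head = nn.Linear(hidden, num_attribute_labels)  # token-level attribute tags
        # Relation head classifies a concatenated pair of entity representations.
        self.relation_head = nn.Linear(2 * hidden, num_relation_labels)

    def forward(self, input_ids, attention_mask, head_index=None, tail_index=None):
        hidden_states = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        entity_logits = self.entity_head(hidden_states)
        attribute_logits = self.attribute_head(hidden_states)
        relation_logits = None
        if head_index is not None and tail_index is not None:
            # Represent each candidate entity by its first token's hidden state.
            batch = torch.arange(hidden_states.size(0))
            pair = torch.cat([hidden_states[batch, head_index],
                              hidden_states[batch, tail_index]], dim=-1)
            relation_logits = self.relation_head(pair)
        return entity_logits, relation_logits, attribute_logits
```

In such a setup, the encoder parameters are shared (and typically fine-tuned) across the three tasks, while each head is trained with its own task-specific loss; this is one standard way to realize "task-specific models within a shared structure", though the paper's actual architecture may differ in detail.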