Paper title

Parallel sequence tagging for concept recognition

Authors

Lenz Furrer, Joseph Cornelius, Fabio Rinaldi

Abstract

Background: Named Entity Recognition (NER) and Normalisation (NEN) are core components of any text-mining system for biomedical texts. In a traditional concept-recognition pipeline, these tasks are combined serially, which is inherently prone to error propagation from NER to NEN. We propose a parallel architecture in which both NER and NEN are modeled as sequence-labeling tasks operating directly on the source text. We examine different harmonisation strategies for merging the predictions of the two classifiers into a single output sequence. Results: We test our approach on the recent Version 4 of the CRAFT corpus. In all 20 annotation sets of the concept-annotation task, our system outperforms the pipeline system reported as a baseline in the CRAFT shared task 2019. Conclusions: Our analysis shows that the strengths of the two classifiers can be combined in a fruitful way. However, prediction harmonisation requires individual calibration on a development set for each annotation set. This calibration makes it possible to achieve a good trade-off between established knowledge (training set) and novel information (unseen concepts). Availability and Implementation: Source code is freely available for download at https://github.com/OntoGene/craft-st. Supplementary data are available at arXiv online.
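To make the idea of prediction harmonisation concrete, here is a minimal sketch in Python. It is not the authors' implementation (see the repository linked above for that); it only illustrates one way two parallel per-token predictions could be merged into a single label sequence. The class `TokenPrediction`, the function `harmonise`, the `UNLINKED` placeholder, the confidence fields, and the single `threshold` parameter are all assumptions made for illustration, and the concept IDs are example values only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TokenPrediction:
    """Hypothetical container for the two parallel predictions on one token."""
    ner_tag: str      # span tagger output: "B", "I", or "O"
    ner_conf: float   # confidence of the span tagger
    nen_label: str    # concept tagger output: a concept ID, or "O" for none
    nen_conf: float   # confidence of the concept tagger


def harmonise(preds: List[TokenPrediction], threshold: float = 0.5) -> List[str]:
    """Merge NER and NEN predictions into one label per token.

    Illustrative rule (an assumption, not the paper's strategy):
    - both taggers see an entity: keep the concept ID;
    - only the concept tagger fires: keep the ID if it is confident enough;
    - only the span tagger fires: emit "UNLINKED" if it is confident enough;
    - otherwise: "O".
    """
    merged = []
    for p in preds:
        ner_entity = p.ner_tag != "O"
        nen_entity = p.nen_label != "O"
        if nen_entity and (ner_entity or p.nen_conf >= threshold):
            merged.append(p.nen_label)
        elif ner_entity and not nen_entity and p.ner_conf >= threshold:
            merged.append("UNLINKED")
        else:
            merged.append("O")
    return merged


# Example input: five tokens with illustrative predictions.
example = [
    TokenPrediction("B", 0.9, "CL:0000540", 0.8),  # both taggers agree
    TokenPrediction("O", 0.6, "GO:0008150", 0.3),  # concept tagger only, low confidence
    TokenPrediction("O", 0.6, "GO:0008150", 0.7),  # concept tagger only, high confidence
    TokenPrediction("B", 0.8, "O", 0.9),           # span tagger only
    TokenPrediction("O", 0.9, "O", 0.9),           # neither tagger fires
]
print(harmonise(example))
```

In a sketch like this, `threshold` stands in for the per-annotation-set calibration mentioned in the abstract: a value tuned on a development set would control how readily one classifier's prediction is accepted when the other disagrees.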
