Paper title
BioNerFlair: biomedical named entity recognition using flair embedding and sequence tagger
Authors
Abstract
Motivation: The proliferation of biomedical research articles has made information retrieval more important than ever. Scientists and researchers have difficulty finding articles that contain information relevant to them. Proper extraction of biomedical entities such as diseases, drugs/chemicals, species, and genes/proteins can considerably improve article filtering and, in turn, the extraction of relevant information. Performance on BioNER benchmarks has progressively improved with advances in transformer-based models such as BERT, XLNet, OpenAI GPT-2, etc. These models give excellent results; however, they are computationally expensive, and better scores on domain-specific tasks can be achieved using other contextual string-based models and an LSTM-CRF-based sequence tagger.
Results: We introduce BioNerFlair, a method to train models for biomedical named entity recognition using Flair plus GloVe embeddings and a bidirectional LSTM-CRF-based sequence tagger. With almost the same generic architecture widely used for named entity recognition, BioNerFlair outperforms previous state-of-the-art models. We performed experiments on 8 benchmark datasets for biomedical named entity recognition. Compared to current state-of-the-art models, BioNerFlair achieves the best F1-score on five corpora: 90.17 (versus 84.72) on the BioCreative II gene mention (BC2GM) corpus, 94.03 (versus 92.36) on the BioCreative IV chemical and drug (BC4CHEMD) corpus, 88.73 (versus 78.58) on the JNLPBA corpus, 91.1 (versus 89.71) on the NCBI disease corpus, and 85.48 (versus 78.98) on the Species-800 corpus. Near-best results were observed on the BC5CDR-chem, BC5CDR-disease, and LINNAEUS corpora.
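The F1-scores above are the headline metric, so it may help to see how such a score is typically computed for NER. The abstract does not state the matching criterion; the sketch below assumes exact-match, entity-level scoring over BIO tags, which is a common convention for these benchmarks but is an assumption here. The function names `bio_to_spans` and `entity_f1`, the tag labels, and the toy data are hypothetical and are not taken from the paper.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, entity type)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one sentence's BIO tags."""
    spans: Set[Span] = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        # The current tag continues the open span only if it is I- of the same type.
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            # Assumption: a stray I- tag with no open span starts a new span.
            start, etype = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 with exact span and type match."""
    tp = n_gold = n_pred = 0
    for gold_tags, pred_tags in zip(gold, pred):
        g, p = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
        tp += len(g & p)  # a prediction counts only if boundaries and type match
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example (made-up tags): the tagger finds the gene but misses the disease.
gold = [["B-Gene", "I-Gene", "O", "B-Disease"]]
pred = [["B-Gene", "I-Gene", "O", "O"]]
precision, recall, f1 = entity_f1(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

On the toy data, one of the two gold entities is recovered with no false positives, so precision is 1.0, recall is 0.5, and F1 is 2/3. Scores such as 90.17 in the abstract correspond to this kind of value expressed as a percentage, assuming the same matching convention.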