Paper Title
Finetuning BERT on Partially Annotated NER Corpora
Paper Authors
Paper Abstract
Most Named Entity Recognition (NER) models operate under the assumption that training datasets are fully labelled. While this assumption holds for established datasets such as CoNLL 2003 and OntoNotes, it is sometimes infeasible to obtain complete annotations, for instance after entities have been annotated selectively to reduce cost. This work presents an approach to finetuning BERT on such partially labelled datasets using self-supervision and label preprocessing. Our approach outperforms the previous LSTM-based label-preprocessing baseline, significantly improving performance on poorly labelled datasets. We demonstrate that finetuning RoBERTa with our approach on the CoNLL 2003 dataset with only 10% of entities labelled is enough to match the performance of the baseline trained on the same dataset with 50% of entities labelled.
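The abstract does not spell out the self-supervision or label-preprocessing steps, so the following is only a minimal sketch of one ingredient such a setup typically relies on: when finetuning a BERT-style token classifier on partially annotated data, unannotated tokens can be assigned the ignore index (-100) so they contribute neither as entities nor as "O" to the cross-entropy loss. The model name, toy label set, and example sentence are illustrative assumptions, not the paper's actual configuration.

```python
# Sketch: masking unannotated tokens out of the loss when finetuning a
# RoBERTa token classifier on partially labelled NER data (assumed setup,
# not the paper's exact method).
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "roberta-base"        # the paper finetunes RoBERTa
LABELS = ["O", "B-PER", "I-PER"]   # toy label set for illustration
IGNORE_INDEX = -100                # skipped by PyTorch cross-entropy by default

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, add_prefix_space=True)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME, num_labels=len(LABELS)
)

# Toy sentence: only "John" was annotated by the selective annotator; the
# remaining words are unlabelled, so their gold labels are unknown (not "O").
words = ["John", "visited", "Berlin", "yesterday"]
word_labels = [1, IGNORE_INDEX, IGNORE_INDEX, IGNORE_INDEX]  # 1 == B-PER

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

# Propagate word-level labels to subword tokens; special tokens and
# unannotated words both get the ignore index.
labels = [
    IGNORE_INDEX if word_id is None else word_labels[word_id]
    for word_id in enc.word_ids(batch_index=0)
]
labels = torch.tensor([labels])

# Positions labelled -100 are excluded from the loss, so unannotated tokens
# do not push the model toward predicting "O" for missing entities.
outputs = model(**enc, labels=labels)
print(outputs.loss)
```

In practice, a self-training loop could then use the model's confident predictions on the ignored positions as pseudo-labels for further finetuning, which is one plausible reading of the self-supervision mentioned above.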