Entity-to-Text based Data Augmentation for various Named Entity Recognition Tasks

Paper Title

Entity-to-Text based Data Augmentation for various Named Entity Recognition Tasks

Authors

Xuming Hu, Yong Jiang, Aiwei Liu, Zhongqiang Huang, Pengjun Xie, Fei Huang, Lijie Wen, Philip S. Yu

Abstract

Data augmentation techniques have been used to alleviate the scarcity of labeled data in various NER tasks (flat, nested, and discontinuous NER). Existing augmentation techniques either manipulate words in the original text, which breaks its semantic coherence, or rely on generative models that do not preserve the entities of the original text, which impedes the use of augmentation on nested and discontinuous NER tasks. In this work, we propose a novel Entity-to-Text based data augmentation technique named EnTDA, which adds, deletes, replaces, or swaps entities in the entity list of the original text and uses the augmented entity lists to generate semantically coherent and entity-preserving texts for various NER tasks. Furthermore, we introduce diversity beam search to increase diversity during text generation. Experiments on thirteen NER datasets across three tasks (flat, nested, and discontinuous NER) and two settings (full-data and low-resource) show that EnTDA can bring greater performance improvements than the baseline augmentation techniques.
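The four entity-list operations named in the abstract (add, delete, replace, swap) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how candidate entities are chosen, so the `(text, type)` representation, the candidate `pool`, the same-type constraint on replacement, and all function names below are assumptions made for illustration only. The text generation step that turns an augmented entity list back into a sentence is not shown.

```python
import random
from typing import List, Tuple

# Hypothetical representation: an entity is (surface text, entity type).
Entity = Tuple[str, str]


def add_entity(entities: List[Entity], pool: List[Entity],
               rng: random.Random) -> List[Entity]:
    """Return a copy with one pool entity inserted at a random position."""
    out = list(entities)
    out.insert(rng.randint(0, len(out)), rng.choice(pool))
    return out


def delete_entity(entities: List[Entity], rng: random.Random) -> List[Entity]:
    """Return a copy with one randomly chosen entity removed."""
    out = list(entities)
    if out:
        out.pop(rng.randrange(len(out)))
    return out


def replace_entity(entities: List[Entity], pool: List[Entity],
                   rng: random.Random) -> List[Entity]:
    """Return a copy with one entity replaced by a pool entity of the same type.

    Restricting replacements to the same type is an assumption of this sketch.
    """
    out = list(entities)
    if not out:
        return out
    i = rng.randrange(len(out))
    candidates = [e for e in pool if e[1] == out[i][1] and e != out[i]]
    if candidates:
        out[i] = rng.choice(candidates)
    return out


def swap_entities(entities: List[Entity], rng: random.Random) -> List[Entity]:
    """Return a copy with the positions of two distinct entities exchanged."""
    out = list(entities)
    if len(out) >= 2:
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out
```

Each function returns a new list and leaves its input untouched, so several augmented variants can be derived from the same original entity list. Passing a seeded `random.Random` instance keeps the augmentation reproducible.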
