Paper title
Razmecheno: Named Entity Recognition from Digital Archive of Diaries "Prozhito"
Paper authors
Paper abstract
The vast majority of existing datasets for Named Entity Recognition (NER) are built primarily on news, research papers and Wikipedia, with a few exceptions created from historical and literary texts. Moreover, English is the main source of data for further labelling. This paper aims to fill several of these gaps by creating a novel dataset, "Razmecheno", gathered from the Russian-language diary texts of the project "Prozhito". Our dataset is of interest for multiple research lines: literary studies of diary texts, transfer learning from other domains, and low-resource or cross-lingual named entity recognition. Razmecheno comprises 1331 sentences and 14119 tokens, sampled from diaries written during Perestroika. The annotation schema consists of five commonly used entity tags: person, characteristics, location, organisation, and facility. The labelling was carried out on the crowdsourcing platform Yandex.Toloka in two stages. First, workers selected sentences that contain an entity of a particular type. Second, they marked up entity spans. As a result, 1113 entities were obtained. Empirical evaluation of Razmecheno is carried out with off-the-shelf NER tools and by fine-tuning pre-trained contextualized encoders. We release the annotated dataset for open access.