Paper Title

Longtonotes: OntoNotes with Longer Coreference Chains

Paper Authors

Kumar Shridhar, Nicholas Monath, Raghuveer Thirukovalluru, Alessandro Stolfo, Manzil Zaheer, Andrew McCallum, Mrinmaya Sachan

Paper Abstract

OntoNotes has served as the most important benchmark for coreference resolution. However, for ease of annotation, several long documents in OntoNotes were split into smaller parts. In this work, we build a corpus of coreference-annotated documents of significantly longer length than what is currently available. We do so by providing an accurate, manually curated merging of annotations from documents that were split into multiple parts in the original OntoNotes annotation process. The resulting corpus, which we call LongtoNotes, contains documents in multiple genres of the English language with varying lengths, the longest of which are up to 8x the length of documents in OntoNotes and 2x those in LitBank. We evaluate state-of-the-art neural coreference systems on this new corpus, analyze how model architecture/hyperparameter choices and document length affect the performance and efficiency of these models, and demonstrate areas of improvement in long-document coreference modeling revealed by our new corpus. Our data and code are available at: https://github.com/kumar-shridhar/LongtoNotes.
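
A minimal sketch of what "document length" and "coreference chain" mean here, assuming the merged documents are distributed in the standard CoNLL-2012 format used for OntoNotes releases (the file name below is hypothetical; consult the linked repository for the actual data layout). It reads the coreference column, reconstructs the chains, and reports each document's token count and longest chain, the quantities the abstract compares across OntoNotes, LitBank, and LongtoNotes.

```python
from collections import defaultdict
from pathlib import Path


def coref_chains(conll_path):
    """Parse a CoNLL-2012-style file and return, per document, its token
    length and its coreference chains as {entity_id: [(start, end), ...]}."""
    docs = {}
    for line in Path(conll_path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line.startswith("#begin document"):
            doc_key = line[len("#begin document"):].strip()
            chains = defaultdict(list)   # entity id -> mention spans
            open_mentions = {}           # entity id -> stack of open start offsets
            tok = 0                      # document-level token offset
        elif line.startswith("#end document"):
            docs[doc_key] = {"length": tok, "chains": dict(chains)}
        elif line:
            coref = line.split()[-1]     # last column holds coreference marks
            if coref != "-":
                for part in coref.split("|"):
                    eid = int(part.strip("()"))
                    if part.startswith("("):          # mention opens at this token
                        open_mentions.setdefault(eid, []).append(tok)
                    if part.endswith(")"):            # mention closes at this token
                        start = open_mentions[eid].pop()
                        chains[eid].append((start, tok))
            tok += 1
    return docs


# Hypothetical file name; see the repository above for the real files.
for key, doc in coref_chains("longtonotes_sample.v4_gold_conll").items():
    longest = max((len(m) for m in doc["chains"].values()), default=0)
    print(f"{key}: {doc['length']} tokens, longest chain has {longest} mentions")
```

The per-entity stack handles nested mentions of the same entity, and blank sentence-separator lines are skipped, so offsets are document-level token positions.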
