Paper Title
A New Dataset for Topic-Based Paragraph Classification in Genocide-Related Court Transcripts
Authors
Abstract
Recent progress in natural language processing has been impressive in many different areas, with transformer-based approaches setting new benchmarks for a wide range of applications. This development has also lowered the barriers for people outside the NLP community to tap into tools and resources applied to a variety of domain-specific applications. The bottleneck, however, remains the lack of annotated gold-standard collections as soon as one's research or professional interest falls outside the scope of what is readily available. One such area is genocide-related research (including the work of experts, such as lawyers, who have a professional interest in accessing, exploring and searching large-scale document collections on the topic). We present GTC (Genocide Transcript Corpus), the first annotated corpus of genocide-related court transcripts, which serves three purposes: (1) to provide a first reference corpus for the community, (2) to establish benchmark performances (using state-of-the-art transformer-based approaches) for the new classification task of paragraph identification of violence-related witness statements, and (3) to explore first steps towards transfer learning within the domain. We consider our contribution to address, in particular, this year's hot topic of Language Technology for All.