Paper Title

Models and Datasets for Cross-Lingual Summarisation

Authors

Laura Perez-Beltrachini, Mirella Lapata

Abstract

We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language. The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German, and the methodology for its creation can be applied to several other languages. We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language-aligned Wikipedia titles. We analyse the proposed cross-lingual summarisation task with automatic metrics and validate it with a human study. To illustrate the utility of our dataset, we report experiments with multi-lingual pre-trained models in supervised, zero- and few-shot, and out-of-domain scenarios.
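
The pairing described in the abstract can be illustrated with a minimal sketch. It assumes that the body of the source-language article serves as the document and the lead paragraph of the title-aligned target-language article serves as the summary; that reading, the toy texts, the language codes, and the names ARTICLES and build_instances are all hypothetical and are not taken from the paper's code or data.

```python
from itertools import permutations

# Hypothetical toy data: one title-aligned Wikipedia article per language,
# already split into a lead paragraph and a body (placeholder texts).
ARTICLES = {
    "cs": {"lead": "Czech lead paragraph.", "body": "Czech article body."},
    "en": {"lead": "English lead paragraph.", "body": "English article body."},
    "fr": {"lead": "French lead paragraph.", "body": "French article body."},
    "de": {"lead": "German lead paragraph.", "body": "German article body."},
}


def build_instances(articles):
    """Pair each source-language body with the lead of every other language.

    Assumption: document = source body, summary = target lead paragraph.
    """
    instances = []
    # Ordered pairs of distinct languages give one instance per direction.
    for src, tgt in permutations(sorted(articles), 2):
        instances.append(
            {
                "direction": f"{src}-{tgt}",
                "document": articles[src]["body"],
                "summary": articles[tgt]["lead"],
            }
        )
    return instances


instances = build_instances(ARTICLES)
print(len(instances))  # one instance per direction
for inst in instances[:3]:
    print(inst["direction"], "|", inst["document"], "->", inst["summary"])
```

With four languages, ordered pairs of distinct languages give 4 x 3 = 12 directions, which matches the twelve language pairs and directions stated in the abstract.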
