Paper Title
OASum: Large-Scale Open Domain Aspect-based Summarization
Paper Authors
Paper Abstract
Aspect- or query-based summarization has recently attracted increasing attention because it can produce differentiated summaries tailored to users' interests. However, current datasets for aspect- or query-based summarization either focus on specific domains, contain relatively few instances, or cover only a small number of aspect types. These limitations hinder further exploration in this direction. In this work, we take advantage of crowd-sourced knowledge on Wikipedia.org and automatically create OASum, a high-quality, large-scale, open-domain aspect-based summarization dataset containing more than 3.7 million instances with around 1 million distinct aspects drawn from 2 million Wikipedia pages. We provide benchmark results on OASum and demonstrate its suitability for diverse aspect-based summary generation. To address data scarcity in specific domains, we also perform zero-shot, few-shot, and fine-tuning experiments on seven downstream datasets. The zero-/few-shot and fine-tuning results show that the model pre-trained on our corpus exhibits strong aspect- or query-focused generation ability compared with the backbone model. Our dataset and pre-trained checkpoints are publicly available.
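To make the task concrete, below is a minimal sketch of aspect-conditioned summary generation with a generic seq2seq backbone. The checkpoint name, the aspect/document separator, and the decoding settings are illustrative assumptions for demonstration only; they are not the released OASum checkpoints or the paper's exact input format.

```python
# Sketch: aspect-based summarization inference with a generic seq2seq model.
# Assumptions (not from the paper): the backbone checkpoint, the "aspect | document"
# input convention, and the decoding hyperparameters.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "facebook/bart-large-cnn"  # placeholder backbone; swap in an OASum-pretrained checkpoint if available
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

aspect = "Early life"  # target aspect, e.g., a Wikipedia section title
document = (
    "Jane Doe is a fictional physicist. She was born in 1950 and grew up in a "
    "small town before studying physics at university and later leading a lab."
)

# One common way to condition generation on an aspect is to prepend it to the
# source document with a separator, so the model summarizes only that aspect.
inputs = tokenizer(f"{aspect} | {document}",
                   return_tensors="pt", truncation=True, max_length=1024)
summary_ids = model.generate(**inputs, num_beams=4, max_length=128)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```

In a zero-shot setting, the pre-trained model is applied to a downstream document as above without further training; in few-shot or fine-tuning settings, the same (aspect, document, summary) triples would instead be used to continue training the backbone.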