CSL: A Large-scale Chinese Scientific Literature Dataset

Paper title

CSL: A Large-scale Chinese Scientific Literature Dataset

Authors

Li, Yudong; Zhang, Yuqing; Zhao, Zhe; Shen, Linlin; Liu, Weijie; Mao, Weiquan; Zhang, Hui

Abstract

Scientific literature is a high-quality corpus that supports a wide range of Natural Language Processing (NLP) research. However, existing datasets are centered on English, which restricts the development of Chinese scientific NLP. In this work, we present CSL, a large-scale Chinese Scientific Literature dataset that contains the titles, abstracts, keywords, and academic fields of 396k papers. To our knowledge, CSL is the first scientific document dataset in Chinese. CSL can serve as a Chinese corpus. In addition, its semi-structured data is a natural annotation from which many supervised NLP tasks can be constructed. Based on CSL, we present a benchmark to evaluate the performance of models across scientific-domain tasks, namely summarization, keyword generation, and text classification. We analyze the behavior of existing text-to-text models on these evaluation tasks and reveal the challenges of Chinese scientific NLP, which provides a valuable reference for future research. Data and code are available at https://github.com/ydli-ai/CSL.
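As a rough illustration of how semi-structured records can act as natural annotation, the sketch below derives text-to-text pairs for the three benchmark tasks from a single record. The `Paper` class, the `to_text_pairs` function, and the choice of input and target for each task are assumptions made for this example; they are not taken from the CSL repository, and the sample record is a placeholder rather than real data.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Paper:
    """Hypothetical record holding the four fields listed in the abstract."""
    title: str
    abstract: str
    keywords: List[str]
    field: str


def to_text_pairs(paper: Paper) -> Dict[str, Tuple[str, str]]:
    """Build one (source, target) pair per task from a single record.

    Assumed mapping, for illustration only:
      summarization       : abstract -> title
      keyword_generation  : abstract -> keywords joined by "; "
      text_classification : abstract -> academic field
    """
    return {
        "summarization": (paper.abstract, paper.title),
        "keyword_generation": (paper.abstract, "; ".join(paper.keywords)),
        "text_classification": (paper.abstract, paper.field),
    }


if __name__ == "__main__":
    # Placeholder record, not an actual entry from the dataset.
    sample = Paper(
        title="<title>",
        abstract="<abstract>",
        keywords=["<keyword 1>", "<keyword 2>"],
        field="<academic field>",
    )
    for task, (source, target) in to_text_pairs(sample).items():
        print(f"{task}: {source!r} -> {target!r}")
```

Because every task shares the same source text and differs only in the target field, one record yields several supervised examples without additional labeling.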
