Paper Title

A Natural Language Processing Pipeline for Detecting Informal Data References in Academic Literature

Authors

Sara Lafia, Lizhou Fan, Libby Hemphill

Abstract

Discovering authoritative links between publications and the datasets that they use can be a labor-intensive process. We introduce a natural language processing pipeline that retrieves and reviews publications for informal references to research datasets, which complements the work of data librarians. We first describe the components of the pipeline and then apply it to expand an authoritative bibliography linking thousands of social science studies to the data-related publications in which they are used. The pipeline increases recall for literature to review for inclusion in data-related collections of publications and makes it possible to detect informal data references at scale. We contribute (1) a novel Named Entity Recognition (NER) model that reliably detects informal data references and (2) a dataset connecting items from social science literature with datasets they reference. Together, these contributions enable future work on data reference, data citation networks, and data reuse.
