Paper title

A large dataset of software mentions in the biomedical literature

Authors

Ana-Maria Istrate, Donghui Li, Dario Taraborelli, Michaela Torkar, Boris Veytsman, Ivana Williams

Abstract

We describe the CZ Software Mentions dataset, a new dataset of software mentions in biomedical papers. Plain-text software mentions are extracted with a trained SciBERT model from several sources: the NIH PubMed Central collection and papers provided by various publishers to the Chan Zuckerberg Initiative. The dataset provides sources, context and metadata, and, for a number of mentions, the disambiguated software entities and links. We extract 1.12 million unique string software mentions from 2.4 million papers in the NIH PMC-OA Commercial subset, 481k unique mentions from the NIH PMC-OA Non-Commercial subset (both gathered in October 2021) and 934k unique mentions from 3 million papers in the Publishers' collection. There is variation in how software is mentioned in papers and extracted by the NER algorithm. We propose a clustering-based disambiguation algorithm to map plain-text software mentions into distinct software entities and apply it to the NIH PubMed Central Commercial collection. Through this methodology, we disambiguate 1.12 million unique strings extracted by the NER model into 97,600 unique software entities, covering 78% of all software-paper links. We link 185,000 of the mentions to a repository, covering about 55% of all software-paper links. We describe in detail the process of building the datasets, disambiguating and linking the software mentions, as well as opportunities and challenges that come with a dataset of this size. We make all data and code publicly available as a new resource to help assess the impact of software (in particular scientific open source projects) on science.
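
To illustrate why disambiguation is needed, the following is a minimal sketch of how spelling variants of one software name can collapse into a single entity. It is not the authors' clustering algorithm, which the abstract does not specify: it substitutes a much simpler technique, grouping strings by a normalized key. The functions `normalize` and `cluster_mentions` and the sample mention strings are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict


def normalize(mention: str) -> str:
    """Hypothetical normalization key: lowercase and drop non-alphanumerics."""
    return re.sub(r"[^a-z0-9]", "", mention.lower())


def cluster_mentions(mentions):
    """Group raw mention strings that share the same normalized key.

    This is a stand-in for a real clustering step, not the paper's method.
    """
    clusters = defaultdict(set)
    for mention in mentions:
        clusters[normalize(mention)].add(mention)
    return dict(clusters)


# Illustrative mention strings (assumed, not taken from the dataset).
mentions = [
    "ImageJ", "imagej", "Image J", "image-j",
    "SPSS", "spss",
    "GraphPad Prism",
]

clusters = cluster_mentions(mentions)
for key, variants in sorted(clusters.items()):
    print(key, sorted(variants))
```

In this toy input, seven raw strings reduce to three candidate entities. A key-based grouping like this misses synonyms, abbreviations, and misspellings, which is presumably part of why a dedicated clustering approach is used at the scale the abstract describes.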
