Scholarly Knowledge Extraction from Published Software Packages

Paper Title

Scholarly Knowledge Extraction from Published Software Packages

Authors

Muhammad Haris, Markus Stocker, Sören Auer

Abstract

A plethora of scientific software packages are published in repositories, e.g., Zenodo and figshare. These software packages are crucial for the reproducibility of published research. As an additional route to scholarly knowledge graph construction, we propose an approach for the automated extraction of machine-actionable (structured) scholarly knowledge from published software packages by static analysis of their (meta)data and contents (in particular, scripts in languages such as Python). The approach can be summarized as follows. First, we extract metadata information (software description, programming languages, related references) from software packages by leveraging the Software Metadata Extraction Framework (SOMEF) and the GitHub API. Second, we analyze the extracted metadata to find the research articles associated with the corresponding software repository. Third, for software contained in published packages, we create and analyze the Abstract Syntax Tree (AST) representation to extract information about the procedures performed on data. Fourth, we search for the extracted information in the full text of the related articles to constrain it to scholarly knowledge, i.e., information published in the scholarly literature. Finally, we publish the extracted machine-actionable scholarly knowledge in the Open Research Knowledge Graph (ORKG).
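
The third and fourth steps of the abstract can be illustrated with a minimal sketch built on Python's standard `ast` module. This is not the authors' implementation: the helper names (`dotted_name`, `extract_calls`, `mentioned_in_article`), the sample script, the sample article sentence, and the naive substring matching are all assumptions made for illustration only.

```python
import ast

# Hypothetical script standing in for a file found in a published package.
SCRIPT = """
import pandas as pd
from scipy import stats

df = pd.read_csv("measurements.csv")
result = stats.ttest_ind(df["a"], df["b"])
print(result)
"""

# Hypothetical sentence standing in for the full text of a related article.
ARTICLE = "We compared the two groups with an independent t-test (ttest_ind)."


def dotted_name(node):
    """Rebuild a dotted name such as 'pd.read_csv' from Name/Attribute nodes."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        base = dotted_name(node.value)
        return f"{base}.{node.attr}" if base else None
    return None  # e.g. a call on the result of another call; skipped here


def extract_calls(source):
    """Return the sorted, de-duplicated names of functions called in the source."""
    tree = ast.parse(source)  # static analysis only: the script is never executed
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name:
                names.add(name)
    return sorted(names)


def mentioned_in_article(calls, article_text):
    """Keep calls whose final name component occurs in the article text.

    Assumption: a plain case-insensitive substring match; a real pipeline
    would likely need a more careful matching strategy.
    """
    text = article_text.lower()
    return [c for c in calls if c.split(".")[-1].lower() in text]


calls = extract_calls(SCRIPT)
print(calls)
print(mentioned_in_article(calls, ARTICLE))
```

Here the AST walk collects every call found in the script, and the article filter keeps only those that the paper text also mentions, which mirrors the idea of constraining extracted information to what is published in the scholarly literature.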
