Paper Title

A Dataset for GitHub Repository Deduplication: Extended Description

Paper Authors

Diomidis Spinellis, Zoe Kotti, Audris Mockus

Paper Abstract

GitHub projects can be easily replicated through the site's fork process or through a Git clone-push sequence. This is a problem for empirical software engineering, because it can lead to skewed results or mistrained machine learning models. We provide a dataset of 10.6 million GitHub projects that are copies of others, and link each record with the project's ultimate parent. The ultimate parents were derived from a ranking along six metrics. The related projects were calculated as the connected components of an 18.2 million node and 12 million edge denoised graph created by directing edges to ultimate parents. The graph was created by filtering out more than 30 hand-picked and 2.3 million pattern-matched clumping projects. Projects that introduced unwanted clumping were identified by repeatedly visualizing shortest path distances between unrelated important projects. Our dataset identified 30 thousand duplicate projects in an existing popular reference dataset of 1.8 million projects. An evaluation of our dataset against another created independently with different methods found a significant overlap, but also differences attributed to the operational definition of what projects are considered as related.
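The core of the method described above is grouping related projects into connected components of a fork/copy graph and then selecting each component's top-ranked member as the ultimate parent. The following is a minimal illustrative sketch of that idea, not the authors' implementation: it uses a single numeric score in place of the paper's six-metric ranking, and the function name and data layout are assumptions for illustration.

```python
from collections import defaultdict


def ultimate_parents(edges, rank):
    """Map every project to the ultimate parent of its component.

    `edges` is a list of (child, parent) copy/fork pairs; `rank` maps a
    project name to a score (a stand-in for the paper's six-metric ranking).
    Projects in the same connected component share one ultimate parent:
    the member with the highest rank.
    """
    parent = {}

    # Union-find with path halving to build connected components.
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for child, proj_parent in edges:
        union(child, proj_parent)

    # Gather each component's members and pick its highest-ranked project.
    components = defaultdict(list)
    for p in parent:
        components[find(p)].append(p)
    result = {}
    for members in components.values():
        top = max(members, key=lambda m: rank.get(m, 0))
        for m in members:
            result[m] = top
    return result
```

For example, two forks of the same repository and a fork-of-a-fork all resolve to the original project, while an unrelated project forms its own component. Denoising (filtering out hand-picked and pattern-matched clumping projects) would happen before the graph is handed to this step.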
