Experience: Quality Benchmarking of Datasets Used in Software Effort Estimation

Paper Title

Experience: Quality Benchmarking of Datasets Used in Software Effort Estimation

Authors

Bosu, Michael F., MacDonell, Stephen G.

Abstract

Data is a cornerstone of empirical software engineering (ESE) research and practice. Data underpin numerous process and project management activities, including the estimation of development effort and the prediction of the likely location and severity of defects in code. Serious questions have been raised, however, over the quality of the data used in ESE. Data quality problems caused by noise, outliers, and incompleteness have been noted as being especially prevalent. Other quality issues, although also potentially important, have received less attention. In this study, we assess the quality of 13 datasets that have been used extensively in research on software effort estimation. The quality issues considered in this article draw on a taxonomy that we published previously based on a systematic mapping of data quality issues in ESE. Our contributions are as follows: (1) an evaluation of the "fitness for purpose" of these commonly used datasets and (2) an assessment of the utility of the taxonomy in terms of dataset benchmarking. We also propose a template that could be used to both improve the ESE data collection/submission process and to evaluate other such datasets, contributing to enhanced awareness of data quality issues in the ESE community and, in time, the availability and use of higher-quality datasets.
