The Effectiveness of Supervised Machine Learning Algorithms in Predicting Software Refactoring

Title

The Effectiveness of Supervised Machine Learning Algorithms in Predicting Software Refactoring

Authors

Maurício Aniche, Erick Maziero, Rafael Durelli, Vinicius Durelli

Abstract

Refactoring is the process of changing the internal structure of software to improve its quality without modifying its external behavior. Empirical studies have repeatedly shown that refactoring has a positive impact on the understandability and maintainability of software systems. However, before carrying out refactoring activities, developers need to identify refactoring opportunities. Currently, refactoring opportunity identification heavily relies on developers' expertise and intuition. In this paper, we investigate the effectiveness of machine learning algorithms in predicting software refactorings. More specifically, we train six different machine learning algorithms (i.e., Logistic Regression, Naive Bayes, Support Vector Machine, Decision Trees, Random Forest, and Neural Network) with a dataset comprising over two million refactorings from 11,149 real-world projects from the Apache, F-Droid, and GitHub ecosystems. The resulting models predict 20 different refactorings at the class, method, and variable levels with an accuracy often higher than 90%. Our results show that (i) Random Forests are the best models for predicting software refactoring, (ii) process and ownership metrics seem to play a crucial role in the creation of better models, and (iii) models generalize well in different contexts.
