Paper Title

Using mutation testing to measure behavioural test diversity

Authors

Francisco Gomes de Oliveira Neto, Felix Dobslaw, Robert Feldt

Abstract

Diversity has been proposed as a key criterion to improve testing effectiveness and efficiency. It can be used to optimise large test repositories, but also to visualise test maintenance issues and raise practitioners' awareness about waste in test artefacts and processes. Even though these diversity-based testing techniques aim to exercise diverse behaviour in the system under test (SUT), diversity has mainly been measured on and between artefacts (e.g., inputs, outputs or test scripts). Here, we introduce a family of measures to capture the behavioural diversity (b-div) of test cases by comparing their executions and failure outcomes. Using failure information to capture SUT behaviour has been shown to improve the effectiveness of history-based test prioritisation approaches. However, history-based techniques require reliable test execution logs, which are often not available or can be difficult to obtain due to flaky tests, scarcity of test executions, etc. To be generally applicable, we instead propose to use mutation testing to measure behavioural diversity by running the set of test cases on various mutated versions of the SUT. Concretely, we propose two specific b-div measures (based on accuracy and the Matthews correlation coefficient, respectively) and compare them with artefact-based diversity (a-div) for prioritising the test suites of six different open-source projects. Our results show that our b-div measures outperform a-div and random selection in all of the studied projects. The improvement is substantial, with an average increase in the average percentage of faults detected (APFD) of between 19% and 31%, depending on the size of the subset of prioritised tests.
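For illustration, the sketch below shows one way such behavioural-diversity measures could be computed, assuming each test case is summarised as a boolean "kill vector" over the set of mutants (True where the test kills a mutant). The function and variable names are illustrative, not taken from the paper, and the authors' exact definitions and normalisations may differ; the standard APFD formula is included for reference.

```python
import math

def accuracy_similarity(a, b):
    """Fraction of mutants on which two tests behave the same
    (both kill the mutant, or both let it survive)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def mcc_similarity(a, b):
    """Matthews correlation coefficient between two kill vectors,
    treating one test's outcomes as a binary classification of the other's."""
    tp = sum(x and y for x, y in zip(a, b))              # both kill
    tn = sum(not x and not y for x, y in zip(a, b))      # both survive
    fp = sum(not x and y for x, y in zip(a, b))          # only b kills
    fn = sum(x and not y for x, y in zip(a, b))          # only a kills
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def apfd(first_detection_positions, n_tests, n_faults):
    """Standard APFD: 1 - (sum over faults of the position of the first
    detecting test) / (n * m) + 1 / (2n)."""
    return (1 - sum(first_detection_positions) / (n_tests * n_faults)
            + 1 / (2 * n_tests))

# Two tests' outcomes over five mutants (True = mutant killed).
t1 = [True, True, False, False, True]
t2 = [True, False, False, True, True]
print(1 - accuracy_similarity(t1, t2))  # accuracy-based behavioural distance
print(1 - mcc_similarity(t1, t2))       # MCC-based behavioural distance
```

In a prioritisation setting, pairwise distances like these would feed a greedy or clustering-based selection that repeatedly picks the test most behaviourally distant from those already chosen; the paper evaluates the resulting orderings with APFD.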
