Paper Title

A Benchmark Study of the Contemporary Toxicity Detectors on Software Engineering Interactions

Authors

Sarker, Jaydeb, Turzo, Asif Kamal, Bosu, Amiangshu

Abstract

Automated filtering of toxic conversations may help an open-source software (OSS) community maintain healthy interactions among project participants. Although several general-purpose tools exist to identify toxic content, they may incorrectly flag words commonly used in the Software Engineering (SE) context (e.g., 'junk', 'kill', and 'dump') as toxic, and vice versa. To address this challenge, the CMU Strudel Lab proposed an SE-specific tool (referred to as 'STRUDEL' hereinafter) that combines the output of the Perspective API with that of a customized version of Stanford's Politeness detector. However, since STRUDEL's evaluation was limited to only 654 SE texts, its practical applicability is unclear. Therefore, this study aims to empirically evaluate STRUDEL as well as four state-of-the-art general-purpose toxicity detectors on large-scale SE datasets. To this end, we empirically developed a rubric to manually label toxic SE interactions. Using this rubric, we manually labeled a dataset of 6,533 code review comments and 4,140 Gitter messages. The results of our analyses suggest significant degradation of all tools' performance on our datasets. The degradation was significantly higher on our dataset of formal SE communication, such as code review, than on our dataset of informal communication, such as Gitter messages. Two of the models from our study showed significant performance improvements during 10-fold cross-validation after we retrained them on our SE datasets. Based on our manual investigation of the incorrectly classified texts, we have identified several recommendations for developing an SE-specific toxicity detector.
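The evaluation protocol described in the abstract — retraining a toxicity classifier on labeled SE text and scoring it with 10-fold cross-validation — can be sketched as follows. This is not the authors' code: the tiny in-line dataset and the TF-IDF + logistic-regression pipeline are illustrative assumptions, chosen only to show the cross-validation setup.

```python
# Minimal sketch of 10-fold cross-validation for a text toxicity
# classifier, assuming a TF-IDF + logistic-regression pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical SE comments labeled 1 (toxic) / 0 (non-toxic). Note the
# SE jargon ('kill', 'dump') that general-purpose detectors may misflag.
texts = [
    "this patch is garbage, learn to code",
    "kill the old process before restarting the daemon",  # jargon, not toxic
    "dump the heap and attach the log, please",           # jargon, not toxic
    "you are an idiot, stop committing junk",
    "nice catch, the off-by-one is fixed now",
    "what a stupid, lazy excuse for a review",
] * 5  # repeat so every fold contains examples of both classes
labels = [1, 0, 0, 1, 0, 1] * 5

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, texts, labels, cv=cv, scoring="f1")
print(f"mean F1 over 10 folds: {scores.mean():.2f}")
```

On a real replication, `texts` and `labels` would come from a manually labeled corpus such as the paper's code review comments and Gitter messages, and `scoring` could also report precision and recall per fold.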
