Paper title
PerPaDa: A Persian Paraphrase Dataset based on Implicit Crowdsourcing Data Collection
Paper authors
Paper abstract
In this paper, we introduce PerPaDa, a Persian paraphrase dataset collected from users' input to a plagiarism detection system. As an implicit crowdsourcing experience, we have gathered a large collection of original and paraphrased sentences from Hamtajoo, a Persian plagiarism detection system in which users try to conceal cases of text re-use in their documents by paraphrasing and re-submitting manuscripts for analysis. The compiled dataset contains 2446 instances of paraphrasing. To improve the overall quality of the collected data, heuristics were used to exclude sentences that do not meet the proposed criteria. The introduced corpus is much larger than the available datasets for paraphrase identification in Persian. Moreover, the data is less biased than similar datasets, since the users did not follow fixed predefined rules when generating texts similar to their original inputs.