Deep Active Learning with Crowdsourcing Data for Privacy Policy Classification

Paper Title

Deep Active Learning with Crowdsourcing Data for Privacy Policy Classification

Authors

Wenjun Qiu, David Lie

Abstract

Privacy policies are statements that notify users of the services' data practices. However, few users are willing to read through policy texts due to their length and complexity. While automated tools based on machine learning exist for privacy policy analysis, to achieve high classification accuracy, classifiers need to be trained on a large labeled dataset. Most existing policy corpora are labeled by skilled human annotators, requiring a significant amount of labor hours and effort. In this paper, we leverage active learning and crowdsourcing techniques to develop an automated classification tool named Calpric (Crowdsourcing Active Learning PRIvacy Policy Classifier), which is able to perform annotations equivalent to those done by skilled human annotators with high accuracy while minimizing the labeling cost. Specifically, active learning allows classifiers to proactively select the most informative segments to be labeled. On average, our model is able to achieve the same F1 score using only 62% of the original labeling effort. Calpric's use of active learning also addresses naturally occurring class imbalance in unlabeled privacy policy datasets, as there are many more statements stating the collection of private information than stating the absence of collection. By selecting samples from the minority class for labeling, Calpric automatically creates a more balanced training set.
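
The idea of letting a classifier pick the most informative segments can be illustrated with a minimal sketch of entropy-based uncertainty sampling. The abstract does not say which query strategy Calpric uses, so this is a generic example under assumptions, not the authors' method: the function names and the probability values are hypothetical, and the probabilities stand in for a binary classifier's estimate that a segment states data collection.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a binary prediction with positive-class probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def select_most_informative(probabilities, k):
    """Return the indices of the k unlabeled segments with the highest entropy.

    High entropy means the classifier is least certain about the segment,
    so a human (or crowdsourced) label for it is expected to help the most.
    """
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: binary_entropy(probabilities[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical model outputs for five unlabeled policy segments.
probs = [0.98, 0.51, 0.03, 0.40, 0.95]

# Indices of the two segments to send for labeling next.
query = select_most_informative(probs, k=2)
print(query)
```

In this sketch the segments the classifier is already confident about (probabilities near 0 or 1) are skipped, which is one way an active learner can reduce labeling effort compared with labeling every segment.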
