Unsupervised Text Deidentification

Paper Title

Unsupervised Text Deidentification

Authors

Morris, John X., Chiu, Justin T., Zabih, Ramin, Rush, Alexander M.

Abstract

Deidentification seeks to anonymize textual data prior to distribution. Automatic deidentification primarily uses supervised named entity recognition from human-labeled data points. We propose an unsupervised deidentification method that masks words that leak personally-identifying information. The approach utilizes a specially trained reidentification model to identify individuals from redacted personal documents. Motivated by K-anonymity based privacy, we generate redactions that ensure a minimum reidentification rank for the correct profile of the document. To evaluate this approach, we consider the task of deidentifying Wikipedia Biographies, and evaluate using an adversarial reidentification metric. Compared to a set of unsupervised baselines, our approach deidentifies documents more completely while removing fewer words. Qualitatively, we see that the approach eliminates many identifying aspects that would fall outside of the common named entity based approach.
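The core loop described in the abstract (mask words until the correct profile no longer ranks near the top for a reidentification model) can be illustrated with a toy sketch. This is not the authors' implementation: the word-overlap scorer stands in for their specially trained reidentification model, and the greedy selection rule, the tie handling in the rank, and all names (`overlap_score`, `true_rank`, `margin`, `deidentify`, `PROFILES`, `DOC`) are assumptions made for illustration only.

```python
from typing import Dict, List, Set

MASK = "<mask>"


def overlap_score(tokens: List[str], profile_words: Set[str]) -> int:
    """Toy reidentification score: unmasked tokens that also occur in the profile."""
    return sum(1 for t in tokens if t != MASK and t.lower() in profile_words)


def true_rank(tokens: List[str], profiles: Dict[str, Set[str]], true_id: str) -> int:
    """Number of profiles scoring at least as high as the true one (itself included).

    Assumption: ties count in favour of privacy, so a fully masked
    document gives the true profile the worst possible rank.
    """
    target = overlap_score(tokens, profiles[true_id])
    return sum(1 for words in profiles.values()
               if overlap_score(tokens, words) >= target)


def margin(tokens: List[str], profiles: Dict[str, Set[str]], true_id: str, k: int) -> int:
    """True-profile score minus the (k-1)-th best competing score."""
    target = overlap_score(tokens, profiles[true_id])
    others = sorted((overlap_score(tokens, words)
                     for pid, words in profiles.items() if pid != true_id),
                    reverse=True)
    return target - others[k - 2]


def deidentify(tokens: List[str], profiles: Dict[str, Set[str]],
               true_id: str, k: int) -> List[str]:
    """Greedily mask tokens until the true profile's rank is at least k."""
    if not 2 <= k <= len(profiles):
        raise ValueError("k must be between 2 and the number of profiles")
    tokens = list(tokens)  # work on a copy
    while true_rank(tokens, profiles, true_id) < k:
        candidates = [i for i, t in enumerate(tokens) if t != MASK]

        def margin_if_masked(i: int) -> int:
            trial = tokens[:i] + [MASK] + tokens[i + 1:]
            return margin(trial, profiles, true_id, k)

        # Mask the token whose removal shrinks the margin the most
        # (the earliest token wins ties).
        tokens[min(candidates, key=margin_if_masked)] = MASK
    return tokens


# Hypothetical profile collection and document, for illustration only.
PROFILES = {
    "ada": {"ada", "lovelace", "mathematician", "london", "english"},
    "alan": {"alan", "turing", "mathematician", "london", "english"},
    "grace": {"grace", "hopper", "mathematician", "american"},
}
DOC = "Ada Lovelace was an English mathematician born in London".split()

print(" ".join(deidentify(DOC, PROFILES, "ada", k=2)))
```

In this toy setting only the two name tokens end up masked, because once they are removed a second profile matches the document equally well. A learned reidentification model, as the abstract describes, could also flag identifying words that are not named entities.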
