Paper Title

Learning Realistic Patterns from Unrealistic Stimuli: Generalization and Data Anonymization

Authors

Nikolaidis, Konstantinos, Kristiansen, Stein, Plagemann, Thomas, Goebel, Vera, Liestøl, Knut, Kankanhalli, Mohan, Traaen, Gunn Marit, Øverland, Britt, Akre, Harriet, Aakerøy, Lars, Steinshamn, Sigurd

Abstract

Good training data is a prerequisite to develop useful ML applications. However, in many domains existing data sets cannot be shared due to privacy regulations (e.g., from medical studies). This work investigates a simple yet unconventional approach for anonymized data synthesis to enable third parties to benefit from such private data. We explore the feasibility of learning implicitly from unrealistic, task-relevant stimuli, which are synthesized by exciting the neurons of a trained deep neural network (DNN). As such, neuronal excitation serves as a pseudo-generative model. The stimuli data are used to train new classification models. Furthermore, we extend this framework to inhibit representations that are associated with specific individuals. We use sleep monitoring data from both an open and a large closed clinical study and evaluate whether (1) end-users can create and successfully use customized classification models for sleep apnea detection, and (2) the identity of participants in the study is protected. Extensive comparative empirical investigation shows that different algorithms trained on the stimuli are able to generalize successfully on the same task as the original model. However, architectural and algorithmic similarity between new and original models plays an important role in performance. For similar architectures, the performance is close to that of using the true data (e.g., an accuracy difference of 0.56% and a Kappa coefficient difference of 0.03-0.04). Further experiments show that the stimuli can to a large extent successfully anonymize participants of the clinical studies.
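The core mechanism the abstract describes — exciting neurons of a trained network via gradient ascent to synthesize task-relevant stimuli, then training a new classifier on those stimuli instead of the private data — can be sketched as follows. This is a minimal illustration, not the paper's implementation: a toy linear model stands in for the trained DNN, and all names and hyperparameters (`synthesize_stimulus`, `steps`, `lr`, `l2`) are assumptions made for the example.

```python
import numpy as np

# Toy stand-in for a trained model: weights of a linear classifier
# with 2 classes and 16 input features (illustrative only).
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 16))

def synthesize_stimulus(target_class, steps=200, lr=0.1, l2=0.01):
    """Activation maximization: gradient ascent on the target class
    logit, with a small L2 penalty to keep the stimulus bounded.
    For logit_c(x) = W[c] @ x, the gradient of
    logit_c(x) - (l2/2) * ||x||^2 w.r.t. x is W[c] - l2 * x."""
    x = rng.normal(scale=0.01, size=W.shape[1])
    for _ in range(steps):
        x = x + lr * (W[target_class] - l2 * x)
    return x

# Pseudo-generated training set: (stimulus, label) pairs. The stimuli
# need not look like realistic inputs; a new classification model is
# then trained on these pairs instead of the private originals.
stimuli = [(synthesize_stimulus(c), c) for c in (0, 1) for _ in range(10)]
```

Each synthesized input strongly excites its target output neuron, so a fresh model fit on the `(stimulus, label)` pairs can recover the original decision behavior without ever seeing the private records.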
