Addressing Missing Sources with Adversarial Support-Matching

Paper title

Addressing Missing Sources with Adversarial Support-Matching

Authors

Thomas Kehrenberg, Myles Bartlett, Viktoriia Sharmanska, Novi Quadrianto

Abstract

When trained on diverse labeled data, machine learning models have proven themselves to be a powerful tool in all facets of society. However, due to budget limitations, deliberate or non-deliberate censorship, and other problems during data collection and curation, the labeled training set might exhibit a systematic shortage of data for certain groups. We investigate a scenario in which the absence of certain data is linked to the second level of a two-level hierarchy in the data. Inspired by the idea of protected groups from algorithmic fairness, we refer to the partitions carved by this second level as "subgroups"; we refer to combinations of subgroups and classes, or leaves of the hierarchy, as "sources". To characterize the problem, we introduce the concept of classes with incomplete subgroup support. The representational bias in the training set can give rise to spurious correlations between the classes and the subgroups which render standard classification models ungeneralizable to unseen sources. To overcome this bias, we make use of an additional, diverse but unlabeled dataset, called the "deployment set", to learn a representation that is invariant to subgroup. This is done by adversarially matching the support of the training and deployment sets in representation space. In order to learn the desired invariance, it is paramount that the sets of samples observed by the discriminator are balanced by class; this is easily achieved for the training set, but requires using semi-supervised clustering for the deployment set. We demonstrate the effectiveness of our method with experiments on several datasets and variants of the problem.
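The abstract stresses that the sample sets shown to the discriminator must be balanced by class. As a rough illustration of that one requirement only, the sketch below draws a class-balanced batch of indices from an imbalanced label list. The function name `class_balanced_batch`, the toy labels, and the choice to sample with replacement are assumptions made for this example and are not taken from the paper. For the deployment set, which is unlabeled, the `labels` argument would presumably be cluster assignments from the semi-supervised clustering step, which is not shown here.

```python
import random
from collections import Counter, defaultdict


def class_balanced_batch(labels, per_class, rng):
    """Return indices containing exactly `per_class` draws for each class.

    Hypothetical helper: sampling is done with replacement so that rare
    classes can still fill their quota.
    """
    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # Draw the same number of indices from every class.
    batch = []
    for label in sorted(by_class):
        batch.extend(rng.choices(by_class[label], k=per_class))
    return batch


# Toy, heavily imbalanced labels (assumed data, for illustration only).
rng = random.Random(0)
train_labels = [0] * 90 + [1] * 10
batch = class_balanced_batch(train_labels, per_class=8, rng=rng)
counts = Counter(train_labels[i] for i in batch)
```

Here `counts` holds eight draws for each of the two classes even though class 1 makes up only a tenth of the toy labels. A real pipeline would more likely wrap this logic in a data-loader sampler than build index lists by hand.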
