Paper Title
A Semiparametric Efficient Approach To Label Shift Estimation and Quantification
Paper Authors
Paper Abstract
Transfer Learning is an area of statistics and machine learning research that seeks answers to the following question: how do we build successful learning algorithms when the data available for training our model is qualitatively different from the data we hope the model will perform well on? In this thesis, we focus on a specific area of Transfer Learning called label shift, also known as quantification. In quantification, the aforementioned discrepancy is isolated to a shift in the distribution of the response variable. In such a setting, accurately inferring the response variable's new distribution is both an important estimation task in its own right and a crucial step for ensuring that the learning algorithm can adapt to the new data. We make two contributions to this field. First, we present a new procedure called SELSE which estimates the shift in the response variable's distribution. Second, we prove that SELSE is semiparametric efficient among a large family of quantification algorithms, i.e., SELSE's normalized error has the smallest possible asymptotic variance matrix compared to any other algorithm in that family. This family includes nearly all existing algorithms, including ACC/PACC quantifiers and maximum likelihood based quantifiers such as EMQ and MLLS. Empirical experiments reveal that SELSE is competitive with, and in many cases outperforms, existing state-of-the-art quantification methods, and that this improvement is especially large when the number of test samples is far greater than the number of train samples.
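For readers unfamiliar with the setting, the sketch below illustrates the quantification task using ACC (Adjusted Classify and Count), one of the baseline quantifiers named in the abstract. It is not SELSE, whose construction the abstract does not describe. The function name `acc_quantify`, its arguments, and the toy data are assumptions made for illustration only. The idea is that, under label shift, the distribution of a classifier's predictions on the test set equals the training confusion matrix applied to the unknown test class distribution, so the latter can be recovered by solving a linear system.

```python
import numpy as np

def acc_quantify(y_train, pred_train, pred_test, n_classes):
    """Hypothetical ACC sketch: estimate test class priors under label shift."""
    # C[i, j] estimates P(prediction = i | true label = j) from labelled training data.
    # Assumes every class appears at least once in y_train.
    C = np.zeros((n_classes, n_classes))
    for j in range(n_classes):
        mask = (y_train == j)
        C[:, j] = np.bincount(pred_train[mask], minlength=n_classes) / mask.sum()
    # q[i] is the fraction of (unlabelled) test points predicted as class i.
    q = np.bincount(pred_test, minlength=n_classes) / len(pred_test)
    # Under label shift, q = C @ pi_test; solve for pi_test in the least-squares sense.
    pi, *_ = np.linalg.lstsq(C, q, rcond=None)
    # Project back onto the probability simplex (clip negatives, renormalise).
    pi = np.clip(pi, 0.0, None)
    return pi / pi.sum()

# Toy data: balanced training set, imperfect classifier, shifted test set.
y_train = np.array([0] * 10 + [1] * 10)
pred_train = np.array([0] * 8 + [1] * 2 + [0] * 3 + [1] * 7)
pred_test = np.array([0] * 45 + [1] * 55)

pi_hat = acc_quantify(y_train, pred_train, pred_test, n_classes=2)
print(pi_hat)
```

In this toy example the raw fraction of test points predicted as class 1 is 0.55, while the corrected estimate accounts for the classifier's error rates. The abstract's efficiency claim concerns how small the asymptotic variance of such an estimator can be made within a family that contains estimators of this kind.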