CheckSel: Efficient and Accurate Data-valuation Through Online Checkpoint Selection

Paper title

CheckSel: Efficient and Accurate Data-valuation Through Online Checkpoint Selection

Authors

Soumi Das, Manasvi Sagarkar, Suparna Bhattacharya, Sourangshu Bhattacharya

Abstract

Data valuation and subset selection have emerged as valuable tools for application-specific selection of important training data. However, the efficiency-accuracy tradeoffs of state-of-the-art methods hinder their widespread application to many AI workflows. In this paper, we propose a novel 2-phase solution to this problem. Phase 1 selects representative checkpoints from an SGD-like training algorithm; phase 2 uses these checkpoints to estimate approximate training data values, e.g., the decrease in validation loss due to each training point. A key contribution of this paper is CheckSel, an Orthogonal Matching Pursuit-inspired online sparse approximation algorithm for checkpoint selection in the online setting, where the features are revealed one at a time. Another key contribution is the study of data valuation in the domain adaptation setting, where a data value estimator obtained using checkpoints from the training trajectory on the source domain training dataset is used for data valuation on a target domain training dataset. Experimental results on benchmark datasets show that the proposed algorithm outperforms recent baseline methods by up to 30% in terms of test accuracy while incurring a similar computational burden, in both the standalone and domain adaptation settings.
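
The abstract describes CheckSel as inspired by Orthogonal Matching Pursuit (OMP). For orientation only, the sketch below shows standard batch OMP in Python with NumPy. It is not the authors' online algorithm, which sees features one at a time. The names `omp`, `features`, `target`, and `k` are hypothetical, and reading each column as a per-checkpoint feature vector is an assumption made purely for illustration.

```python
import numpy as np


def omp(features, target, k):
    """Standard batch OMP (illustrative; not CheckSel itself).

    features: (n, m) array; each column is one candidate atom
              (assumed here to stand in for one checkpoint's feature vector).
    target:   (n,) vector to approximate.
    k:        number of columns to select.
    Returns the selected column indices and their least-squares weights.
    """
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    residual = target.copy()
    selected = []
    weights = np.zeros(0)
    for _ in range(k):
        # Score each column by its absolute correlation with the residual.
        scores = np.abs(features.T @ residual)
        if selected:
            scores[selected] = -np.inf  # never pick the same column twice
        selected.append(int(np.argmax(scores)))
        # Refit the weights jointly on all selected columns.
        chosen = features[:, selected]
        weights, *_ = np.linalg.lstsq(chosen, target, rcond=None)
        residual = target - chosen @ weights
    return selected, weights
```

A call such as `omp(features, target, 2)` returns two column indices and their fitted weights. Batch OMP needs every column up front, which is the restriction an online variant has to work around when checkpoints arrive sequentially during training.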
