Similarity Search for Efficient Active Learning and Search of Rare Concepts

Paper Title

Similarity Search for Efficient Active Learning and Search of Rare Concepts

Authors

Cody Coleman, Edward Chou, Julian Katz-Samuels, Sean Culatana, Peter Bailis, Alexander C. Berg, Robert Nowak, Roshan Sumbaly, Matei Zaharia, I. Zeki Yalniz

Abstract

Many active learning and search approaches are intractable for large-scale industrial settings with billions of unlabeled examples. Existing approaches search globally for the optimal examples to label, scaling linearly or even quadratically with the unlabeled data. In this paper, we improve the computational efficiency of active learning and search methods by restricting the candidate pool for labeling to the nearest neighbors of the currently labeled set instead of scanning over all of the unlabeled data. We evaluate several selection strategies in this setting on three large-scale computer vision datasets: ImageNet, OpenImages, and a de-identified and aggregated dataset of 10 billion images provided by a large internet company. Our approach achieved mean average precision and recall similar to those of the traditional global approach while reducing the computational cost of selection by up to three orders of magnitude, thus enabling web-scale active learning.
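The core idea in the abstract, scoring only the nearest neighbors of the labeled set rather than every unlabeled example, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names (`nearest_neighbors`, `candidate_pool`, `select_batch`), the neighbor count `k`, and the caller-supplied `score` function are all hypothetical. The brute-force neighbor search is used only to keep the example self-contained. It still scans all points, so it shows the selection logic but not the speedup, which at scale would presumably depend on a similarity search index.

```python
import math


def nearest_neighbors(query, points, k, exclude):
    """Indices of the k points closest to `query`, skipping indices in `exclude`.

    Brute force for illustration only; this is an assumed stand-in for a
    similarity search index, not the authors' implementation.
    """
    dists = [(math.dist(query, p), i) for i, p in enumerate(points) if i not in exclude]
    dists.sort()
    return [i for _, i in dists[:k]]


def candidate_pool(embeddings, labeled, k):
    """Union of the k nearest unlabeled neighbors of every labeled example."""
    pool = set()
    for i in labeled:
        pool.update(nearest_neighbors(embeddings[i], embeddings, k, exclude=labeled))
    return pool


def select_batch(embeddings, labeled, score, k, batch_size):
    """Score only the candidate pool and return the highest-scoring indices.

    `score` is a hypothetical selection criterion (for example, model
    uncertainty) supplied by the caller.
    """
    pool = candidate_pool(embeddings, labeled, k)
    ranked = sorted(pool, key=lambda i: score(embeddings[i]), reverse=True)
    return ranked[:batch_size]


# Toy data: two clusters on a line; only index 0 has been labeled so far.
embeddings = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
labeled = {0}
pool = candidate_pool(embeddings, labeled, k=2)
batch = select_batch(embeddings, labeled, score=lambda e: e[0], k=2, batch_size=1)
```

In this sketch the selection step touches at most `k * len(labeled)` candidates instead of the full unlabeled set. After each labeling round, the newly labeled indices would be added to `labeled` so that the pool grows outward from the examples found so far.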
