Paper Title

Active Learning from the Web

Author

Sato, Ryoma

Abstract

Labeling data is one of the most costly processes in machine learning pipelines. Active learning is a standard approach to alleviating this problem. Pool-based active learning first builds a pool of unlabeled data and then iteratively selects data to be labeled, so that the total number of required labels is minimized while model performance stays high. Many effective criteria for choosing data from the pool have been proposed in the literature. However, how to build the pool is less explored. Specifically, most methods assume that a task-specific pool is given for free. In this paper, we argue that such a task-specific pool is not always available and propose using the myriad of unlabeled data on the Web as the pool for active learning. Because the pool is extremely large, relevant data are likely to exist in it for many tasks, and we do not need to explicitly design and build a pool for each task. The challenge is that, due to the size of the pool, we cannot exhaustively compute the acquisition scores of all data. We propose an efficient method, Seafaring, that uses a user-side information retrieval algorithm to retrieve data from the Web that are informative for active learning. In the experiments, we use the online Flickr environment as the pool for active learning. This pool contains more than ten billion images and is several orders of magnitude larger than existing active learning pools in the literature. We confirm that our method performs better than existing approaches that use a small unlabeled pool.
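
To make the pool-based setting in the abstract concrete, here is a minimal sketch of one selection step using predictive entropy as the acquisition score. This is not Seafaring and not the paper's code: the entropy criterion, the function names `entropy_scores` and `select_batch`, and the toy probabilities are all assumptions chosen for illustration. Note that it scores every item in the pool, which is exactly the exhaustive computation the abstract says is infeasible for a Web-scale pool.

```python
import numpy as np


def entropy_scores(probs: np.ndarray) -> np.ndarray:
    """Predictive entropy per pool item; higher means the model is less certain."""
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=1)


def select_batch(probs: np.ndarray, batch_size: int) -> np.ndarray:
    """Return indices of the `batch_size` pool items with the highest entropy."""
    scores = entropy_scores(probs)
    return np.argsort(-scores)[:batch_size]


# Toy pool (illustrative values): predicted class probabilities
# for four unlabeled items under a two-class model.
pool_probs = np.array([
    [0.98, 0.02],
    [0.55, 0.45],
    [0.80, 0.20],
    [0.50, 0.50],
])

# Items to send to the annotator in this round.
chosen = select_batch(pool_probs, batch_size=2)
print(chosen)
```

In a full loop, the chosen items would be labeled, added to the training set, and the model retrained before the next round of scoring.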
