Curator: Creating Large-Scale Curated Labelled Datasets using Self-Supervised Learning

Paper title

Curator: Creating Large-Scale Curated Labelled Datasets using Self-Supervised Learning

Authors

Tarun Narayanan, Ajay Krishnan, Anirudh Koul, Siddha Ganju

Abstract

Applying machine learning to domains such as Earth sciences is impeded by the lack of labeled data, despite the large corpus of raw data available in such domains. For instance, training a wildfire classifier on satellite imagery requires curating a massive and diverse dataset, which is an expensive and time-consuming process that can span from weeks to months. Searching for relevant examples in over 40 petabytes of unlabeled data requires researchers to manually hunt for such images, much like finding a needle in a haystack. We present Curator, a no-code end-to-end pipeline that dramatically reduces the time taken to curate an exhaustive labeled dataset. Curator searches massive amounts of unlabeled data by combining self-supervision, scalable nearest neighbor search, and active learning to learn and differentiate image representations. The pipeline can also be readily applied to problems in other domains. Overall, it makes it practical for researchers to go from a single reference image to a comprehensive dataset in a short span of time.
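
The retrieval step named in the abstract (going from one reference image to similar candidates via nearest neighbor search over learned representations) might look roughly like the sketch below. This is not Curator's implementation: the embeddings are random placeholders standing in for the output of a self-supervised encoder, the search is brute-force cosine similarity rather than a scalable index, and the names `cosine_similarity` and `nearest_neighbors` are hypothetical.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, embeddings, k=3):
    """Return indices of the k embeddings most similar to the query (brute force)."""
    scored = [(cosine_similarity(query, e), i) for i, e in enumerate(embeddings)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# Assumption: placeholder embeddings. In a real pipeline these would come
# from a self-supervised image encoder, one vector per unlabeled image.
rng = random.Random(0)
embeddings = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(100)]

# Treat image 42 as the single reference image; the query is a slightly
# perturbed copy of its embedding.
query = [x + 0.01 * rng.gauss(0, 1) for x in embeddings[42]]
candidates = nearest_neighbors(query, embeddings, k=5)
```

In an active-learning loop, the returned candidates would be shown to a researcher for labeling, and those labels would guide the next round of search. At the data volumes the abstract describes, the brute-force scan would have to be replaced by an approximate nearest neighbor index.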
