Paper Title
Supervised and Unsupervised Learning of Audio Representations for Music Understanding
Paper Authors
Paper Abstract
In this work, we provide a broad comparative analysis of strategies for pre-training audio understanding models for several tasks in the music domain, including labelling of genre, era, origin, mood, instrumentation, key, pitch, vocal characteristics, tempo and sonority. Specifically, we explore how the domain of pre-training datasets (music or generic audio) and the pre-training methodology (supervised or unsupervised) affect the adequacy of the resulting audio embeddings for downstream tasks. We show that models trained via supervised learning on large-scale expert-annotated music datasets achieve state-of-the-art performance in a wide range of music labelling tasks, each with novel content and vocabularies. This can be done efficiently with models containing fewer than 100 million parameters that require no fine-tuning or reparameterization for downstream tasks, making this approach practical for industry-scale audio catalogs. Within the class of unsupervised learning strategies, we show that the domain of the training dataset can significantly impact the performance of the representations learned by the model. We find that restricting the domain of the pre-training dataset to music allows for training with smaller batch sizes while achieving state-of-the-art performance in unsupervised learning -- and, in some cases, supervised learning -- for music understanding. We also corroborate that, while achieving state-of-the-art performance on many tasks, supervised learning can cause models to specialize to the supervised information provided, somewhat compromising a model's generality.
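To make the evaluation protocol the abstract describes more concrete, below is a minimal sketch of using a pre-trained encoder as a frozen feature extractor and fitting only a lightweight linear probe per downstream labelling task (i.e., no fine-tuning or reparameterization). This is an illustrative assumption, not the paper's actual pipeline: the `embed` function, the 512-dimensional embedding width, and the simulated genre labels are hypothetical stand-ins.

```python
# Sketch of frozen-embedding evaluation: a fixed encoder produces embeddings,
# and only a shallow probe is trained for each downstream task.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

EMBED_DIM = 512                # assumed embedding width of the frozen encoder
N_TRACKS, N_GENRES = 1000, 10  # simulated downstream dataset size

def embed(waveforms):
    """Stand-in for a frozen pre-trained encoder mapping raw audio to
    fixed-size embeddings. In practice this would be a forward pass through
    a model with fewer than 100M parameters, with gradients disabled."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((len(waveforms), EMBED_DIM))

# Simulated data: one embedding and one genre label per track.
waveforms = [None] * N_TRACKS            # placeholder for raw audio clips
X = embed(waveforms)                     # frozen features, never updated
y = np.random.default_rng(1).integers(0, N_GENRES, size=N_TRACKS)

# The only trainable component is the probe, so adapting to a new task
# (genre, mood, key, ...) stays cheap even for industry-scale catalogs.
X_train, X_test, y_train, y_test = X[:800], X[800:], y[:800], y[800:]
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", accuracy_score(y_test, probe.predict(X_test)))
```

Because the encoder is shared and fixed, embeddings for an entire catalog can be computed once and reused across every labelling task; only the small per-task probe is retrained.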