Paper Title
CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos
Paper Authors
Paper Abstract
Recent years have seen progress beyond domain-specific sound separation for speech or music towards universal sound separation for arbitrary sounds. Prior work on universal sound separation has investigated separating a target sound out of an audio mixture given a text query. Such text-queried sound separation systems provide a natural and scalable interface for specifying arbitrary target sounds. However, supervised text-queried sound separation systems require costly labeled audio-text pairs for training. Moreover, the audio provided in existing datasets is often recorded in a controlled environment, causing a considerable generalization gap to noisy audio in the wild. In this work, we aim to approach text-queried universal sound separation by using only unlabeled data. We propose to leverage the visual modality as a bridge to learn the desired audio-textual correspondence. The proposed CLIPSep model first encodes the input query into a query vector using the contrastive language-image pretraining (CLIP) model, and the query vector is then used to condition an audio separation model to separate out the target sound. While the model is trained on image-audio pairs extracted from unlabeled videos, at test time we can instead query the model with text inputs in a zero-shot setting, thanks to the joint language-image embedding learned by the CLIP model. Further, videos in the wild often contain off-screen sounds and background noise that may hinder the model from learning the desired audio-textual correspondence. To address this problem, we further propose an approach called noise invariant training for training a query-based sound separation model on noisy data. Experimental results show that the proposed models successfully learn text-queried universal sound separation using only noisy unlabeled videos, even achieving competitive performance against a supervised model in some settings.
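The pipeline described above (a CLIP encoder producing a query vector that conditions a mask-based separator, queried with video frames during training and with free-form text at test time) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the open-source CLIP package (openai/CLIP), and the QueryConditionedSeparator class, its layer sizes, and the FiLM-style modulation are simplified placeholders for the actual separation network.

# Minimal sketch of a CLIP-queried sound separator (illustrative, not the paper's code).
# Assumes: pip install torch, and the "clip" package from openai/CLIP.
import torch
import torch.nn as nn
import clip


class QueryConditionedSeparator(nn.Module):
    """Predicts a spectrogram mask for the sound described by a CLIP query vector."""

    def __init__(self, embed_dim: int = 512, n_hidden: int = 64):
        super().__init__()
        # Project the 512-d CLIP embedding into a conditioning vector.
        self.proj = nn.Linear(embed_dim, n_hidden)
        # Toy separator: a small conv stack over the mixture spectrogram,
        # modulated by the query (FiLM-style scaling); a real system would
        # use a U-Net or similar here.
        self.enc = nn.Conv2d(1, n_hidden, kernel_size=3, padding=1)
        self.dec = nn.Conv2d(n_hidden, 1, kernel_size=3, padding=1)

    def forward(self, mix_spec: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # mix_spec: (B, 1, F, T) magnitude spectrogram of the audio mixture
        # query:    (B, 512) CLIP image or text embedding
        cond = self.proj(query)[:, :, None, None]       # (B, n_hidden, 1, 1)
        h = torch.relu(self.enc(mix_spec)) * cond       # query-modulated features
        return torch.sigmoid(self.dec(h))               # soft mask in [0, 1]


device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
separator = QueryConditionedSeparator().to(device)

# Training time: query with preprocessed video frames (no labels needed), e.g.
#     q = clip_model.encode_image(frames).float()
# Test time: query the same separator with free-form text (zero-shot).
with torch.no_grad():
    tokens = clip.tokenize(["a person playing violin"]).to(device)
    q = clip_model.encode_text(tokens).float()          # (1, 512)

mix_spec = torch.rand(1, 1, 512, 256, device=device)    # placeholder mixture spectrogram
mask = separator(mix_spec, q)                            # estimated mask for the target sound
target_spec = mask * mix_spec

Because training and inference share the same query interface (a single 512-d CLIP embedding), swapping image queries for text queries at test time requires no retraining, which is what the zero-shot setting in the abstract refers to.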