Paper title

Zero-Shot Audio Classification via Semantic Embeddings

Authors

Huang Xie, Tuomas Virtanen

Abstract

In this paper, we study zero-shot learning in audio classification via semantic embeddings extracted from textual labels and sentence descriptions of sound classes. Our goal is to obtain a classifier that is capable of recognizing audio instances of sound classes that have no available training samples, but only semantic side information. We employ a bilinear compatibility framework to learn an acoustic-semantic projection between intermediate-level representations of audio instances and sound classes, i.e., acoustic embeddings and semantic embeddings. We use VGGish to extract deep acoustic embeddings from audio clips, and pre-trained language models (Word2Vec, GloVe, BERT) to generate either label embeddings from textual labels or sentence embeddings from sentence descriptions of sound classes. Audio classification is performed by a linear compatibility function that measures how compatible an acoustic embedding and a semantic embedding are. We evaluate the proposed method on a small balanced dataset ESC-50 and a large-scale unbalanced audio subset of AudioSet. The experimental results show that classification performance is significantly improved by involving sound classes that are semantically close to the test classes in training. Meanwhile, we demonstrate that both label embeddings and sentence embeddings are useful for zero-shot learning. Classification performance is improved by concatenating label/sentence embeddings generated with different language models. With their hybrid concatenations, the results are improved further.
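The bilinear compatibility idea in the abstract can be illustrated with a small sketch. It scores an acoustic embedding against each class's semantic embedding through a projection matrix and picks the best-scoring class, so classes with no training audio can still be predicted from their semantic embeddings alone. Everything concrete here is an assumption for illustration: the function names, the toy dimensions, and the hand-set matrix `W`, which in the paper's setting would be learned from the training classes (the abstract does not state the training objective). Real inputs would be VGGish acoustic embeddings and Word2Vec/GloVe/BERT embeddings rather than these toy vectors.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def compatibility(acoustic: Vector, W: Matrix, semantic: Vector) -> float:
    """Bilinear score acoustic^T * W * semantic (W has shape len(acoustic) x len(semantic))."""
    return sum(
        a * sum(w * s for w, s in zip(row, semantic))
        for a, row in zip(acoustic, W)
    )


def concatenate(*embeddings: Vector) -> List[float]:
    """Join several semantic embeddings (e.g. from different language models) into one vector."""
    joined: List[float] = []
    for embedding in embeddings:
        joined.extend(embedding)
    return joined


def predict(acoustic: Vector, W: Matrix, class_embeddings: Dict[str, Vector]) -> str:
    """Return the class whose semantic embedding is most compatible with the audio clip."""
    return max(
        class_embeddings,
        key=lambda name: compatibility(acoustic, W, class_embeddings[name]),
    )


# Toy example: 2-dimensional acoustic space, 3-dimensional semantic space.
# W is set by hand here; it stands in for a projection learned on seen classes.
W = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]

# Hypothetical unseen classes, each described by a concatenation of a
# 1-dimensional "label" embedding and a 2-dimensional "sentence" embedding.
unseen_classes = {
    "dog_bark": concatenate([1.0], [0.0, 0.0]),
    "rain": concatenate([0.0], [1.0, 0.5]),
}

clip_embedding = [0.2, 0.9]  # stand-in for an acoustic embedding of one clip
predicted = predict(clip_embedding, W, unseen_classes)
print(predicted)  # rain
```

In this toy setup the clip scores 0.2 against "dog_bark" and 0.9 against "rain", so "rain" is selected. The concatenation helper mirrors the abstract's point that combining embeddings from different language models gives a richer semantic vector, at the cost of a wider `W`.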
