Paper Title
Rethinking CNN Models for Audio Classification
Paper Authors
Paper Abstract
In this paper, we show that standard deep CNN models pretrained on ImageNet can be used as strong baseline networks for audio classification. Even though there are significant differences between audio spectrograms and standard ImageNet image samples, the transfer learning assumptions still hold firmly. To understand what enables ImageNet-pretrained models to learn useful audio representations, we systematically study how much of the pretrained weights is useful for learning spectrograms. We show (1) that for a given standard model, using pretrained weights is better than using randomly initialized weights, and (2) qualitative results of what the CNNs learn from the spectrograms by visualizing the gradients. Furthermore, we show that even when we initialize with pretrained model weights, there is variance in performance across multiple runs of the same model. This variance is due to the random initialization of the linear classification layer and random mini-batch ordering across runs. This brings significant diversity that can be exploited to build stronger ensemble models, with an overall improvement in accuracy. An ensemble of ImageNet-pretrained DenseNet models achieves 92.89% validation accuracy on the ESC-50 dataset and 87.42% validation accuracy on the UrbanSound8K dataset, which is the current state-of-the-art on both datasets.
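The setup described above can be illustrated with a minimal sketch: load an ImageNet-pretrained DenseNet backbone and swap its 1000-way classifier for a randomly initialized linear layer over the audio classes. This sketch assumes PyTorch/torchvision; the specific DenseNet variant, input resolution, and 50-class output (ESC-50) are illustrative assumptions, not the paper's exact training configuration.

```python
# Minimal sketch: ImageNet-pretrained DenseNet adapted for spectrogram classification.
# Assumptions: PyTorch/torchvision, densenet201 as the backbone, 50 classes (ESC-50),
# spectrograms replicated to 3 channels so the pretrained first conv can be reused.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 50  # assumed: ESC-50 has 50 sound classes

# Load the convolutional backbone with ImageNet-pretrained weights.
model = models.densenet201(pretrained=True)

# Replace the ImageNet classifier with a randomly initialized linear layer.
# This layer's random initialization (plus mini-batch ordering) is the source
# of the run-to-run variance exploited for ensembling in the abstract.
model.classifier = nn.Linear(model.classifier.in_features, NUM_CLASSES)

# Dummy batch of 3-channel spectrogram "images" (batch, channels, height, width).
dummy_spectrograms = torch.randn(4, 3, 224, 224)
logits = model(dummy_spectrograms)
print(logits.shape)  # torch.Size([4, 50])
```

An ensemble in this spirit would simply train several such models from the same pretrained backbone (each run differing only in classifier initialization and mini-batch order) and average their predicted probabilities at test time.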