Title
CAA-Net: Conditional Atrous CNNs with Attention for Explainable Device-robust Acoustic Scene Classification
Authors
Abstract
Acoustic Scene Classification (ASC) aims to classify the environment in which the audio signals are recorded. Recently, Convolutional Neural Networks (CNNs) have been successfully applied to ASC. However, the data distributions of the audio signals recorded with multiple devices are different. There has been little research on the training of robust neural networks on acoustic scene datasets recorded with multiple devices, and on explaining the operation of the internal layers of the neural networks. In this article, we focus on training and explaining device-robust CNNs on multi-device acoustic scene data. We propose conditional atrous CNNs with attention for multi-device ASC. Our proposed system contains an ASC branch and a device classification branch, both modelled by CNNs. We visualise and analyse the intermediate layers of the atrous CNNs. A time-frequency attention mechanism is employed to analyse the contribution of each time-frequency bin of the feature maps in the CNNs. On the Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 ASC dataset, recorded with three devices, our proposed model performs significantly better than CNNs trained on single-device data.
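The abstract's time-frequency attention mechanism scores each time-frequency bin of a CNN feature map and pools the map by those weights, which is what makes the bins' contributions inspectable. The following is a minimal NumPy sketch of that idea only; the shapes, the 1x1 scoring projection, and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a flat array
    e = np.exp(x - x.max())
    return e / e.sum()

def tf_attention_pool(feature_map, w_att):
    """Attention-weighted pooling over time-frequency bins (illustrative).

    feature_map: (C, T, F) CNN activations (channels, time, frequency).
    w_att: (C,) assumed 1x1 projection used to score each (t, f) bin.
    Returns the pooled (C,) vector and the (T, F) attention weights,
    which can be visualised to see each bin's contribution.
    """
    C, T, F = feature_map.shape
    # Score every time-frequency bin, then normalise scores to sum to 1
    scores = np.tensordot(w_att, feature_map, axes=([0], [0]))  # (T, F)
    attention = softmax(scores.ravel()).reshape(T, F)
    # Attention-weighted sum over all bins -> a single embedding vector
    pooled = (feature_map * attention).sum(axis=(1, 2))  # (C,)
    return pooled, attention

# Toy example: random feature map from a hypothetical conv layer
rng = np.random.default_rng(0)
fmap = rng.standard_normal((8, 4, 5))
w = rng.standard_normal(8)
pooled, att = tf_attention_pool(fmap, w)
```

Because `att` is an explicit (T, F) map that sums to 1, plotting it over the input spectrogram shows which time-frequency regions drive the classification, which is the explainability angle the abstract describes.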