Paper Title
AudioTagging Done Right: 2nd comparison of deep learning methods for environmental sound classification
Paper Authors
Paper Abstract
After their sweeping success in vision and language tasks, pure attention-based neural architectures (e.g. DeiT) are rising to the top of audio tagging (AT) leaderboards, seemingly rendering obsolete traditional convolutional neural networks (CNNs), feed-forward networks, and recurrent networks. However, a closer look reveals great variability in published research: for instance, models initialized with pretrained weights perform drastically differently from those trained without pretraining, training time for a model varies from hours to weeks, and essential factors are often hidden in seemingly trivial details. This urgently calls for a comprehensive study, since our first comparison is half a decade old. In this work, we perform extensive experiments on AudioSet, the largest weakly-labeled sound event dataset available, and also analyze the results with respect to data quality and efficiency. We compare several state-of-the-art baselines on the AT task and study the performance and efficiency of two major categories of neural architectures: CNN variants and attention-based variants. We also closely examine their optimization procedures. Our open-sourced experimental results provide insights into the trade-offs among performance, efficiency, and optimization process for both practitioners and researchers. Implementation: https://github.com/lijuncheng16/AudioTaggingDoneRight