Small Footprint Multi-channel ConvMixer for Keyword Spotting with Centroid Based Awareness

Paper Title

Small Footprint Multi-channel ConvMixer for Keyword Spotting with Centroid Based Awareness

Authors

Dianwen Ng, Jin Hui Pang, Yang Xiao, Biao Tian, Qiang Fu, Eng Siong Chng

Abstract

It is critical for a keyword spotting model to have a small footprint, as it typically runs on-device with low computational resources. However, maintaining the previous SOTA performance with a reduced model size is challenging. In addition, a far-field and noisy environment with interference from multiple signals aggravates the problem, causing the accuracy to degrade significantly. In this paper, we present a multi-channel ConvMixer for speech command recognition. The novel architecture introduces an additional audio channel mixing for channel audio interaction in a multi-channel audio setting to achieve better noise-robust features with more efficient computation. In addition, we propose a centroid based awareness component to enhance the system by equipping it with additional spatial geometry information in the latent feature projection space. We evaluate our model using the new MISP challenge 2021 dataset. Our model achieves significant improvement against the official baseline, with a 55% gain in the competition score (0.152) on raw microphone array input and a 63% (0.126) boost upon front-end speech enhancement.
