Paper Title


Audio-Visual Wake Word Spotting System For MISP Challenge 2021

Authors

Xu, Yanguang; Sun, Jianwei; Han, Yang; Zhao, Shuaijiang; Mei, Chaoyang; Guo, Tingwei; Zhou, Shuran; Xie, Chuandong; Zou, Wei; Li, Xiangang

Abstract


This paper presents the details of our system designed for Task 1 of the Multimodal Information Based Speech Processing (MISP) Challenge 2021. The purpose of Task 1 is to leverage both audio and video information to improve the environmental robustness of far-field wake word spotting. In the proposed system, firstly, we take advantage of speech enhancement algorithms such as beamforming and weighted prediction error (WPE) to address the multi-microphone conversational audio. Secondly, several data augmentation techniques are applied to simulate a more realistic far-field scenario. For the video information, the provided region of interest (ROI) is used to obtain the visual representation. Then a multi-layer CNN is proposed to learn audio and visual representations, and these representations are fed into our two-branch attention-based network, which can employ architectures such as the Transformer and Conformer for fusion. The focal loss is used to fine-tune the model and significantly improve the performance. Finally, multiple trained models are integrated by majority voting to achieve our final score of 0.091.
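The abstract names the focal loss as the fine-tuning objective but does not give its configuration. A minimal sketch of the standard binary focal loss (Lin et al.'s formulation), with hypothetical `gamma` and `alpha` values, illustrates how it down-weights easy examples so training focuses on hard, misclassified ones:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one example.

    p     -- predicted probability of the positive (wake word) class
    y     -- ground-truth label, 1 for wake word, 0 otherwise
    gamma -- focusing parameter; gamma=0 recovers weighted cross-entropy
    alpha -- class-balance weight for the positive class

    gamma and alpha here are the common defaults, not values from the paper.
    """
    p_t = p if y == 1 else 1.0 - p            # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # The (1 - p_t)^gamma modulating factor shrinks the loss of
    # well-classified examples (p_t near 1) toward zero.
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

# A confident correct prediction is heavily down-weighted,
# while a hard, misclassified example keeps a large loss.
easy = focal_loss(0.9, 1)
hard = focal_loss(0.1, 1)
assert hard > easy
```

With `gamma=0` and `alpha=1` the function reduces to plain cross-entropy, which makes the effect of the modulating factor easy to check in isolation.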
