Paper Title
An Empirical Study of Visual Features for DNN based Audio-Visual Speech Enhancement in Multi-talker Environments
Paper Authors
Paper Abstract
Audio-visual speech enhancement (AVSE) methods use both audio and visual features for the task of speech enhancement, and the use of visual features has been shown to be particularly effective in multi-speaker scenarios. In the majority of deep neural network (DNN) based AVSE methods, the audio and visual data are first processed separately using different sub-networks, and the learned features are then fused to utilize the information from both modalities. There have been various studies on suitable audio input features and network architectures; however, to the best of our knowledge, there is no published study that has investigated which visual features are best suited for this specific task. In this work, we perform an empirical study of the most commonly used visual features for DNN based AVSE and the pre-processing requirements for each of these features, and investigate their influence on performance. Our study shows that despite the overall better performance of embedding-based features, their computationally intensive pre-processing makes their use difficult in low-resource systems. For such systems, optical flow or raw pixel-based features might be better suited.
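The abstract describes a common late-fusion pattern: separate audio and visual sub-networks whose learned features are concatenated before estimating an enhancement output. Below is a minimal PyTorch sketch of that pattern, not the authors' implementation; the class name, layer sizes, and the assumption that visual features are already time-aligned to the audio frames are all illustrative.

```python
# Minimal sketch of a late-fusion AVSE network (illustrative, not the
# paper's architecture). Audio and visual streams are encoded by
# separate sub-networks, then concatenated and decoded into a
# time-frequency mask applied to the noisy spectrogram.
import torch
import torch.nn as nn

class AVFusionNet(nn.Module):
    def __init__(self, n_freq=257, n_vis=512, n_hidden=256):
        super().__init__()
        # Audio sub-network: encodes noisy magnitude-spectrogram frames.
        self.audio_net = nn.LSTM(n_freq, n_hidden, batch_first=True)
        # Visual sub-network: encodes per-frame visual features
        # (e.g. lip embeddings, optical flow, or raw pixels), assumed
        # here to be upsampled to the audio frame rate already.
        self.visual_net = nn.LSTM(n_vis, n_hidden, batch_first=True)
        # Fusion: features from both modalities are concatenated and
        # mapped to a sigmoid mask for the target speaker.
        self.fusion = nn.Sequential(
            nn.Linear(2 * n_hidden, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, n_freq),
            nn.Sigmoid(),
        )

    def forward(self, noisy_spec, visual_feats):
        a, _ = self.audio_net(noisy_spec)     # (B, T, n_hidden)
        v, _ = self.visual_net(visual_feats)  # (B, T, n_hidden)
        mask = self.fusion(torch.cat([a, v], dim=-1))
        return mask * noisy_spec              # enhanced spectrogram

# Example: 100 aligned audio/visual frames.
net = AVFusionNet()
enhanced = net(torch.rand(1, 100, 257), torch.rand(1, 100, 512))
print(enhanced.shape)  # torch.Size([1, 100, 257])
```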
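To make the pre-processing contrast concrete: the two lightweight feature types the study recommends for low-resource systems need little more than the functions sketched below, whereas embedding-based features require a forward pass through a deep visual network per frame. This is a hedged illustration using OpenCV; crop sizes and Farneback parameters are assumptions, not the paper's settings.

```python
# Sketch of the two low-cost visual features mentioned in the abstract
# (illustrative parameters). Inputs are grayscale mouth-region crops.
import cv2
import numpy as np

def raw_pixel_feature(gray_mouth_crop, size=(64, 64)):
    """Raw pixels: resize the crop and scale intensities to [0, 1]."""
    return cv2.resize(gray_mouth_crop, size).astype(np.float32) / 255.0

def optical_flow_feature(prev_gray, curr_gray):
    """Dense Farneback optical flow between consecutive crops, (H, W, 2)."""
    return cv2.calcOpticalFlowFarneback(
        prev_gray, curr_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
```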