Paper Title
Correlating Subword Articulation with Lip Shapes for Embedding Aware Audio-Visual Speech Enhancement

Authors

Hang Chen, Jun Du, Yu Hu, Li-Rong Dai, Bao-Cai Yin, Chin-Hui Lee

Abstract
In this paper, we propose a visual embedding approach to improving embedding aware speech enhancement (EASE) by synchronizing visual lip frames at the phone and place of articulation levels. We first extract visual embedding from lip frames using a pre-trained phone or articulation place recognizer for visual-only EASE (VEASE). Next, we extract audio-visual embedding from noisy speech and lip videos in an information intersection manner, utilizing the complementarity of audio and visual features for multi-modal EASE (MEASE). Experiments on the TCD-TIMIT corpus corrupted by simulated additive noises show that our proposed subword-based VEASE approach is more effective than conventional embedding at the word level. Moreover, visual embedding at the articulation place level, leveraging the high correlation between place of articulation and lip shapes, shows an even better performance than that at the phone level. Finally, the proposed MEASE framework, incorporating both audio and visual embedding, yields significantly better speech quality and intelligibility than those obtained with the best visual-only and audio-only EASE systems.