SVTS: Scalable Video-to-Speech Synthesis

Paper Title

SVTS: Scalable Video-to-Speech Synthesis

Authors

Rodrigo Mira, Alexandros Haliassos, Stavros Petridis, Björn W. Schuller, Maja Pantic

Abstract

Video-to-speech synthesis (also known as lip-to-speech) refers to the translation of silent lip movements into the corresponding audio. This task has received an increasing amount of attention due to its self-supervised nature (i.e., it can be trained without manual labelling) combined with the ever-growing collection of audio-visual data available online. Despite these strong motivations, contemporary video-to-speech works focus mainly on small- to medium-sized corpora with substantial constraints in both vocabulary and setting. In this work, we introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder, which converts the mel-frequency spectrograms into waveform audio. We achieve state-of-the-art results on GRID and considerably outperform previous approaches on LRW. More importantly, by focusing on spectrogram prediction using a simple feedforward model, we can efficiently and effectively scale our method to very large and unconstrained datasets: to the best of our knowledge, we are the first to show intelligible results on the challenging LRS3 dataset.
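The two-component design described in the abstract can be pictured as a simple composition of stages. The following is a minimal sketch only, not the authors' implementation: the names `VideoToSpeech`, `dummy_predictor`, and `dummy_vocoder` are hypothetical, both stages are placeholders that emit zeros, and the frame rate, sample rate, hop length, and number of mel bands are assumed values chosen for illustration rather than figures stated in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, List

# All constants are illustrative assumptions, not values from the abstract.
VIDEO_FPS = 25         # assumed video frame rate
SAMPLE_RATE = 16_000   # assumed audio sample rate in Hz
HOP_LENGTH = 160       # assumed audio samples per mel frame
N_MELS = 80            # assumed number of mel bands

MelSpectrogram = List[List[float]]  # [mel_frames][N_MELS]
Waveform = List[float]              # [audio_samples]


def mel_frames_per_video_frame() -> int:
    """Number of mel frames that span one video frame under the assumed rates."""
    return SAMPLE_RATE // (HOP_LENGTH * VIDEO_FPS)


def dummy_predictor(video_frames: List[object]) -> MelSpectrogram:
    """Placeholder for the video-to-spectrogram predictor (emits zeros)."""
    n_frames = len(video_frames) * mel_frames_per_video_frame()
    return [[0.0] * N_MELS for _ in range(n_frames)]


def dummy_vocoder(mel: MelSpectrogram) -> Waveform:
    """Placeholder for the pre-trained neural vocoder (emits silence)."""
    return [0.0] * (len(mel) * HOP_LENGTH)


@dataclass
class VideoToSpeech:
    """Hypothetical wrapper chaining the two stages named in the abstract."""
    predictor: Callable[[List[object]], MelSpectrogram]
    vocoder: Callable[[MelSpectrogram], Waveform]

    def synthesize(self, video_frames: List[object]) -> Waveform:
        mel = self.predictor(video_frames)  # stage 1: silent video -> mel spectrogram
        return self.vocoder(mel)            # stage 2: mel spectrogram -> waveform


pipeline = VideoToSpeech(dummy_predictor, dummy_vocoder)
waveform = pipeline.synthesize([None] * 50)  # 50 placeholder frames = 2 s at 25 fps
```

Because the vocoder is pre-trained and sits behind a fixed interface (mel spectrogram in, waveform out), only the predictor would need to be trained on a new corpus in a setup like this, which is one plausible reading of why the abstract emphasises focusing on spectrogram prediction.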
