Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos

Paper Title

Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos

Authors

Alexander Waibel, Moritz Behr, Fevziye Irem Eyiokur, Dogucan Yaman, Tuan-Nam Nguyen, Carlos Mullov, Mehmet Arif Demirtas, Alperen Kantarcı, Stefan Constantin, Hazım Kemal Ekenel

Abstract

In this paper, we propose a neural end-to-end system for voice-preserving, lip-synchronous translation of videos. The system combines multiple component models and produces a video of the original speaker speaking in the target language that is lip-synchronous with the target speech, yet maintains the emphases in speech, the voice characteristics, and the face video of the original speaker. The pipeline starts with automatic speech recognition including emphasis detection, followed by a translation model. The translated text is then synthesized by a text-to-speech model that recreates the original emphases mapped from the original sentence. The resulting synthetic voice is then mapped back to the original speaker's voice using a voice conversion model. Finally, to synchronize the lips of the speaker with the translated audio, a conditional generative adversarial network-based model generates frames of adapted lip movements with respect to the input face image as well as the output of the voice conversion model. In the end, the system combines the generated video with the converted audio to produce the final output. The result is a video of a speaker speaking in another language without actually knowing it. To evaluate our design, we present a user study of the complete system as well as separate evaluations of the individual components. Since there is no available dataset to evaluate our whole system, we collect a test set and evaluate our system on it. The results indicate that our system is able to generate convincing videos of the original speaker speaking the target language while preserving the original speaker's characteristics. The collected dataset will be shared.
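
The order of stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Stages` container, the `dub` function, the stub stages, and the file names are all hypothetical, and each real stage would be a trained model rather than a string-building function. The sketch only shows how data would flow from one component to the next, assuming emphasis is represented as a list of emphasized word indices.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical type alias: a word sequence plus the indices of emphasized words.
Words = Tuple[List[str], List[int]]


@dataclass
class Stages:
    """Container for the six pipeline components (all names are assumptions)."""
    recognize: Callable[[str], Words]                  # ASR with emphasis detection
    translate: Callable[[List[str], List[int]], Words]  # translation, emphasis mapped over
    synthesize: Callable[[List[str], List[int]], str]   # text-to-speech with emphasis
    convert_voice: Callable[[str, str], str]            # map TTS voice to original speaker
    sync_lips: Callable[[str, str], str]                # generate lip-synced frames
    mux: Callable[[str, str], str]                      # combine frames and audio


def dub(video: str, audio: str, s: Stages) -> str:
    """Run the stages in the order given in the abstract."""
    words, emphasis = s.recognize(audio)
    target_words, target_emphasis = s.translate(words, emphasis)
    tts_audio = s.synthesize(target_words, target_emphasis)
    # The original audio serves as the reference for the speaker's voice.
    converted = s.convert_voice(tts_audio, audio)
    # Lip generation is conditioned on the face frames and the converted audio.
    frames = s.sync_lips(video, converted)
    return s.mux(frames, converted)


# Stub stages that only build descriptive strings, so the data flow is visible.
STUB = Stages(
    recognize=lambda audio: (["hello", "world"], [1]),
    translate=lambda words, emph: (["hallo", "welt"], emph),
    synthesize=lambda words, emph: "tts("
    + " ".join(w.upper() if i in emph else w for i, w in enumerate(words))
    + ")",
    convert_voice=lambda speech, ref: f"vc({speech}, ref={ref})",
    sync_lips=lambda frames, speech: f"lips({frames})",
    mux=lambda frames, speech: f"{frames} + {speech}",
)

if __name__ == "__main__":
    print(dub("video.mp4", "audio.wav", STUB))
```

Passing the stages in as callables keeps the sketch independent of any particular model; swapping a stub for a real recognizer or voice converter would not change `dub` itself.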
