Paper Title

Show Me Your Face, And I'll Tell You How You Speak

Authors

Christen Millerdurai, Lotfy Abdel Khaliq, Timon Ulrich

Abstract

When we speak, the prosody and content of the speech can be inferred from the movement of our lips. In this work, we explore the task of lip-to-speech synthesis, i.e., learning to generate speech given only the lip movements of a speaker, focusing on accurate lip-to-speech mappings for multiple speakers in unconstrained, large-vocabulary settings. We capture a speaker's voice identity through facial characteristics, i.e., age, gender, and ethnicity, and condition the synthesis on these along with the lip movements to generate speaker-identity-aware speech. To this end, we present a novel method, "Lip2Speech", with key design choices that achieve accurate lip-to-speech synthesis in unconstrained scenarios. We also perform various experiments and an extensive evaluation using quantitative and qualitative metrics as well as human evaluation.
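The abstract describes one core mechanism: lip-movement features are conditioned on a face-derived speaker embedding so that the generated speech carries the speaker's identity. As a rough illustration of that conditioning pattern (not the paper's actual Lip2Speech architecture, whose details are not given in this excerpt), here is a minimal PyTorch sketch; every module name, dimension, and the choice of a mel-spectrogram output are assumptions for illustration only.

```python
# Minimal sketch of speaker-identity-aware lip-to-speech conditioning.
# All module names, dimensions, and the mel-spectrogram target are
# hypothetical illustrations, not the paper's actual architecture.
import torch
import torch.nn as nn

class Lip2SpeechSketch(nn.Module):
    def __init__(self, face_dim=128, lip_dim=256, mel_bins=80):
        super().__init__()
        # 3D conv stack summarizing the lip-region video into per-frame features
        self.lip_encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep time axis, pool space
        )
        self.lip_proj = nn.Linear(32, lip_dim)
        # Static face embedding stands in for identity cues (age, gender, ethnicity)
        self.face_encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(3 * 64 * 64, face_dim),
            nn.ReLU(),
        )
        # Temporal decoder maps conditioned features to mel-spectrogram frames
        self.decoder = nn.GRU(lip_dim + face_dim, 256, batch_first=True)
        self.mel_head = nn.Linear(256, mel_bins)

    def forward(self, lip_video, face_image):
        # lip_video: (B, 3, T, H, W); face_image: (B, 3, 64, 64)
        feats = self.lip_encoder(lip_video)                    # (B, 32, T, 1, 1)
        feats = feats.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, 32)
        feats = self.lip_proj(feats)                           # (B, T, lip_dim)
        identity = self.face_encoder(face_image)               # (B, face_dim)
        # Broadcast the static identity embedding to every time step
        identity = identity.unsqueeze(1).expand(-1, feats.size(1), -1)
        hidden, _ = self.decoder(torch.cat([feats, identity], dim=-1))
        return self.mel_head(hidden)                           # (B, T, mel_bins)
```

The one design point this sketch tries to make concrete is the conditioning: the face embedding is computed once per clip and concatenated onto the lip features at every time step, so identity information is available to the decoder throughout synthesis while the lip stream alone drives the temporal content.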
