Paper Title

Facial Keypoint Sequence Generation from Audio

Paper Authors

Prateek Manocha, Prithwijit Guha

Paper Abstract

Whenever we speak, our voice is accompanied by facial movements and expressions. Several recent works have shown the synthesis of highly photo-realistic videos of talking faces, but they either require a source video to drive the target face or only generate videos with a fixed head pose. This lack of facial movement arises because most of these works focus on lip movement in sync with the audio while assuming that the remaining facial keypoints stay fixed. To address this, a unique audio-keypoint dataset of over 150,000 videos at 224p and 25 fps is introduced that relates facial keypoint movement to the given audio. This dataset is then used to train Audio2Keypoint, a novel approach for synthesizing facial keypoint movement to accompany the audio. Given a single image of the target person and an audio sequence (in any language), Audio2Keypoint generates a plausible keypoint movement sequence in sync with the input audio, conditioned on the input image to preserve the target person's facial characteristics. To the best of our knowledge, this is the first work that proposes an audio-keypoint dataset and learns a model to output a plausible keypoint sequence to go with audio of arbitrary length. Audio2Keypoint generalizes across unseen people with different facial structures, allowing sequences to be generated with the voice from any source, or even with synthetic voices. Instead of learning a direct mapping from the audio to the video domain, this work aims to learn an audio-keypoint mapping that allows for in-plane and out-of-plane head rotations while preserving the person's identity using a Pose Invariant (PIV) encoder.
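
The abstract fixes the interface of Audio2Keypoint: a single reference image of the target person (reduced to its facial keypoints) plus an audio sequence of arbitrary length go in, and a frame-aligned keypoint sequence comes out, with the Pose Invariant (PIV) encoder supplying identity conditioning. The sketch below only illustrates that input/output contract in PyTorch; the layer sizes, the GRU audio encoder, the mel-band audio features, and the displacement-based decoding are assumptions for illustration, not the paper's actual architecture.

```python
# Minimal sketch (not the authors' implementation) of the Audio2Keypoint interface:
# reference keypoints + frame-aligned audio features -> per-frame keypoint sequence,
# with a pose-invariant (PIV) identity embedding for conditioning.
# All shapes and module choices below are illustrative assumptions.
import torch
import torch.nn as nn


class PIVEncoder(nn.Module):
    """Encodes reference keypoints into a pose-invariant identity vector (assumed MLP)."""

    def __init__(self, num_keypoints=68, identity_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_keypoints * 2, 256), nn.ReLU(),
            nn.Linear(256, identity_dim),
        )

    def forward(self, ref_keypoints):                  # (B, K, 2)
        return self.net(ref_keypoints.flatten(1))      # (B, D)


class Audio2Keypoint(nn.Module):
    """Maps an audio feature sequence + identity embedding to per-frame keypoints."""

    def __init__(self, num_keypoints=68, audio_dim=80, identity_dim=128, hidden=256):
        super().__init__()
        self.piv = PIVEncoder(num_keypoints, identity_dim)
        # Temporal audio encoder: one feature vector per video frame (assumed 25 fps alignment).
        self.audio_enc = nn.GRU(audio_dim, hidden, batch_first=True)
        # Decoder predicts per-frame keypoint displacements from audio + identity.
        self.decoder = nn.Sequential(
            nn.Linear(hidden + identity_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_keypoints * 2),
        )
        self.num_keypoints = num_keypoints

    def forward(self, audio_feats, ref_keypoints):
        # audio_feats: (B, T, audio_dim) frame-aligned audio features (e.g. mel bands)
        # ref_keypoints: (B, K, 2) keypoints detected on the single reference image
        B, T, _ = audio_feats.shape
        identity = self.piv(ref_keypoints)                        # (B, D)
        audio_seq, _ = self.audio_enc(audio_feats)                # (B, T, H)
        cond = identity.unsqueeze(1).expand(-1, T, -1)            # (B, T, D)
        delta = self.decoder(torch.cat([audio_seq, cond], -1))    # (B, T, K*2)
        delta = delta.view(B, T, self.num_keypoints, 2)
        # Predicted sequence = reference keypoints + audio-driven displacements.
        return ref_keypoints.unsqueeze(1) + delta                 # (B, T, K, 2)


if __name__ == "__main__":
    model = Audio2Keypoint()
    audio = torch.randn(1, 50, 80)      # 2 s of audio at 25 frames/s (assumed)
    ref_kp = torch.randn(1, 68, 2)      # keypoints from one target-person image
    print(model(audio, ref_kp).shape)   # torch.Size([1, 50, 68, 2])
```

Because the audio encoder and decoder operate frame by frame, this interface naturally handles audio of arbitrary length, matching the claim in the abstract.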
