Talking Head from Speech Audio using a Pre-trained Image Generator

Paper title

Talking Head from Speech Audio using a Pre-trained Image Generator

Authors

Alghamdi, Mohammed M., Wang, He, Bulpitt, Andrew J., Hogg, David C.

Abstract

We propose a novel method for generating high-resolution videos of talking-heads from speech audio and a single 'identity' image. Our method is based on a convolutional neural network model that incorporates a pre-trained StyleGAN generator. We model each frame as a point in the latent space of StyleGAN so that a video corresponds to a trajectory through the latent space. Training the network is in two stages. The first stage is to model trajectories in the latent space conditioned on speech utterances. To do this, we use an existing encoder to invert the generator, mapping from each video frame into the latent space. We train a recurrent neural network to map from speech utterances to displacements in the latent space of the image generator. These displacements are relative to the back-projection into the latent space of an identity image chosen from the individuals depicted in the training dataset. In the second stage, we improve the visual quality of the generated videos by tuning the image generator on a single image or a short video of any chosen identity. We evaluate our model on standard measures (PSNR, SSIM, FID and LMD) and show that it significantly outperforms recent state-of-the-art methods on one of two commonly used datasets and gives comparable performance on the other. Finally, we report on ablation experiments that validate the components of the model. The code and videos from experiments can be found at https://mohammedalghamdi.github.io/talking-heads-acm-mm
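
The abstract's central idea, a video as a trajectory through latent space built from displacements relative to an identity latent code, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `latent_trajectory`, the toy latent size, and the random displacements standing in for the output of the recurrent network are all assumptions made for illustration. In the method described above, each resulting latent code would then be passed to the StyleGAN generator to render one frame.

```python
import random


def latent_trajectory(identity_latent, displacements):
    """Return one latent code per frame: the identity code plus that frame's displacement."""
    dim = len(identity_latent)
    trajectory = []
    for delta in displacements:
        if len(delta) != dim:
            raise ValueError("displacement and identity latent must have the same size")
        # Each frame is a point in latent space, offset from the identity code.
        trajectory.append([w + d for w, d in zip(identity_latent, delta)])
    return trajectory


random.seed(0)
LATENT_DIM = 8   # toy size, assumed for illustration only
NUM_FRAMES = 5   # hypothetical number of video frames

# Stand-in for the back-projection of the identity image into latent space.
w_id = [random.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
# Stand-in for the displacements a speech-conditioned recurrent network would predict.
deltas = [[random.gauss(0.0, 0.1) for _ in range(LATENT_DIM)] for _ in range(NUM_FRAMES)]

frames = latent_trajectory(w_id, deltas)
print(len(frames), len(frames[0]))
```

Predicting displacements rather than absolute latent codes means the same speech-driven motion can, in principle, be applied to the latent code of any chosen identity image.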
