StableFace: Analyzing and Improving Motion Stability for Talking Face Generation

Paper Title

StableFace: Analyzing and Improving Motion Stability for Talking Face Generation

Authors

Ling, Jun; Tan, Xu; Chen, Liyang; Li, Runnan; Zhang, Yuchao; Zhao, Sheng; Song, Li

Abstract

While previous speech-driven talking face generation methods have made significant progress in improving the visual quality and lip-sync quality of the synthesized videos, they pay less attention to lip motion jitters, which greatly undermine the realism of talking face videos. What causes motion jitters, and how can the problem be mitigated? In this paper, we conduct systematic analyses of the motion jittering problem based on a state-of-the-art pipeline that uses 3D face representations to bridge the input audio and output video, and improve the motion stability with a series of effective designs. We find that several issues can lead to jitters in synthesized talking face video: 1) jitters from the input 3D face representations; 2) training-inference mismatch; 3) lack of dependency modeling among video frames. Accordingly, we propose three effective solutions to address this issue: 1) we propose a Gaussian-based adaptive smoothing module to smooth the 3D face representations to eliminate jitters in the input; 2) we add augmented erosions on the input data of the neural renderer in training to simulate the distortion in inference to reduce mismatch; 3) we develop an audio-fused transformer generator to model dependency among video frames. Besides, considering that there is no off-the-shelf metric for measuring motion jitters in talking face video, we devise an objective metric (Motion Stability Index, MSI) to quantitatively measure the motion jitters by calculating the reciprocal of the variance of acceleration. Extensive experimental results show the superiority of our method on motion-stable face video generation, with better quality than previous systems.
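To make the smoothing and metric ideas in the abstract concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract only says that MSI is the reciprocal of the variance of acceleration and that the smoothing is Gaussian-based and adaptive. The function names `gaussian_smooth` and `motion_stability_index`, the fixed (non-adaptive) `sigma`, the finite-difference acceleration, and the averaging over coordinates are all hypothetical choices made for this example.

```python
import numpy as np


def gaussian_smooth(track, sigma=2.0, radius=6):
    """Smooth a (T, D) trajectory along time with a fixed Gaussian kernel.

    Assumption: a constant sigma is used here; the paper's module is
    described as adaptive, which this sketch does not reproduce.
    """
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()  # normalize so the overall scale is preserved
    # Repeat edge frames so the output has the same length as the input.
    padded = np.pad(track, ((radius, radius), (0, 0)), mode="edge")
    columns = [
        np.convolve(padded[:, d], kernel, mode="valid")
        for d in range(track.shape[1])
    ]
    return np.stack(columns, axis=1)


def motion_stability_index(track, eps=1e-12):
    """Reciprocal of the variance of acceleration for a (T, D) trajectory.

    Assumption: acceleration is the second finite difference over frames,
    and the per-coordinate reciprocals are averaged into one score.
    """
    accel = track[2:] - 2.0 * track[1:-1] + track[:-2]
    variance = accel.var(axis=0)
    return float(np.mean(1.0 / (variance + eps)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.linspace(0.0, 2.0 * np.pi, 200)
    clean = np.stack([np.sin(t), np.cos(t)], axis=1)  # smooth 2-D motion
    jittery = clean + 0.05 * rng.standard_normal(clean.shape)
    smoothed = gaussian_smooth(jittery)
    print("MSI jittery :", motion_stability_index(jittery))
    print("MSI smoothed:", motion_stability_index(smoothed))
```

In this sketch a higher MSI means steadier motion, so the smoothed trajectory should score higher than the jittery one. In practice `track` would hold per-frame facial landmark or 3D face coefficient values, which is again an assumption about how the metric is applied.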
