Paper Title
Expressive Talking Head Video Encoding in StyleGAN2 Latent-Space
Paper Authors
Paper Abstract
While the recent advances in research on video reenactment have yielded promising results, the approaches fall short in capturing the fine, detailed, and expressive facial features (e.g., lip-pressing, mouth puckering, mouth gaping, and wrinkles) which are crucial in generating realistic animated face videos. To this end, we propose an end-to-end expressive face video encoding approach that facilitates data-efficient, high-quality video re-synthesis by optimizing low-dimensional edits of a single Identity-latent. The approach builds on StyleGAN2 image inversion and multi-stage non-linear latent-space editing to generate videos that are nearly comparable to the input videos. While existing StyleGAN latent-based editing techniques focus on simply generating plausible edits of static images, we automate the latent-space editing to capture the fine expressive facial deformations in a sequence of frames, using an encoding that resides in the Style-latent-space (StyleSpace) of StyleGAN2. The encoding thus obtained can be superimposed on a single Identity-latent to facilitate reenactment of face videos at $1024^2$ resolution. The proposed framework economically captures face identity, head pose, and complex expressive facial motions at a fine level, thereby bypassing the training, person-specific modeling, dependence on landmarks/keypoints, and low-resolution synthesis that tend to hamper most reenactment approaches. The approach is designed for maximum data efficiency, where a single $W+$ latent and 35 parameters per frame enable high-fidelity video rendering. The pipeline can also be used for puppeteering (i.e., motion transfer).
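To make the abstract's data-efficiency claim concrete, below is a minimal Python sketch (not the authors' released code) of the encoding it describes: a single W+ identity latent stored once, plus roughly 35 scalar StyleSpace offsets per frame that are superimposed on the identity's per-layer style vectors before synthesis. The function names (render_video, synthesize), the per-layer affine_layers interface, and the (layer, channel) indices are assumptions made for illustration; the dummy stand-ins at the bottom exist only so the sketch runs without a real StyleGAN2 generator.

    # Minimal sketch of the data layout described in the abstract:
    # one inverted W+ identity latent, and ~35 scalars per frame that
    # offset selected StyleSpace channels. Interfaces are assumptions.
    import numpy as np

    NUM_LAYERS, W_DIM = 18, 512     # W+ latent for a 1024x1024 StyleGAN2
    PARAMS_PER_FRAME = 35           # per-frame expressive/pose parameters

    def render_video(identity_w_plus, frame_params, edit_channels,
                     affine_layers, synthesize):
        """Re-synthesize a video from one identity latent plus per-frame edits.

        identity_w_plus : (18, 512) array, the inverted identity W+ latent.
        frame_params    : (T, 35) array, one row of scalar offsets per frame.
        edit_channels   : 35 (layer, channel) pairs selected by the editing
                          stage (hypothetical indices here).
        affine_layers   : per-layer W -> style maps (stand-ins for the
                          generator's affine transforms into StyleSpace).
        synthesize      : callable mapping edited style vectors to an RGB
                          frame (stands in for StyleGAN2 synthesis).
        """
        frames = []
        for offsets in frame_params:
            # Map the single identity latent into StyleSpace, layer by layer.
            styles = [affine(w) for affine, w in zip(affine_layers, identity_w_plus)]
            # Superimpose this frame's 35 scalar edits on the identity styles.
            for (layer, channel), delta in zip(edit_channels, offsets):
                styles[layer][channel] += delta
            frames.append(synthesize(styles))
        return frames

    if __name__ == "__main__":
        # Dummy stand-ins so the sketch runs end to end without a real generator.
        rng = np.random.default_rng(0)
        identity = rng.standard_normal((NUM_LAYERS, W_DIM))
        params = 0.1 * rng.standard_normal((4, PARAMS_PER_FRAME))      # 4 frames
        channels = [(i % NUM_LAYERS, (i * 7) % W_DIM) for i in range(PARAMS_PER_FRAME)]
        affines = [lambda w: w.copy() for _ in range(NUM_LAYERS)]       # identity maps
        fake_synthesis = lambda styles: np.stack(styles).mean()         # placeholder "image"
        video = render_video(identity, params, channels, affines, fake_synthesis)
        print(len(video), "frames rendered (placeholder values)")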