Executing your Commands via Motion Diffusion in Latent Space

Paper Title

Executing your Commands via Motion Diffusion in Latent Space

Authors

Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, Jingyi Yu, Gang Yu

Abstract

We study a challenging task, conditional human motion generation, which produces plausible human motion sequences according to various conditional inputs, such as action classes or textual descriptors. Because human motions are highly diverse and follow a distribution quite different from that of conditional modalities such as natural-language textual descriptors, it is hard to learn a probabilistic mapping from the desired conditional modality to human motion sequences. Moreover, raw motion data from motion capture systems may be redundant across a sequence and contain noise. Directly modeling the joint distribution over raw motion sequences and conditional modalities would therefore require heavy computational overhead and might introduce artifacts caused by the captured noise. To learn a better representation of diverse human motion sequences, we first design a powerful Variational AutoEncoder (VAE) that yields a representative, low-dimensional latent code for each human motion sequence. Then, instead of using a diffusion model to connect raw motion sequences with the conditional inputs, we perform the diffusion process in the motion latent space. Our proposed Motion Latent-based Diffusion model (MLD) can produce vivid motion sequences that conform to the given conditional inputs while substantially reducing the computational overhead in both training and inference. Extensive experiments on various human motion generation tasks demonstrate that MLD achieves significant improvements over state-of-the-art methods, while being two orders of magnitude faster than previous diffusion models that operate on raw motion sequences.
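
The central design choice described above, running diffusion on a compact VAE latent instead of on raw motion frames, can be illustrated with a small toy sketch. This is not the authors' implementation. The linear beta schedule, the hypothetical `toy_encoder` (mean pooling plus a random projection standing in for a trained VAE encoder), and the sizes used in the demo are all assumptions chosen for illustration. The sketch shows the standard DDPM forward process applied to a latent vector and how a noise estimate maps back to a clean latent.

```python
"""Toy sketch of diffusion in a motion latent space (illustrative only).

Assumptions, not taken from the MLD paper: a linear beta schedule, a
HYPOTHETICAL `toy_encoder` in place of a trained VAE encoder, and made-up sizes.
"""
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule beta_1..beta_T (an assumed, common DDPM choice)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def toy_encoder(motion, latent_dim, rng):
    """HYPOTHETICAL stand-in for a trained VAE encoder: mean-pool the
    frames, then apply a random linear projection to `latent_dim` values."""
    num_feats = len(motion[0])
    pooled = [sum(frame[j] for frame in motion) / len(motion) for j in range(num_feats)]
    proj = [[rng.gauss(0.0, 1.0 / math.sqrt(num_feats)) for _ in range(num_feats)]
            for _ in range(latent_dim)]
    return [sum(w * x for w, x in zip(row, pooled)) for row in proj]


def q_sample(z0, t, abars, rng):
    """Forward diffusion on the latent: z_t = sqrt(abar_t) z0 + sqrt(1 - abar_t) eps."""
    a = abars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(z0, eps)]
    return zt, eps


def predict_z0(zt, eps_hat, t, abars):
    """Invert the forward formula given a noise estimate eps_hat
    (in practice eps_hat would come from a conditional denoiser)."""
    a = abars[t]
    return [(x - math.sqrt(1.0 - a) * e) / math.sqrt(a) for x, e in zip(zt, eps_hat)]


if __name__ == "__main__":
    rng = random.Random(0)
    frames, feats, latent_dim = 196, 263, 256  # illustrative sizes only
    motion = [[rng.gauss(0.0, 1.0) for _ in range(feats)] for _ in range(frames)]
    z0 = toy_encoder(motion, latent_dim, rng)
    abars = alpha_bars(linear_betas(1000))
    zt, eps = q_sample(z0, 500, abars, rng)
    # Passing the true noise stands in for a perfect denoiser.
    z0_hat = predict_z0(zt, eps, 500, abars)
    err = max(abs(a - b) for a, b in zip(z0, z0_hat))
    print("values per sample, raw motion:", frames * feats)
    print("values per sample, latent:", len(zt))
    print("max reconstruction error:", err)
```

In a real model, `eps_hat` would be predicted by a network conditioned on the text or action input. The demo passes the true noise, so `predict_z0` recovers the clean latent up to floating-point error. The denoiser in this setup only sees `latent_dim` values per sample rather than `frames * feats`, which conveys the intuition behind the computational savings; the actual speedup reported in the abstract depends on the networks involved, not on this count alone.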
