Paper Title

clip2latent: Text driven sampling of a pre-trained StyleGAN using denoising diffusion and CLIP

Authors

Pinkney, Justin N. M., Li, Chuan

Abstract

We introduce a new method to efficiently create text-to-image models from a pre-trained CLIP and StyleGAN. It enables text-driven sampling with an existing generative model without any external data or fine-tuning. This is achieved by training a diffusion model conditioned on CLIP embeddings to sample latent vectors of a pre-trained StyleGAN, which we call clip2latent. We leverage the alignment between CLIP's image and text embeddings to avoid the need for any text-labelled data for training the conditional diffusion model. We demonstrate that clip2latent allows us to generate high-resolution (1024x1024 pixels) images based on text prompts with fast sampling, high image quality, and low training compute and data requirements. We also show that the use of the well-studied StyleGAN architecture, without further fine-tuning, allows us to directly apply existing methods to control and modify the generated images, adding a further layer of control to our text-to-image pipeline.
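
The abstract outlines the sampling pipeline: a text prompt is embedded with CLIP, a denoising diffusion model conditioned on that embedding produces a StyleGAN latent, and the frozen StyleGAN generator decodes the latent into an image. The Python sketch below is only illustrative and is not the authors' released code: the CLIP calls follow the OpenAI CLIP package, while DiffusionPriorStub and StyleGANStub are hypothetical stand-ins for the trained diffusion prior and the pre-trained 1024x1024 generator.

    import torch
    import clip  # OpenAI CLIP: https://github.com/openai/CLIP

    device = "cuda" if torch.cuda.is_available() else "cpu"
    clip_model, _ = clip.load("ViT-B/32", device=device)


    class DiffusionPriorStub(torch.nn.Module):
        # Hypothetical stand-in for the trained denoising diffusion model
        # that maps a CLIP embedding to a 512-dim StyleGAN latent (w vector).
        def sample(self, cond: torch.Tensor) -> torch.Tensor:
            return torch.randn(cond.shape[0], 512, device=cond.device)


    class StyleGANStub(torch.nn.Module):
        # Hypothetical stand-in for the frozen, pre-trained StyleGAN generator.
        def synthesis(self, w: torch.Tensor) -> torch.Tensor:
            return torch.zeros(w.shape[0], 3, 1024, 1024, device=w.device)


    diffusion_prior = DiffusionPriorStub().to(device)
    stylegan = StyleGANStub().to(device)

    # 1. Embed the prompt with CLIP's text encoder. Training only requires
    #    CLIP *image* embeddings of StyleGAN samples; text prompts work at
    #    inference because CLIP's image and text embeddings share a space.
    tokens = clip.tokenize(["a photograph of a smiling person"]).to(device)
    with torch.no_grad():
        text_embed = clip_model.encode_text(tokens).float()
        text_embed = text_embed / text_embed.norm(dim=-1, keepdim=True)

        # 2. The diffusion model, conditioned on the embedding, denoises a
        #    random latent into a StyleGAN latent vector.
        w = diffusion_prior.sample(cond=text_embed)

        # 3. The frozen StyleGAN decodes the latent into a 1024x1024 image.
        image = stylegan.synthesis(w)  # shape: (1, 3, 1024, 1024)

Because the generator itself is left unchanged, standard StyleGAN latent-editing techniques (for example, interpolation or known edit directions in w space) still apply to the sampled latents, which is the extra layer of control the abstract refers to.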
