Paper Title

Prompt-to-Prompt Image Editing with Cross Attention Control

Authors

Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, Daniel Cohen-Or

Abstract

Recent large-scale text-driven synthesis models have attracted much attention thanks to their remarkable capabilities of generating highly diverse images that follow given text prompts. Such text-based synthesis methods are particularly appealing to humans, who are used to verbally describing their intent. Therefore, it is only natural to extend text-driven image synthesis to text-driven image editing. Editing is challenging for these generative models, since an innate property of an editing technique is to preserve most of the original image, while in text-based models even a small modification of the text prompt often leads to a completely different outcome. State-of-the-art methods mitigate this by requiring the users to provide a spatial mask to localize the edit, hence ignoring the original structure and content within the masked region. In this paper, we pursue an intuitive prompt-to-prompt editing framework, where the edits are controlled by text only. To this end, we analyze a text-conditioned model in depth and observe that the cross-attention layers are the key to controlling the relation between the spatial layout of the image and each word in the prompt. With this observation, we present several applications which monitor the image synthesis by editing the textual prompt only. This includes localized editing by replacing a word, global editing by adding a specification, and even delicately controlling the extent to which a word is reflected in the image. We present our results over diverse images and prompts, demonstrating high-quality synthesis and fidelity to the edited prompts.
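The core mechanism the abstract describes — reusing the cross-attention maps computed from the source prompt while generating with the edited prompt — can be illustrated with a toy sketch. This is a minimal, hypothetical numpy illustration of attention-map injection for a word-swap edit, not the authors' implementation; the function names, shapes, and the single-step setup are assumptions for demonstration (a real diffusion model would apply this across many denoising steps and attention layers).

```python
import numpy as np

def cross_attention(pixel_queries, token_keys, token_values):
    """Toy cross-attention: each spatial (pixel) query attends over prompt
    tokens. Returns the attended features and the attention map
    of shape (num_pixels, num_tokens)."""
    d = token_keys.shape[-1]
    scores = pixel_queries @ token_keys.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # softmax over tokens
    return attn @ token_values, attn

def prompt_to_prompt_step(pixel_queries, src_tokens, edit_tokens):
    """Word-swap edit sketch: compute the attention map from the *source*
    prompt, then inject it when attending over the *edited* prompt's
    token values, so the spatial layout tied to each word is preserved."""
    _, src_attn = cross_attention(pixel_queries, src_tokens, src_tokens)
    # Injection: source attention weights, edited prompt's values.
    edited_features = src_attn @ edit_tokens
    return edited_features, src_attn
```

Because the injected map fixes *where* each token attends, swapping one token's embedding changes *what* appears at those locations without disturbing the rest of the layout — which is the localized-editing behavior the abstract claims.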
