Paper Title

Character-Centric Story Visualization via Visual Planning and Token Alignment

Paper Authors

Hong Chen, Rujun Han, Te-Lin Wu, Hideki Nakayama, Nanyun Peng

Paper Abstract

Story visualization advances the traditional text-to-image generation by enabling multiple image generation based on a complete story. This task requires machines to 1) understand long text inputs and 2) produce a globally consistent image sequence that illustrates the contents of the story. A key challenge of consistent story visualization is to preserve the characters that are essential in stories. To tackle the challenge, we propose to adapt a recent work that augments Vector-Quantized Variational Autoencoders (VQ-VAE) with a text-to-visual-token (transformer) architecture. Specifically, we modify the text-to-visual-token module with a two-stage framework: 1) a character token planning model that predicts the visual tokens for characters only; 2) a visual token completion model that generates the remaining visual token sequence, which is sent to the VQ-VAE for finalizing image generation. To encourage characters to appear in the images, we further train the two-stage framework with a character-token alignment objective. Extensive experiments and evaluations demonstrate that the proposed method excels at preserving characters and can produce higher-quality image sequences compared with the strong baselines. Code can be found at https://github.com/sairin1202/VP-CSV
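
To make the two-stage flow described in the abstract concrete, below is a minimal sketch of the inference pipeline: a character token planning stage predicts character visual tokens from the story text, a completion stage fills in the remaining visual tokens, and the resulting token grid would be decoded by a pretrained VQ-VAE. This is not the authors' released VP-CSV code; the class names (CharacterTokenPlanner, VisualTokenCompleter), codebook size, token-grid length, and the simplistic pooling layers are illustrative assumptions standing in for the paper's transformer models.

```python
# Minimal sketch (assumptions, not the VP-CSV implementation) of the
# two-stage text-to-visual-token pipeline described in the abstract.
import torch
import torch.nn as nn

VOCAB_SIZE = 1024   # assumed VQ-VAE codebook size
SEQ_LEN = 256       # assumed visual tokens per image (e.g. a 16x16 grid)
D_MODEL = 128       # assumed hidden size

class CharacterTokenPlanner(nn.Module):
    """Stage 1 (sketch): predict visual tokens for character regions only."""
    def __init__(self):
        super().__init__()
        self.text_proj = nn.Linear(D_MODEL, D_MODEL)
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, text_emb):                      # (B, T, D) encoded story text
        h = self.text_proj(text_emb).mean(dim=1)      # pool the text context
        logits = self.head(h).unsqueeze(1).expand(-1, SEQ_LEN, -1)
        return logits.argmax(dim=-1)                  # (B, SEQ_LEN) character tokens

class VisualTokenCompleter(nn.Module):
    """Stage 2 (sketch): complete the remaining visual tokens around the plan."""
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, char_tokens, text_emb):         # (B, SEQ_LEN), (B, T, D)
        h = self.tok_emb(char_tokens) + text_emb.mean(dim=1, keepdim=True)
        return self.head(h).argmax(dim=-1)            # (B, SEQ_LEN) full token sequence

# Toy end-to-end pass: text embedding -> character plan -> completed tokens.
# In the paper, the completed token sequence is then decoded to an image
# by the VQ-VAE decoder (omitted here).
text_emb = torch.randn(2, 32, D_MODEL)                # stand-in for encoded story text
planner, completer = CharacterTokenPlanner(), VisualTokenCompleter()
char_tokens = planner(text_emb)
visual_tokens = completer(char_tokens, text_emb)
print(visual_tokens.shape)                            # torch.Size([2, 256])
```

In the paper, both stages are transformer-based and are additionally trained with a character-token alignment objective so that the planned character tokens are preserved in the completed sequence; the sketch above only illustrates the data flow between the two stages and the VQ-VAE token space.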
