Paper Title

Towards Multimodal Vision-Language Models Generating Non-Generic Text

Paper Authors

Wes Robbins, Zanyar Zohourianshahzadi, Jugal Kalita

Paper Abstract

Vision-language models can assess visual context in an image and generate descriptive text. While the generated text may be accurate and syntactically correct, it is often overly general. To address this, recent work has used optical character recognition to supplement visual information with text extracted from an image. In this work, we contend that vision-language models can benefit from additional information that can be extracted from an image, but is not used by current models. We modify previous multimodal frameworks to accept relevant information from any number of auxiliary classifiers. In particular, we focus on person names as an additional set of tokens and create a novel image-caption dataset to facilitate captioning with person names. The dataset, Politicians and Athletes in Captions (PAC), consists of captioned images of well-known people in context. By fine-tuning pretrained models with this dataset, we demonstrate a model that can naturally integrate facial recognition tokens into generated text by training on limited data. For the PAC dataset, we provide a discussion on collection and baseline benchmark scores.
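The abstract describes conditioning a caption generator on tokens from auxiliary classifiers (here, person names produced by face recognition), so that the decoder can weave those names into its output. As an illustration only, below is a minimal PyTorch sketch of that idea: the decoder attends to both projected visual features and embedded auxiliary name tokens. The class name AuxTokenCaptioner, the feature dimensions, and the architecture are hypothetical assumptions for this sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AuxTokenCaptioner(nn.Module):
    """Toy caption decoder that attends to both visual features and
    auxiliary classifier tokens (e.g., person names from face recognition).
    Hypothetical sketch; not the paper's actual model."""

    def __init__(self, vocab_size: int, d_model: int = 256,
                 n_heads: int = 4, n_layers: int = 2, feat_dim: int = 2048):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.visual_proj = nn.Linear(feat_dim, d_model)  # project image features
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, visual_feats, aux_token_ids, caption_ids):
        # Decoder memory = projected visual features concatenated with the
        # embedded auxiliary tokens, so generation can attend to name tokens.
        memory = torch.cat(
            [self.visual_proj(visual_feats), self.token_emb(aux_token_ids)],
            dim=1,
        )
        tgt = self.token_emb(caption_ids)
        # A causal mask on tgt would be needed for real training; omitted here.
        out = self.decoder(tgt, memory)
        return self.lm_head(out)

# Smoke test with random tensors: 1 image, 5 region features, 2 name tokens.
model = AuxTokenCaptioner(vocab_size=1000)
visual = torch.randn(1, 5, 2048)          # e.g., detector region features
aux = torch.randint(0, 1000, (1, 2))      # e.g., token IDs for a person's name
caption = torch.randint(0, 1000, (1, 7))  # caption tokens generated so far
print(model(visual, aux, caption).shape)  # torch.Size([1, 7, 1000])
```

The design point this sketch captures is that the auxiliary tokens enter as extra memory entries rather than as a change to the decoder itself, which is why, as the abstract notes, a pretrained model can be fine-tuned to use them with limited data.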
