Paper Title
Benchmarking Spatial Relationships in Text-to-Image Generation
Paper Authors
Paper Abstract
Spatial understanding is a fundamental aspect of computer vision and integral for human-level reasoning about images, making it an important component for grounded language understanding. While recent text-to-image synthesis (T2I) models have shown unprecedented improvements in photorealism, it is unclear whether they have reliable spatial understanding capabilities. We investigate the ability of T2I models to generate correct spatial relationships among objects and present VISOR, an evaluation metric that captures how accurately the spatial relationship described in text is generated in the image. To benchmark existing models, we introduce a dataset, $\mathrm{SR}_{2D}$, that contains sentences describing two or more objects and the spatial relationships between them. We construct an automated evaluation pipeline to recognize objects and their spatial relationships, and employ it in a large-scale evaluation of T2I models. Our experiments reveal a surprising finding that, although state-of-the-art T2I models exhibit high image quality, they are severely limited in their ability to generate multiple objects or the specified spatial relations between them. Our analyses demonstrate several biases and artifacts of T2I models such as the difficulty with generating multiple objects, a bias towards generating the first object mentioned, spatially inconsistent outputs for equivalent relationships, and a correlation between object co-occurrence and spatial understanding capabilities. We conduct a human study that shows the alignment between VISOR and human judgement about spatial understanding. We offer the $\mathrm{SR}_{2D}$ dataset and the VISOR metric to the community in support of T2I reasoning research.
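The abstract describes an automated pipeline that detects objects and checks whether the spatial relationship stated in the prompt holds in the generated image. A minimal sketch of this idea is below, assuming bounding boxes have already been obtained from an object detector and using centroid comparison to decide 2D relations; the function names and the aggregation rule are illustrative assumptions, not the paper's exact implementation.

```python
def centroid(box):
    # box = (x_min, y_min, x_max, y_max) in image pixel coordinates
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

def relation_holds(box_a, box_b, relation):
    """Check whether object A stands in the stated 2D relation to object B.
    Centroids are compared (a common convention; the paper's exact rule
    may differ). Image y-coordinates grow downward."""
    (ax, ay), (bx, by) = centroid(box_a), centroid(box_b)
    if relation == "left of":
        return ax < bx
    if relation == "right of":
        return ax > bx
    if relation == "above":
        return ay < by
    if relation == "below":
        return ay > by
    raise ValueError(f"unknown relation: {relation}")

def visor_style_score(samples):
    """Illustrative aggregate: fraction of generated images in which both
    objects were detected AND the described relation holds. Each sample is
    (box_a, box_b, relation); a missing detection is None."""
    correct = 0
    for box_a, box_b, relation in samples:
        if box_a is not None and box_b is not None \
                and relation_holds(box_a, box_b, relation):
            correct += 1
    return correct / len(samples)
```

For example, a prompt "a dog to the left of a chair" scores as correct only if both a dog box and a chair box are found and the dog's centroid lies to the left of the chair's, which mirrors the abstract's point that the metric jointly penalizes missing objects and wrong relations.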