Paper Title

Reassessing Evaluation Practices in Visual Question Answering: A Case Study on Out-of-Distribution Generalization

Authors

Aishwarya Agrawal, Ivana Kajić, Emanuele Bugliarello, Elnaz Davoodi, Anita Gergely, Phil Blunsom, Aida Nematzadeh

Abstract

Vision-and-language (V&L) models pretrained on large-scale multimodal data have demonstrated strong performance on various tasks such as image captioning and visual question answering (VQA). The quality of such models is commonly assessed by measuring their performance on unseen data that typically comes from the same distribution as the training data. However, when evaluated under out-of-distribution (out-of-dataset) settings for VQA, we observe that these models exhibit poor generalization. We comprehensively evaluate two pretrained V&L models under different settings (i.e. classification and open-ended text generation) by conducting cross-dataset evaluations. We find that these models tend to learn to solve the benchmark, rather than learning the high-level skills required by the VQA task. We also find that in most cases generative models are less susceptible to shifts in data distribution compared to discriminative ones, and that multimodal pretraining is generally helpful for OOD generalization. Finally, we revisit assumptions underlying the use of automatic VQA evaluation metrics, and empirically show that their stringent nature repeatedly penalizes models for correct responses.
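
The stringency of the automatic VQA metrics discussed in the abstract can be illustrated with the standard VQA accuracy measure, which scores a prediction by exact string match against the ten human-annotated answers for each question. Below is a minimal sketch of that scoring rule; the function name and example values are illustrative only, and the official evaluation additionally normalizes answers and averages over annotator subsets.

```python
# Minimal sketch of the standard VQA accuracy metric:
# a prediction is matched exactly against 10 human answers,
# and scored as Acc = min(#matches / 3, 1).
# Names and example values here are illustrative, not from the paper's code.

def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Score one prediction against the human reference answers for a question."""
    pred = predicted.strip().lower()
    matches = sum(1 for ans in human_answers if ans.strip().lower() == pred)
    return min(matches / 3.0, 1.0)

# Exact matching is stringent: a semantically correct but differently phrased
# answer receives no credit.
print(vqa_accuracy("2", ["2"] * 8 + ["two"] * 2))           # 1.0
print(vqa_accuracy("two people", ["2"] * 8 + ["two"] * 2))  # 0.0
```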
