Paper Title

Delving Deeper into Cross-lingual Visual Question Answering

Paper Authors

Chen Liu, Jonas Pfeiffer, Anna Korhonen, Ivan Vulić, Iryna Gurevych

Abstract

Visual question answering (VQA) is one of the crucial vision-and-language tasks. Yet, existing VQA research has mostly focused on the English language, due to a lack of suitable evaluation resources. Previous work on cross-lingual VQA has reported poor zero-shot transfer performance of current multilingual multimodal Transformers with large gaps to monolingual performance, without any deeper analysis. In this work, we delve deeper into the different aspects of cross-lingual VQA, aiming to understand the impact of 1) modeling methods and choices, including architecture, inductive bias, fine-tuning; 2) learning biases: including question types and modality biases in cross-lingual setups. The key results of our analysis are: 1) We show that simple modifications to the standard training setup can substantially reduce the transfer gap to monolingual English performance, yielding +10 accuracy points over existing methods. 2) We analyze cross-lingual VQA across different question types of varying complexity for different multilingual multimodal Transformers, and identify question types that are the most difficult to improve on. 3) We provide an analysis of modality biases present in training data and models, revealing why zero-shot performance gaps remain for certain question types and languages.