Paper Title
On Modality Bias in the TVQA Dataset
Paper Authors
Paper Abstract
TVQA is a large-scale video question answering (video-QA) dataset based on popular TV shows. The questions were specifically designed to require "both vision and language understanding to answer". In this work, we demonstrate an inherent bias in the dataset towards the textual subtitle modality. We infer this bias both directly and indirectly, notably finding that models trained with subtitles learn, on average, to suppress video feature contribution. Our results demonstrate that models trained on only the visual information can answer ~45% of the questions, while using only the subtitles achieves ~68%. We find that a bilinear-pooling-based joint representation of modalities damages model performance by 9%, implying a reliance on modality-specific information. We also show that TVQA fails to benefit from the RUBi modality bias reduction technique popularised in VQA. By simply improving text processing with BERT embeddings in the simple model first proposed for TVQA, we achieve state-of-the-art results (72.13%) compared to the highly complex STAGE model (70.50%). We recommend a multimodal evaluation framework that can highlight biases in models and isolate visual- and textual-reliant subsets of data. Using this framework, we propose subsets of TVQA that respond exclusively to either or both modalities, in order to facilitate the multimodal modelling that TVQA originally intended.
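The evaluation framework summarised above rests on comparing per-question correctness of unimodal baselines. The sketch below is a hypothetical illustration of that idea, not the authors' released code: the question IDs and the `video_correct` / `subtitle_correct` dictionaries are assumed inputs produced by a video-only and a subtitle-only model.

```python
# Illustrative sketch: partition questions into modality-reliant subsets
# based on whether a video-only and/or a subtitle-only baseline answered
# them correctly. All inputs here are hypothetical.

def split_by_modality(video_correct, subtitle_correct):
    """Return subsets of question IDs grouped by which unimodal model solves them.

    video_correct / subtitle_correct: dict mapping question ID -> bool,
    True if the corresponding unimodal model answered it correctly.
    """
    subsets = {"visual_only": [], "textual_only": [], "both": [], "neither": []}
    for qid in video_correct:
        v, s = video_correct[qid], subtitle_correct[qid]
        if v and s:
            subsets["both"].append(qid)
        elif v:
            subsets["visual_only"].append(qid)
        elif s:
            subsets["textual_only"].append(qid)
        else:
            subsets["neither"].append(qid)
    return subsets


if __name__ == "__main__":
    # Toy example with made-up results for four questions.
    video_correct = {"q1": True, "q2": False, "q3": True, "q4": False}
    subtitle_correct = {"q1": True, "q2": True, "q3": False, "q4": False}
    for name, qids in split_by_modality(video_correct, subtitle_correct).items():
        print(f"{name}: {qids}")
```

Under this view, questions landing in "visual_only" or "textual_only" form the modality-exclusive subsets the abstract refers to, while "both" captures questions answerable from either modality alone.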