Paper Title
Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly
Paper Authors
Paper Abstract
Machine learning has advanced dramatically, narrowing the accuracy gap to humans in multimodal tasks like visual question answering (VQA). However, while humans can say "I don't know" when they are uncertain (i.e., abstain from answering a question), such ability has been largely neglected in multimodal research, despite the importance of this problem to the usage of VQA in real settings. In this work, we promote a problem formulation for reliable VQA, where we prefer abstention over providing an incorrect answer. We first enable abstention capabilities for several VQA models, and analyze both their coverage, the portion of questions answered, and risk, the error on that portion. For that, we explore several abstention approaches. We find that although the best performing models achieve over 70% accuracy on the VQA v2 dataset, introducing the option to abstain by directly using a model's softmax scores limits them to answering less than 7.5% of the questions to achieve a low risk of error (i.e., 1%). This motivates us to utilize a multimodal selection function to directly estimate the correctness of the predicted answers, which we show can increase the coverage by, for example, 2.3x from 6.8% to 15.6% at 1% risk. While it is important to analyze both coverage and risk, these metrics have a trade-off which makes comparing VQA models challenging. To address this, we also propose an Effective Reliability metric for VQA that places a larger cost on incorrect answers compared to abstentions. This new problem formulation, metric, and analysis for VQA provide the groundwork for building effective and reliable VQA models that have the self-awareness to abstain if and only if they don't know the answer.
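The abstract's core quantities can be made concrete in code. Below is a minimal sketch of threshold-based abstention using a model's confidence scores, along with the coverage, risk, and Effective Reliability computations described above. The function names and the exact scoring (reward 1 for a correct answer, -cost for an incorrect one, 0 for an abstention) are illustrative assumptions; the paper's metric may weight per-question accuracy differently.

```python
import numpy as np

def coverage_risk(confidences, correct, threshold):
    """Coverage: fraction of questions answered (confidence >= threshold).
    Risk: error rate on the answered portion only."""
    answered = confidences >= threshold
    coverage = answered.mean()
    # If nothing is answered, risk is defined as 0 by convention here.
    risk = 1.0 - correct[answered].mean() if answered.any() else 0.0
    return coverage, risk

def effective_reliability(confidences, correct, threshold, cost=1.0):
    """Illustrative Effective Reliability: +1 for a correct answer,
    -cost for an incorrect answer, 0 for an abstention,
    averaged over ALL questions (answered or not)."""
    answered = confidences >= threshold
    score = np.where(answered, np.where(correct, 1.0, -cost), 0.0)
    return score.mean()

# Toy example: 4 questions, model abstains on the low-confidence one.
conf = np.array([0.9, 0.8, 0.3, 0.6])
corr = np.array([True, False, True, True])
cov, risk = coverage_risk(conf, corr, threshold=0.5)
er = effective_reliability(conf, corr, threshold=0.5, cost=1.0)
```

Sweeping the threshold traces out the coverage-risk trade-off the abstract describes: a higher threshold lowers risk but also lowers coverage, which is why a single scalar like Effective Reliability is useful for comparing models.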