Paper Title
Generating Natural Questions from Images for Multimodal Assistants
Paper Authors
Paper Abstract
Generating natural, diverse, and meaningful questions from images is an essential task for multimodal assistants, as it confirms whether they have properly understood the objects and scenes in an image. Research in visual question answering (VQA) and visual question generation (VQG) is a step in this direction. However, this research does not capture the questions that a visually-abled person would ask a multimodal assistant. Recently published datasets such as KB-VQA, FVQA, and OK-VQA try to collect questions that require external knowledge, which makes them appropriate for multimodal assistants. However, they still contain many obvious and common-sense questions that humans would not usually ask a digital assistant. In this paper, we provide a new benchmark dataset containing questions written by human annotators keeping in mind what they would ask multimodal digital assistants. Large-scale annotation of several hundred thousand images is expensive and time-consuming, so we also present an effective way of automatically generating questions from unseen images. Specifically, we present an approach for generating diverse and meaningful questions that considers both image content and image metadata (e.g., location, associated keywords). We evaluate our approach using standard evaluation metrics such as BLEU, METEOR, ROUGE, and CIDEr to show the relevance of the generated questions to human-provided questions. We also measure the diversity of the generated questions using generative strength and inventiveness metrics. We report new state-of-the-art results on both the public and our own datasets.
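For readers who want a concrete picture of this kind of evaluation, the sketch below shows how the relevance and diversity scores might be computed. It is a minimal sketch, not the authors' released code: it uses NLTK's sentence-level BLEU as a stand-in for the full metric suite (METEOR, ROUGE, and CIDEr would come from their own reference implementations), and the `generative_strength` / `inventiveness` functions follow one common definition from the VQG literature (average number of unique questions per image, and the fraction of those never seen in training). The data layout and function names are illustrative assumptions.

```python
# Minimal evaluation sketch (assumed data layout):
#   references / hypotheses: {image_id: [question, ...]}
#   training_questions: flat list of questions seen during training
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def bleu4(references, hypotheses):
    """Average sentence-level BLEU-4 of generated vs. human questions."""
    smooth = SmoothingFunction().method1
    scores = []
    for image_id, generated in hypotheses.items():
        refs = [r.lower().split() for r in references[image_id]]
        for question in generated:
            scores.append(sentence_bleu(refs, question.lower().split(),
                                        smoothing_function=smooth))
    return sum(scores) / len(scores)


def generative_strength(hypotheses):
    """Average number of unique questions generated per image."""
    return sum(len(set(qs)) for qs in hypotheses.values()) / len(hypotheses)


def inventiveness(hypotheses, training_questions):
    """Fraction of unique generated questions not seen during training."""
    seen = {q.lower() for q in training_questions}
    unique = {q.lower() for qs in hypotheses.values() for q in qs}
    return len(unique - seen) / len(unique)


if __name__ == "__main__":
    # Toy example with a single image; real evaluation iterates over the dataset.
    refs = {"img1": ["what breed is this dog", "how old is this dog"]}
    hyps = {"img1": ["what breed is this dog", "where can i adopt this dog"]}
    train = ["what breed is this dog"]
    print(bleu4(refs, hyps), generative_strength(hyps), inventiveness(hyps, train))
```

In practice the corpus-level scores would be computed with the standard COCO-caption evaluation tooling so that BLEU, METEOR, ROUGE, and CIDEr are directly comparable to prior VQG work; the diversity metrics above only need the raw generated questions and the training question pool.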