Title

Spatial Attention as an Interface for Image Captioning Models

Author

Sadler, Philipp

Abstract

The internal workings of modern deep learning models often remain unclear to an external observer, even when spatial attention mechanisms are involved. The idea of this work is to translate these spatial attentions into natural language to provide simpler access to the model's function. To this end, I took a neural image captioning model and measured its reactions to external modification of its spatial attention under three different interface methods: a fixation over the whole generation process, a fixation for the first time-steps, and an addition to the generator's attention. The experimental results for bounding-box-based spatial attention vectors show that the captioning model reacts to method-dependent changes in up to 52.65% of cases and, in 9.00% of cases, includes object categories that were otherwise unmentioned. Afterwards, I established such a link to a hierarchical co-attention network for visual question answering by extracting its word-, phrase-, and question-level spatial attentions. Here, captions generated for the word level included details of the question-answer pairs in up to 55.20% of cases. This work indicates that spatial attention, seen as an external interface for image caption generators, is a useful means of accessing visual functions through natural language.
