Paper Title

ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction

Authors

Samyak Jain, Pradeep Yarlagadda, Shreyank Jyoti, Shyamgopal Karthik, Ramanathan Subramanian, Vineet Gandhi

Abstract

We propose the ViNet architecture for audio-visual saliency prediction. ViNet is a fully convolutional encoder-decoder architecture. The encoder uses visual features from a network trained for action recognition, and the decoder infers a saliency map via trilinear interpolation and 3D convolutions, combining features from multiple hierarchies. The overall architecture of ViNet is conceptually simple; it is causal and runs in real-time (60 fps). ViNet does not use audio as input and still outperforms the state-of-the-art audio-visual saliency prediction models on nine different datasets (three visual-only and six audio-visual datasets). ViNet also surpasses human performance on the CC, SIM and AUC metrics for the AVE dataset, and to our knowledge, it is the first network to do so. We also explore a variation of ViNet architecture by augmenting audio features into the decoder. To our surprise, upon sufficient training, the network becomes agnostic to the input audio and provides the same output irrespective of the input. Interestingly, we also observe similar behaviour in the previous state-of-the-art models \cite{tsiami2020stavis} for audio-visual saliency prediction. Our findings contrast with previous works on deep learning-based audio-visual saliency prediction, suggesting a clear avenue for future explorations incorporating audio in a more effective manner. The code and pre-trained models are available at https://github.com/samyak0210/ViNet.
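The abstract describes the decoder as upsampling features via trilinear interpolation and 3D convolutions while combining features from multiple encoder hierarchies. The following is a minimal sketch of what one such decoder stage could look like, written in PyTorch under our own assumptions; the module name, channel sizes, and fusion-by-concatenation choice are illustrative, not taken from the authors' released code.

```python
# Hypothetical sketch of a ViNet-style decoder stage (not the authors' code):
# trilinearly upsample deep 3D features, then fuse them with a
# higher-resolution encoder feature map via a 3D convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecoderStage(nn.Module):
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        # 3D conv fuses the upsampled features with the encoder skip features
        self.conv = nn.Conv3d(in_ch + skip_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        # Trilinear interpolation brings x to the skip connection's
        # spatio-temporal resolution
        x = F.interpolate(x, size=skip.shape[2:], mode="trilinear",
                          align_corners=False)
        x = torch.cat([x, skip], dim=1)  # combine multi-hierarchy features
        return torch.relu(self.conv(x))


# Example: deep features (8 frames, 14x14) fused with shallower ones
# (16 frames, 28x28); all shapes here are illustrative.
deep = torch.randn(1, 64, 8, 14, 14)
skip = torch.randn(1, 32, 16, 28, 28)
out = DecoderStage(64, 32, 32)(deep, skip)
print(tuple(out.shape))
```

Stacking a few such stages until the output reaches frame resolution, followed by a final convolution to a single channel, would yield the per-frame saliency map; since each stage only looks at current and past features within its temporal window, the design can remain causal as the abstract states.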
