Title
MAViC: Multimodal Active Learning for Video Captioning
Authors
Abstract
A large number of annotated video-caption pairs are required for training video captioning models, resulting in high annotation costs. Active learning can be instrumental in reducing these annotation requirements. However, active learning for video captioning is challenging because multiple semantically similar captions are valid for a video, resulting in high-entropy outputs even for less-informative samples. Moreover, video captioning algorithms are multimodal in nature, with a visual encoder and a language decoder. Further, the sequential and combinatorial nature of the output makes the problem even more challenging. In this paper, we introduce MAViC, which leverages our proposed Multimodal Semantics Aware Sequential Entropy (M-SASE) based acquisition function to address the challenges of active learning approaches for video captioning. Our approach integrates semantic similarity and uncertainty along both the visual and language dimensions in the acquisition function. Our detailed experiments empirically demonstrate the efficacy of M-SASE for active learning for video captioning, improving on the baselines by a large margin.
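To make the core idea concrete, the following is a minimal, hypothetical sketch (not the authors' exact M-SASE formulation) of a semantics-aware entropy score: candidate captions sampled for a video are greedily merged into groups of semantically similar captions, and entropy is computed over group probabilities rather than raw captions, so paraphrases of the same caption do not inflate the uncertainty estimate. The `jaccard` word-overlap function below is a toy stand-in for a learned semantic-similarity model, and the threshold value is illustrative.

```python
# Hypothetical sketch of a semantics-aware entropy acquisition score.
# Assumption: paraphrases should share probability mass, so entropy is
# taken over semantic groups, not individual sampled captions.
import math
from typing import Callable

def semantics_aware_entropy(
    captions: list[str],
    probs: list[float],
    similarity: Callable[[str, str], float],
    threshold: float = 0.8,
) -> float:
    """Greedily group captions whose similarity to a group representative
    exceeds the threshold, then return the entropy of group masses."""
    groups: list[tuple[str, float]] = []  # (representative caption, mass)
    for cap, p in zip(captions, probs):
        for i, (rep, mass) in enumerate(groups):
            if similarity(cap, rep) >= threshold:
                groups[i] = (rep, mass + p)
                break
        else:
            groups.append((cap, p))
    total = sum(m for _, m in groups)
    return -sum((m / total) * math.log(m / total) for _, m in groups if m > 0)

def jaccard(a: str, b: str) -> float:
    """Toy similarity: word-set overlap (stand-in for a semantic model)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

# Two paraphrases plus one distinct caption: the paraphrases merge into
# one group, so the score is lower than naive caption-level entropy.
caps = ["a man rides a bike", "a man rides a bicycle bike", "a dog runs"]
probs = [0.5, 0.3, 0.2]
score = semantics_aware_entropy(caps, probs, jaccard, threshold=0.6)
naive = -sum(p * math.log(p) for p in probs)
```

A full acquisition function along these lines would combine such a language-side uncertainty with a visual-side uncertainty term, as the abstract describes, but the weighting and the exact similarity model are details of the paper rather than of this sketch.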