Paper Title

MRTNet: Multi-Resolution Temporal Network for Video Sentence Grounding

Paper Authors

Wei Ji, Long Chen, Yinwei Wei, Yiming Wu, Tat-Seng Chua

Paper Abstract

Given an untrimmed video and a natural language query, video sentence grounding aims to localize the target temporal moment in the video. Existing methods mainly tackle this task by matching and aligning the semantics of the descriptive sentence and video segments at a single temporal resolution, neglecting the temporal consistency of video content across different resolutions. In this work, we propose a novel multi-resolution temporal video sentence grounding network, MRTNet, which consists of a multi-modal feature encoder, a Multi-Resolution Temporal (MRT) module, and a predictor module. The MRT module is an encoder-decoder network whose decoder output features are combined with Transformers to predict the final start and end timestamps. Notably, the MRT module is hot-pluggable: it can be seamlessly incorporated into any anchor-free model. In addition, we use a hybrid loss to supervise the cross-modal features in the MRT module at three scales — frame level, clip level, and sequence level — for more accurate grounding. Extensive experiments on three prevalent datasets show the effectiveness of MRTNet.
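As a hedged illustration (this is not the authors' code; the pooling strides, nearest-neighbour upsampling, and mean fusion are all assumptions), the core multi-resolution idea — encoding video frame features at several temporal scales and fusing them back to the original length — can be sketched in NumPy:

```python
import numpy as np

def avg_pool_1d(x, stride):
    """Average-pool frame features of shape (T, D) along time by `stride`."""
    T, D = x.shape
    T_out = T // stride
    return x[:T_out * stride].reshape(T_out, stride, D).mean(axis=1)

def upsample_1d(x, T_out):
    """Nearest-neighbour upsample features (T, D) back to T_out frames."""
    idx = (np.arange(T_out) * x.shape[0] // T_out).clip(max=x.shape[0] - 1)
    return x[idx]

def multi_resolution_fuse(frames, strides=(1, 2, 4)):
    """Toy stand-in for an MRT-style encoder-decoder: build coarser temporal
    resolutions by pooling, bring each back to full length, and fuse by mean."""
    T = frames.shape[0]
    pooled = [upsample_1d(avg_pool_1d(frames, s), T) for s in strides]
    return np.mean(pooled, axis=0)

feats = np.random.rand(16, 8)   # 16 frames, 8-dim features (hypothetical sizes)
fused = multi_resolution_fuse(feats)
print(fused.shape)              # (16, 8)
```

In the actual model, the fused multi-resolution features would feed a Transformer-based predictor that regresses the start and end timestamps; this sketch only shows why coarser resolutions can smooth over frame-level noise while the finest resolution preserves boundary detail.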
