Paper Title


AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization

Paper Authors

Tanvir Mahmud, Diana Marculescu

Paper Abstract

An audio-visual event (AVE) is denoted by the correspondence of the visual and auditory signals in a video segment. Precise localization of AVEs is very challenging since it demands effective multi-modal feature correspondence to ground the short- and long-range temporal interactions. Existing approaches struggle to capture the different scales of multi-modal interaction due to ineffective multi-modal training strategies. To overcome this limitation, we introduce AVE-CLIP, a novel framework that integrates AudioCLIP, pre-trained on large-scale audio-visual data, with a multi-window temporal transformer to operate effectively on different temporal scales of video frames. Our contributions are three-fold: (1) We introduce a multi-stage training framework to incorporate AudioCLIP, pre-trained on audio-image pairs, into the AVE localization task on video frames through contrastive fine-tuning, effective mean video feature extraction, and multi-scale training phases. (2) We propose a multi-domain attention mechanism that operates on both temporal and feature domains over varying timescales to fuse local and global feature variations. (3) We introduce a temporal refining scheme with event-guided attention, followed by a simple yet effective post-processing step, to handle significant variations of the background over diverse events. Our method achieves state-of-the-art performance on the publicly available AVE dataset with a 5.9% improvement in mean accuracy, demonstrating its superiority over existing approaches.
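The core idea of the multi-window temporal transformer is to apply attention within temporal windows of several sizes so that both local and global interactions are captured. The minimal PyTorch sketch below illustrates this idea only; the window sizes (2, 4, 10), feature dimension, and average-based fusion are assumptions made for demonstration, not the paper's published implementation.

```python
# Illustrative sketch only: the paper does not release this exact code.
# Window sizes, dimensions, and averaging fusion are assumptions; only the
# idea of attending over multiple temporal window sizes comes from the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiWindowTemporalAttention(nn.Module):
    """Self-attention within sliding temporal windows of several sizes,
    fused to mix local and global context across video segments."""

    def __init__(self, dim: int = 256, num_heads: int = 4,
                 window_sizes: tuple = (2, 4, 10)):  # assumed window sizes
        super().__init__()
        self.window_sizes = window_sizes
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in window_sizes
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, dim) fused audio-visual features, one vector per segment
        B, T, D = x.shape
        outputs = []
        for w, attn in zip(self.window_sizes, self.attn):
            w = min(w, T)
            pad = (w - T % w) % w                      # pad so T splits evenly
            xp = F.pad(x, (0, 0, 0, pad))
            chunks = xp.view(-1, w, D)                 # (B * n_windows, w, D)
            out, _ = attn(chunks, chunks, chunks)      # attention within window
            outputs.append(out.reshape(B, -1, D)[:, :T])
        # Fuse scales by averaging (an assumption; any learned fusion works)
        fused = torch.stack(outputs).mean(dim=0)
        return self.norm(x + fused)                    # residual connection


if __name__ == "__main__":
    feats = torch.randn(2, 10, 256)  # e.g. 10 one-second AVE segments
    print(MultiWindowTemporalAttention()(feats).shape)  # torch.Size([2, 10, 256])
```

With a window of 2 the attention grounds short-range audio-visual correspondence, while the window spanning all 10 segments captures long-range interactions, matching the local/global fusion described in contribution (2).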
