Paper Title
Audio-text Retrieval in Context
Paper Authors
Paper Abstract
Audio-text retrieval based on natural language descriptions is a challenging task. It involves learning cross-modality alignments between long sequences under inadequate data conditions. In this work, we investigate several audio features as well as sequence aggregation methods for better audio-text alignment. Moreover, through a qualitative analysis, we observe that semantic mapping is more important than temporal relations in contextual retrieval. Using pre-trained audio features and a descriptor-based aggregation method, we build our contextual audio-text retrieval system. Specifically, we utilize PANNs features pre-trained on a large sound-event dataset and NetRVLAD pooling, which works directly with averaged descriptors. Experiments are conducted on the AudioCaps and CLOTHO datasets, and the results are compared with the previous state-of-the-art system. With our proposed system, significant improvements are achieved on bidirectional audio-text retrieval across all metrics, including recall, median rank, and mean rank.
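The NetRVLAD pooling mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the paper's implementation: it assumes a T×D descriptor sequence, K learnable soft-assignment clusters (parameters `W`, `b` here are illustrative names), and the defining property that, unlike NetVLAD, no cluster centers are subtracted, so the pooling aggregates softly averaged descriptors directly.

```python
import numpy as np

def netrvlad(X, W, b):
    """Sketch of NetRVLAD pooling (illustrative, not the paper's code).

    X: (T, D) sequence of frame-level descriptors.
    W: (D, K), b: (K,) parameters of the learnable soft assignment.
    Returns an L2-normalized vector of length K * D.
    """
    logits = X @ W + b                                      # (T, K) assignment scores
    a = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
    a /= a.sum(axis=1, keepdims=True)                       # soft cluster assignments
    # Key difference from NetVLAD: aggregate the descriptors themselves,
    # without subtracting cluster centers.
    V = a.T @ X                                             # (K, D) weighted descriptor sums
    V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12   # intra-normalization per cluster
    v = V.ravel()
    return v / (np.linalg.norm(v) + 1e-12)                  # final L2-normalized embedding
```

In a full retrieval system, such a pooled audio embedding would be matched against a sentence embedding in a shared space; here only the aggregation step is sketched.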