Paper Title
Audio-text Retrieval in Context
Paper Authors
Paper Abstract
Audio-text retrieval based on natural language descriptions is a challenging task. It involves learning cross-modality alignments between long sequences under inadequate data conditions. In this work, we investigate several audio features as well as sequence aggregation methods for better audio-text alignment. Moreover, through a qualitative analysis, we observe that semantic mapping is more important than temporal relations in contextual retrieval. Using pre-trained audio features and a descriptor-based aggregation method, we build our contextual audio-text retrieval system. Specifically, we utilize PANNs features pre-trained on a large sound-event dataset and NetRVLAD pooling, which works directly with averaged descriptors. Experiments are conducted on the AudioCaps and CLOTHO datasets, and the results are compared with the previous state-of-the-art system. With our proposed system, significant improvements are achieved on bidirectional audio-text retrieval across all metrics, including recall, median rank, and mean rank.
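The NetRVLAD pooling mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the paper's implementation: it assumes a T×D descriptor sequence, K learnable soft-assignment clusters (parameters `W`, `b` here are illustrative names), and the defining property that, unlike NetVLAD, no cluster centers are subtracted, so the pooling aggregates softly averaged descriptors directly.

```python
import numpy as np

def netrvlad(X, W, b):
    """Sketch of NetRVLAD pooling (illustrative, not the paper's code).

    X: (T, D) sequence of frame-level descriptors.
    W: (D, K), b: (K,) parameters of the learnable soft assignment.
    Returns an L2-normalized vector of length K * D.
    """
    logits = X @ W + b                                      # (T, K) assignment scores
    a = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
    a /= a.sum(axis=1, keepdims=True)                       # soft cluster assignments
    # Key difference from NetVLAD: aggregate the descriptors themselves,
    # without subtracting cluster centers.
    V = a.T @ X                                             # (K, D) weighted descriptor sums
    V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12   # intra-normalization per cluster
    v = V.ravel()
    return v / (np.linalg.norm(v) + 1e-12)                  # final L2-normalized embedding
```

In a full retrieval system, such a pooled audio embedding would be matched against a sentence embedding in a shared space; here only the aggregation step is sketched.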