Paper Title
Matching Text and Audio Embeddings: Exploring Transfer-learning Strategies for Language-based Audio Retrieval
Paper Authors
Paper Abstract
We present an analysis of large-scale pretrained deep learning models used for cross-modal (text-to-audio) retrieval. We use embeddings extracted by these models in a metric learning framework to connect matching pairs of audio and text. Shallow neural networks map the embeddings to a common dimensionality. Our system, which is an extension of our submission to the Language-based Audio Retrieval Task of the DCASE Challenge 2022, employs the RoBERTa foundation model as the text embedding extractor. A pretrained PANNs model extracts the audio embeddings. To improve the generalisation of our model, we investigate how pretraining with audio and associated noisy text collected from the online platform Freesound improves the performance of our method. Furthermore, our ablation study reveals that the proper choice of the loss function and fine-tuning the pretrained models are essential in training a competitive retrieval system.
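The abstract describes a dual-encoder setup: pretrained extractors produce text and audio embeddings, shallow networks project both into a shared space, and a metric-learning loss pulls matching pairs together. The sketch below illustrates that idea in NumPy. The embedding sizes (768-d text, 2048-d audio), the 256-d shared space, and the symmetric InfoNCE-style contrastive loss are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def project(x, W, b):
    # Shallow linear head mapping an embedding into the shared space
    # (stand-in for the paper's shallow neural networks).
    return x @ W + b

def l2_normalize(x, eps=1e-9):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss: in a batch of matching (audio, text)
    pairs, the diagonal of the similarity matrix holds the positives."""
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = a @ t.T / temperature          # (N, N) cosine-similarity logits
    labels = np.arange(len(a))

    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)   # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average over both retrieval directions: audio->text and text->audio
    return 0.5 * (xent(logits) + xent(logits.T))

# Toy batch: frozen pretrained embeddings (e.g. RoBERTa text, PANNs audio).
rng = np.random.default_rng(0)
text_raw = rng.normal(size=(4, 768))
audio_raw = rng.normal(size=(4, 2048))
# Randomly initialised shallow heads projecting to a common 256-d space.
Wt, bt = rng.normal(size=(768, 256)) * 0.02, np.zeros(256)
Wa, ba = rng.normal(size=(2048, 256)) * 0.02, np.zeros(256)
loss = contrastive_loss(project(audio_raw, Wa, ba), project(text_raw, Wt, bt))
print(float(loss))
```

In training, the heads (and, per the ablation findings, the pretrained extractors themselves) would be optimised to drive this loss down, so that at retrieval time a text query's projected embedding lies closest to its matching audio clip's.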