Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval

Paper Title

Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval

Authors

Yuma Koizumi, Yasunori Ohishi, Daisuke Niizumi, Daiki Takeuchi, Masahiro Yasuda

Abstract

The goal of audio captioning is to translate input audio into a natural-language description. One problem in audio captioning is the lack of training data, because audio-caption pairs are difficult to collect by crawling the web. In this study, to overcome this problem, we propose using a pre-trained large-scale language model. Since audio cannot be fed directly into such a language model, we utilize guidance captions retrieved from a training dataset based on similarities that may exist between different audio clips. The caption for the audio input is then generated by a pre-trained language model that refers to the guidance captions. Experimental results show that (i) the proposed method successfully uses a pre-trained language model for audio captioning, and (ii) the oracle performance of the pre-trained-model-based caption generator was clearly better than that of the conventional method trained from scratch.
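
The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how audio similarity is computed, so cosine similarity over fixed-length audio embeddings is assumed here, and the function names, toy embeddings, and captions are all hypothetical. The sketch only shows the general idea of picking guidance captions from training clips whose audio is closest to the input.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_guidance_captions(query_embedding, train_embeddings, train_captions, k=2):
    """Return the captions of the k training clips most similar to the query.

    Hypothetical helper: each training clip is assumed to have one embedding
    and one caption, stored at the same index in the two lists.
    """
    scored = [
        (cosine_similarity(query_embedding, emb), cap)
        for emb, cap in zip(train_embeddings, train_captions)
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cap for _, cap in scored[:k]]


# Toy data, made up for illustration only.
train_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
train_captions = [
    "A dog barks in the distance.",
    "A dog barks while a car passes.",
    "Rain falls on a metal roof.",
]
query = [1.0, 0.05, 0.0]

guidance = retrieve_guidance_captions(query, train_embeddings, train_captions, k=2)
print(guidance)
```

In a full system along the lines the abstract describes, the retrieved captions would then be passed to a pre-trained language model as guidance text when generating the caption for the input audio; that generation step is outside this sketch.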
