Paper Title
ALCAP: Alignment-Augmented Music Captioner
Paper Authors
Paper Abstract
Music captioning has gained significant attention in the wake of the rising prominence of streaming media platforms. Traditional approaches often prioritize either the audio or the lyrics of a song, inadvertently ignoring the intricate interplay between the two. However, a comprehensive understanding of music requires integrating both elements. In this study, we delve into this overlooked area by introducing a method that systematically learns the multimodal alignment between audio and lyrics through contrastive learning. This not only recognizes and emphasizes the synergy between audio and lyrics but also paves the way for models to achieve deeper cross-modal coherence, thereby producing high-quality captions. We provide both theoretical and empirical results demonstrating the advantage of the proposed method, which achieves new state-of-the-art results on two music captioning datasets.
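The abstract states that the alignment between audio and lyrics is learned with a contrastive objective. As an illustration only, the sketch below implements a standard symmetric InfoNCE-style contrastive loss between per-song audio and lyrics embeddings; the function name, embedding dimensions, and temperature value are assumptions for illustration, not details taken from the paper.

```python
# A minimal sketch of a contrastive audio-lyrics alignment loss of the kind
# the abstract describes (symmetric InfoNCE). Names and hyperparameters here
# are hypothetical and not drawn from the ALCAP paper.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(audio_emb: torch.Tensor,
                               lyrics_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Pull matched audio/lyrics pairs together and push mismatched
    in-batch pairs apart.

    audio_emb, lyrics_emb: (batch, dim) embeddings from modality encoders.
    """
    # L2-normalize so dot products become cosine similarities.
    audio_emb = F.normalize(audio_emb, dim=-1)
    lyrics_emb = F.normalize(lyrics_emb, dim=-1)

    # Pairwise similarity matrix, scaled by temperature.
    logits = audio_emb @ lyrics_emb.t() / temperature

    # Positives lie on the diagonal: the i-th audio matches the i-th lyrics.
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)

    # Average the audio-to-lyrics and lyrics-to-audio cross-entropy losses.
    loss_a2l = F.cross_entropy(logits, targets)
    loss_l2a = F.cross_entropy(logits.t(), targets)
    return (loss_a2l + loss_l2a) / 2


if __name__ == "__main__":
    # Toy usage with random tensors standing in for encoder outputs.
    audio = torch.randn(8, 256)
    lyrics = torch.randn(8, 256)
    print(contrastive_alignment_loss(audio, lyrics).item())
```

In this kind of setup, the contrastive loss is typically combined with the captioning objective so that the shared representation stays coherent across modalities while still supporting caption generation.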