SongDriver: Real-time Music Accompaniment Generation without Logical Latency nor Exposure Bias

Paper title

SongDriver: Real-time Music Accompaniment Generation without Logical Latency nor Exposure Bias

Authors

Wang, Zihao; Liang, Qihao; Zhang, Kejun; Wang, Yuxing; Zhang, Chen; Yu, Pengfei; Feng, Yongsheng; Liu, Wenbo; Wang, Yikai; Bao, Yuntai; Yang, Yiheng

Abstract

Real-time music accompaniment generation has a wide range of applications in the music industry, such as music education and live performance. However, automatic real-time accompaniment generation is still understudied and often faces a trade-off between logical latency and exposure bias. In this paper, we propose SongDriver, a real-time music accompaniment generation system with neither logical latency nor exposure bias. Specifically, SongDriver divides each accompaniment generation task into two phases: 1) the arrangement phase, in which a Transformer model arranges chords for the input melody in real time and caches them for the next phase instead of playing them out; 2) the prediction phase, in which a CRF model generates playable multi-track accompaniment for the upcoming melody based on the previously cached chords. With this two-phase strategy, SongDriver directly generates the accompaniment for the upcoming melody, achieving zero logical latency. Furthermore, when predicting chords for a time step, SongDriver refers to the cached chords from the first phase rather than to its own previous predictions, which avoids the exposure bias problem. Since the input length is often constrained under real-time conditions, another potential problem is the loss of long-term sequential information. To compensate, we extract four musical features from a long span of music preceding the current time step and use them as global information. In the experiments, we train SongDriver on open-source datasets and on an original àiSong Dataset built from Chinese-style modern pop music scores. The results show that SongDriver outperforms existing state-of-the-art (SOTA) models on both objective and subjective metrics while significantly reducing physical latency.
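The two-phase control flow described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: `arrange_chord` and `predict_next` are hypothetical rule-based stand-ins for the Transformer and CRF models, chords are reduced to a single root name, and the cache size of 4 is an arbitrary assumption. The sketch only shows that the accompaniment for a step is prepared before that step arrives, and that the predictor reads cached arranged chords rather than its own earlier outputs.

```python
from collections import deque
from typing import Deque, List, Optional

PITCH_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def arrange_chord(melody_step: List[int]) -> str:
    """Hypothetical stand-in for the arrangement-phase Transformer.

    Labels a melody step (a list of MIDI pitches) with the pitch class
    of its lowest note. A real model would output a full chord.
    """
    return PITCH_NAMES[min(melody_step) % 12]


def predict_next(cache: Deque[str]) -> str:
    """Hypothetical stand-in for the prediction-phase CRF.

    Toy rule: hold the most recently arranged chord. The input is the
    cache of arranged chords, never the predictor's own past outputs.
    """
    return cache[-1]


def accompany(melody: List[List[int]], cache_size: int = 4) -> List[Optional[str]]:
    """Return the accompaniment played alongside each melody step."""
    cache: Deque[str] = deque(maxlen=cache_size)  # arranged chords, not played
    played: List[Optional[str]] = []
    upcoming: Optional[str] = None  # prepared before the next step arrives

    for step in melody:
        # The accompaniment for this step was computed one step earlier,
        # so it can sound together with the melody (zero logical latency).
        played.append(upcoming)
        # Arrangement phase: harmonize the melody just heard and cache it.
        cache.append(arrange_chord(step))
        # Prediction phase: prepare the next step from cached chords only.
        upcoming = predict_next(cache)
    return played
```

In this sketch the first step has no accompaniment because nothing has been cached yet; how the actual system handles the opening step is not stated in the abstract.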
