Paper title
Modeling Beats and Downbeats with a Time-Frequency Transformer
Paper authors
Abstract
The Transformer is a successful deep neural network (DNN) architecture that has shown its versatility not only in natural language processing but also in music information retrieval (MIR). In this paper, we present a novel Transformer-based approach to beat and downbeat tracking. This approach employs SpecTNT (Spectral-Temporal Transformer in Transformer), a variant of the Transformer that models both the spectral and temporal dimensions of a time-frequency input of music audio. A SpecTNT model uses a stack of blocks, each of which consists of two levels of Transformer encoders. The lower-level (or spectral) encoder handles the spectral features and enables the model to pay attention to the harmonic components of each frame. Since downbeats indicate bar boundaries and are often accompanied by harmonic changes, this step may help downbeat modeling. The upper-level (or temporal) encoder aggregates useful local spectral information to pay attention to beat/downbeat positions. We also propose an architecture that combines SpecTNT with a state-of-the-art model, Temporal Convolutional Networks (TCN), to further improve performance. Extensive experiments demonstrate that our approach can significantly outperform TCN in downbeat tracking while maintaining comparable results in beat tracking.
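The two-level idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' SpecTNT implementation: the single-head attention, the random weights, the mean pooling used to pass spectral information to the temporal level, the sigmoid output head, and all names (`two_level_block`, `init_params`, the shapes `T`, `F`, `D`) are assumptions made for illustration only. It shows one block rather than a stack.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, wq, wk, wv):
    # Single-head scaled dot-product self-attention.
    # x has shape (..., n_tokens, d); attention runs over the n_tokens axis.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def init_params(rng, d):
    # Hypothetical random projection weights; a real model would learn these.
    def qkv():
        return tuple(rng.normal(0.0, d ** -0.5, (d, d)) for _ in range(3))

    return {"spectral": qkv(), "temporal": qkv()}


def two_level_block(x, params):
    # x: (T, F, D) = (time frames, frequency bins, embedding size).
    # Lower (spectral) level: within each frame, frequency bins attend to
    # each other, which lets the model weigh harmonic components.
    spec = x + self_attention(x, *params["spectral"])        # (T, F, D)
    # Assumption: mean pooling stands in for however the real model
    # aggregates per-frame spectral information.
    frames = spec.mean(axis=1)                                # (T, D)
    # Upper (temporal) level: frames attend to each other across time.
    return frames + self_attention(frames, *params["temporal"])  # (T, D)


rng = np.random.default_rng(0)
T, F, D = 8, 16, 4
x = rng.normal(size=(T, F, D))          # stand-in for an embedded spectrogram
params = init_params(rng, D)
frames = two_level_block(x, params)

# Hypothetical output head: per-frame beat and downbeat activations.
w_out = rng.normal(size=(D, 2))
probs = 1.0 / (1.0 + np.exp(-(frames @ w_out)))
print(probs.shape)
```

In this sketch the spectral level sees only one frame at a time and the temporal level sees only one vector per frame, which mirrors the division of labor the abstract describes between the lower-level and upper-level encoders.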