Paper Title
s-Transformer: Segment-Transformer for Robust Neural Speech Synthesis
Paper Authors
Paper Abstract
Neural end-to-end text-to-speech (TTS), which adopts either a recurrent model, e.g. Tacotron, or an attention-based one, e.g. Transformer, to characterize a speech utterance, has achieved significant improvements in speech synthesis. However, it is still very challenging to deal with different sentence lengths, particularly long sentences, where the sequence model is limited by its effective context length. We propose a novel segment-Transformer (s-Transformer), which models speech at the segment level, where recurrence is reused via cached memories for both the encoder and the decoder. Long-range contexts can be captured by the extended memory; meanwhile, the encoder-decoder attention operates on segments, which is much easier to handle. In addition, we employ a modified relative-position self-attention to generalize to sequence lengths possibly unseen in the training data. Comparing the proposed s-Transformer with the standard Transformer: on short sentences, both achieve the same MOS score of 4.29, which is very close to the 4.32 of the recordings; on long sentences, the two obtain similar scores of 4.22 vs. 4.2; and on extra-long sentences, the s-Transformer is significantly better, with a gain of 0.2 in MOS. Since the cached memory is updated over time, the s-Transformer generates natural and coherent speech over long stretches of audio.
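To make the segment-recurrence idea concrete, below is a minimal PyTorch-style sketch of segment-level self-attention whose keys and values are extended by a cached memory of previous segments, in the spirit of what the abstract describes. The class name, dimensions, and the detached (Transformer-XL-style) cache update are illustrative assumptions, not the authors' implementation, which also covers the decoder side and a modified relative-position self-attention.

```python
# Minimal sketch (assumed names/dimensions): segment-level self-attention
# with a cached memory that carries long-range context across segments.
from typing import Optional

import torch
import torch.nn as nn


class SegmentSelfAttention(nn.Module):
    """Multi-head self-attention over one segment, with keys/values
    extended by a cached memory of previous segments."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, mem_len: int = 64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mem_len = mem_len

    def forward(self, segment: torch.Tensor, memory: Optional[torch.Tensor]):
        # segment: (batch, seg_len, d_model); memory: (batch, <=mem_len, d_model) or None
        if memory is None:
            context = segment
        else:
            # Keys/values attend over [cached memory ; current segment], so
            # context beyond the current segment remains visible.
            context = torch.cat([memory, segment], dim=1)
        out, _ = self.attn(query=segment, key=context, value=context)
        # Update the cache with the most recent hidden states; detach so no
        # gradients flow back through earlier segments (assumed design choice).
        new_memory = context[:, -self.mem_len:, :].detach()
        return out, new_memory


if __name__ == "__main__":
    layer = SegmentSelfAttention()
    memory = None
    # Process a long utterance segment by segment, carrying the cache forward.
    for _ in range(3):
        seg = torch.randn(2, 32, 256)   # (batch, seg_len, d_model)
        out, memory = layer(seg, memory)
    print(out.shape, memory.shape)      # (2, 32, 256) and (2, 64, 256)
```

Because each attention call only queries one segment while reading from a bounded cache, the per-step cost stays fixed regardless of utterance length, which is the property the abstract relies on for extra-long sentences.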