Paper Title
Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation
Paper Authors
Paper Abstract
Direct speech-to-speech translation (S2ST) is an attractive research topic with many advantages over cascaded S2ST. However, direct S2ST suffers from a data scarcity problem, because parallel corpora pairing source-language speech with target-language speech are very rare. To address this issue, we propose Speech2S, a model that is jointly pre-trained with unpaired speech and bilingual text data for the direct speech-to-speech translation task. By effectively leveraging the paired text data, Speech2S is able to model the cross-lingual speech conversion from the source to the target language. We verify the performance of the proposed Speech2S on the Europarl-ST and VoxPopuli datasets. Experimental results demonstrate that Speech2S achieves an improvement of about 5 BLEU points over encoder-only pre-training models, and performs comparably to or even better than existing state-of-the-art models.
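The core idea of the abstract, pre-training a single model on two unpaired data streams (speech and bilingual text), can be sketched as a combined objective. This is a minimal, hypothetical illustration, not the paper's implementation: the loss functions are placeholders, and the mixing weight `alpha` is an assumed hyperparameter.

```python
# Hypothetical sketch of joint pre-training on two unpaired data streams.
# In the real model the two losses would be, e.g., a masked-prediction loss
# on discrete speech units and a translation cross-entropy on bilingual
# text; here they are stand-in averages so the sketch is runnable.

def speech_loss(batch):
    # placeholder for the speech-side pre-training loss
    return sum(batch) / len(batch)

def text_loss(batch):
    # placeholder for the bilingual-text pre-training loss
    return sum(batch) / len(batch)

def joint_step(speech_batch, text_batch, alpha=0.5):
    """One joint training step: a weighted sum of the two objectives,
    so gradients from both data streams update the shared model."""
    return alpha * speech_loss(speech_batch) + (1 - alpha) * text_loss(text_batch)

print(joint_step([1.0, 2.0], [3.0, 5.0]))  # -> 2.75
```

In practice such objectives are usually alternated or weighted per batch; the weighted sum above is the simplest way to show that both unpaired corpora contribute to one shared set of parameters.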