Paper Title
Mu$^{2}$SLAM: Multitask, Multilingual Speech and Language Models
Paper Authors
Paper Abstract
We present Mu$^{2}$SLAM, a multilingual sequence-to-sequence model pre-trained jointly on unlabeled speech, unlabeled text and supervised data spanning Automatic Speech Recognition (ASR), Automatic Speech Translation (AST) and Machine Translation (MT), in over 100 languages. By leveraging a quantized representation of speech as a target, Mu$^{2}$SLAM trains the speech-text models with a sequence-to-sequence masked denoising objective similar to T5 on the decoder and a masked language modeling (MLM) objective on the encoder, for both unlabeled speech and text, while utilizing the supervised tasks to improve cross-lingual and cross-modal representation alignment within the model. On CoVoST AST, Mu$^{2}$SLAM establishes a new state-of-the-art for models trained on public datasets, improving on xx-en translation over the previous best by 1.9 BLEU points and on en-xx translation by 1.1 BLEU points. On VoxPopuli ASR, our model matches the performance of an mSLAM model fine-tuned with an RNN-T decoder, despite using a relatively weaker sequence-to-sequence architecture. On text understanding tasks, our model improves by more than 6\% over mSLAM on XNLI, getting closer to the performance of mT5 models of comparable capacity on XNLI and TyDi QA, paving the way towards a single model for all speech and text understanding tasks.
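To make the pre-training recipe concrete, the sketch below illustrates the kind of T5-style span-denoising objective the abstract describes, applied to a sequence of quantized speech units: contiguous spans are replaced by sentinel tokens in the encoder input, and the decoder target is the sequence of sentinels each followed by the span it replaced. This is a minimal, hypothetical illustration only; the vocabulary layout, codebook size, masking rate, and span length are assumptions for demonstration and are not taken from the paper.

```python
import random

# Hypothetical vocabulary layout (an illustrative assumption, not the paper's ids):
#   0 .. NUM_UNITS-1   -> quantized speech units from a learned codebook
#   SENTINEL_BASE ..   -> sentinel tokens for the span-denoising objective
NUM_UNITS = 1024
SENTINEL_BASE = 1024

def span_denoise(tokens, mask_rate=0.15, mean_span=3):
    """T5-style span corruption: replace contiguous spans with sentinels.

    Returns (encoder_input, decoder_target). The decoder learns to emit
    each sentinel followed by the original tokens of the span it replaced.
    """
    n = len(tokens)
    budget = max(1, int(n * mask_rate))  # total tokens to mask
    enc, dec = [], []
    i, sentinel, masked = 0, SENTINEL_BASE, 0
    while i < n:
        if masked < budget and random.random() < mask_rate:
            span = min(mean_span, n - i, budget - masked)
            enc.append(sentinel)             # span collapsed to one sentinel
            dec.append(sentinel)             # decoder predicts sentinel ...
            dec.extend(tokens[i:i + span])   # ... then the masked tokens
            sentinel += 1
            masked += span
            i += span
        else:
            enc.append(tokens[i])            # unmasked token passes through
            i += 1
    return enc, dec

# Toy example with random stand-ins for quantized speech units:
speech_units = [random.randrange(NUM_UNITS) for _ in range(20)]
enc_in, dec_tgt = span_denoise(speech_units)
print("encoder input :", enc_in)
print("decoder target:", dec_tgt)
```

Because both speech (via its quantized units) and text reduce to token sequences, the same corruption scheme can be applied to either modality, which is what allows a single encoder-decoder to be trained jointly on both, alongside an MLM-style objective on the encoder.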