MAESTRO: Matched Speech Text Representations through Modality Matching

Paper Title

MAESTRO: Matched Speech Text Representations through Modality Matching

Authors

Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Pedro Moreno, Ankur Bapna, Heiga Zen

Abstract

We present Maestro, a self-supervised training method to unify representations learnt from speech and text modalities. Self-supervised learning from speech signals aims to learn the latent structure inherent in the signal, while self-supervised learning from text attempts to capture lexical information. Learning aligned representations from unpaired speech and text sequences is a challenging task. Previous work either implicitly enforced alignment of the representations learnt from these two modalities in the latent space through multitasking and parameter sharing, or explicitly converted between modalities via speech synthesis. While the former suffers from interference between the two modalities, the latter introduces additional complexity. In this paper, we propose Maestro, a novel algorithm that learns unified representations from both modalities simultaneously and that can transfer to diverse downstream tasks such as Automated Speech Recognition (ASR) and Speech Translation (ST). Maestro learns unified representations through sequence alignment, duration prediction and matching embeddings in the learned space through an aligned masked-language model loss. We establish a new state-of-the-art (SOTA) on VoxPopuli multilingual ASR with an 8% relative reduction in Word Error Rate (WER), on multidomain SpeechStew ASR (3.7% relative) and on 21-languages-to-English multilingual ST on CoVoST 2 with an improvement of 2.8 BLEU averaged over the 21 languages.
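
To make the idea of duration-based matching concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the form of the loss, so the mean-squared-error objective, the function names, and the toy shapes below are all assumptions chosen for illustration. The sketch repeats each text-token embedding for its (assumed given) duration so that the text sequence has the same number of frames as the speech sequence, then compares the two frame by frame.

```python
import numpy as np


def upsample_by_duration(text_emb, durations):
    """Repeat row i of text_emb durations[i] times along the time axis."""
    return np.repeat(text_emb, durations, axis=0)


def matching_loss(speech_emb, text_emb, durations):
    """Hypothetical modality-matching loss: MSE between speech frames and
    duration-upsampled text embeddings (an assumed loss, for illustration)."""
    upsampled = upsample_by_duration(text_emb, durations)
    if upsampled.shape != speech_emb.shape:
        raise ValueError("durations must sum to the number of speech frames")
    return float(np.mean((speech_emb - upsampled) ** 2))


# Toy example: 3 text tokens, 6 speech frames, 2-dimensional embeddings.
text_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
durations = np.array([2, 1, 3])  # frames per token; sums to 6
# Fake "speech" embeddings: the upsampled text shifted by a constant offset.
speech_emb = upsample_by_duration(text_emb, durations) + 0.1
loss = matching_loss(speech_emb, text_emb, durations)
```

In a real system the durations would come from an alignment or a learned duration predictor, and the embeddings from trained speech and text encoders; here they are hard-coded only to show how the two sequences are brought to the same length.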
