Paper Title

Exploiting Single-Channel Speech for Multi-Channel End-to-End Speech Recognition: A Comparative Study

Paper Authors

Keyu An, Ji Xiao, Zhijian Ou

Abstract

Recently, the end-to-end training approach for multi-channel ASR, which usually consists of a beamforming front-end and a recognition back-end, has shown its effectiveness. However, end-to-end training becomes more difficult due to the integration of multiple modules, particularly considering that multi-channel speech data recorded in real environments are limited in size. This raises the demand to exploit single-channel data for multi-channel end-to-end ASR. In this paper, we systematically compare the performance of three schemes to exploit external single-channel data for multi-channel end-to-end ASR, namely back-end pre-training, data scheduling, and data simulation, under different settings such as the size of the single-channel data and the choice of the front-end. Extensive experiments on the CHiME-4 and AISHELL-4 datasets demonstrate that while all three methods improve multi-channel end-to-end speech recognition performance, data simulation outperforms the other two, at the cost of longer training time. Data scheduling outperforms back-end pre-training marginally but nearly consistently, presumably because in the pre-training stage, the back-end tends to overfit on the single-channel data, especially when the single-channel data size is small.
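To make the data scheduling scheme concrete, the sketch below shows one plausible way to interleave external single-channel batches into a multi-channel training stream. This is a minimal illustration, not the paper's implementation: the `ratio` hyperparameter, the batch tagging, and the idea that single-channel batches bypass the beamforming front-end are all assumptions for the example.

```python
import random


def schedule_batches(multi_channel_batches, single_channel_batches,
                     ratio=0.5, seed=0):
    """Mix single-channel batches into the multi-channel training stream.

    `ratio` (an assumed hyperparameter) sets how many single-channel
    batches are drawn per epoch, relative to the number of
    multi-channel batches.  Each batch is tagged with its source so the
    training loop can route it appropriately.
    """
    rng = random.Random(seed)
    n_single = int(len(multi_channel_batches) * ratio)
    mixed = [("multi", b) for b in multi_channel_batches]
    mixed += [("single", b) for b in rng.sample(single_channel_batches, n_single)]
    rng.shuffle(mixed)
    return mixed


# In a training loop, single-channel batches would feed the recognition
# back-end directly, while multi-channel batches would pass through the
# beamforming front-end first.
for source, batch in schedule_batches(list(range(4)), list(range(8))):
    pass  # run the forward/backward step appropriate to `source`
```

The key design point, as the abstract suggests, is that both data types are seen within the same training run, which avoids the overfitting observed when the back-end is pre-trained on single-channel data alone.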
