Paper Title

Exploiting Single-Channel Speech for Multi-Channel End-to-End Speech Recognition: A Comparative Study

Paper Authors

Keyu An, Ji Xiao, Zhijian Ou

Abstract

Recently, the end-to-end training approach for multi-channel ASR, which usually consists of a beamforming front-end and a recognition back-end, has shown its effectiveness. However, end-to-end training becomes more difficult due to the integration of multiple modules, particularly considering that multi-channel speech data recorded in real environments are limited in size. This raises the demand to exploit single-channel data for multi-channel end-to-end ASR. In this paper, we systematically compare the performance of three schemes to exploit external single-channel data for multi-channel end-to-end ASR, namely back-end pre-training, data scheduling, and data simulation, under different settings such as the size of the single-channel data and the choice of the front-end. Extensive experiments on the CHiME-4 and AISHELL-4 datasets demonstrate that while all three methods improve multi-channel end-to-end speech recognition performance, data simulation outperforms the other two, at the cost of longer training time. Data scheduling outperforms back-end pre-training marginally but nearly consistently, presumably because in the pre-training stage, the back-end tends to overfit on the single-channel data, especially when the single-channel data size is small.
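To make the data scheduling scheme concrete, the sketch below shows one plausible way to interleave external single-channel batches into a multi-channel training stream. This is a minimal illustration, not the paper's implementation: the `ratio` hyperparameter, the batch tagging, and the idea that single-channel batches bypass the beamforming front-end are all assumptions for the example.

```python
import random


def schedule_batches(multi_channel_batches, single_channel_batches,
                     ratio=0.5, seed=0):
    """Mix single-channel batches into the multi-channel training stream.

    `ratio` (an assumed hyperparameter) sets how many single-channel
    batches are drawn per epoch, relative to the number of
    multi-channel batches.  Each batch is tagged with its source so the
    training loop can route it appropriately.
    """
    rng = random.Random(seed)
    n_single = int(len(multi_channel_batches) * ratio)
    mixed = [("multi", b) for b in multi_channel_batches]
    mixed += [("single", b) for b in rng.sample(single_channel_batches, n_single)]
    rng.shuffle(mixed)
    return mixed


# In a training loop, single-channel batches would feed the recognition
# back-end directly, while multi-channel batches would pass through the
# beamforming front-end first.
for source, batch in schedule_batches(list(range(4)), list(range(8))):
    pass  # run the forward/backward step appropriate to `source`
```

The key design point, as the abstract suggests, is that both data types are seen within the same training run, which avoids the overfitting observed when the back-end is pre-trained on single-channel data alone.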
