Paper Title

A Comparative Study on Multichannel Speaker-Attributed Automatic Speech Recognition in Multi-party Meetings

Authors

Mohan Shi, Jie Zhang, Zhihao Du, Fan Yu, Qian Chen, Shiliang Zhang, Li-Rong Dai

Abstract

Speaker-attributed automatic speech recognition (SA-ASR) in multi-party meeting scenarios is one of the most valuable and challenging ASR tasks. It has been shown that single-channel frame-level diarization with serialized output training (SC-FD-SOT), single-channel word-level diarization with SOT (SC-WD-SOT), and joint training of single-channel target-speaker separation and ASR (SC-TS-ASR) can be exploited to partially solve this problem. In this paper, we propose three corresponding multichannel (MC) SA-ASR approaches, namely MC-FD-SOT, MC-WD-SOT, and MC-TS-ASR. Different multichannel data fusion strategies are considered for the different tasks/models, including channel-level cross-channel attention for MC-FD-SOT, frame-level cross-channel attention for MC-WD-SOT, and neural beamforming for MC-TS-ASR. Results on the AliMeeting corpus reveal that our proposed models consistently outperform the corresponding single-channel counterparts in terms of speaker-dependent character error rate.
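To make the fusion strategies concrete, the following is a minimal NumPy sketch of frame-level cross-channel attention as described in the abstract for MC-WD-SOT: at each frame, features from a reference channel act as the query and attend over the same frame across all channels. This is an illustrative simplification, not the paper's implementation; the function name, the single-head formulation, and the choice of channel 0 as reference are assumptions for this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def frame_level_cross_channel_attention(feats, ref_channel=0):
    """Fuse multichannel features by attending across channels per frame.

    feats: array of shape (C, T, D) -- C channels, T frames, D feature dims.
    Returns a fused array of shape (T, D).
    (Hypothetical single-head sketch; real models use learned projections.)
    """
    C, T, D = feats.shape
    q = feats[ref_channel]            # (T, D): per-frame queries from ref channel
    kv = feats.transpose(1, 0, 2)     # (T, C, D): per-frame keys/values over channels
    # Scaled dot-product scores between each frame's query and every channel.
    scores = np.einsum('td,tcd->tc', q, kv) / np.sqrt(D)   # (T, C)
    weights = softmax(scores, axis=-1)                     # attention over channels
    fused = np.einsum('tc,tcd->td', weights, kv)           # weighted channel mix
    return fused

# Usage: fuse 8 channels of 100 frames with 64-dim features into one stream.
x = np.random.randn(8, 100, 64)
y = frame_level_cross_channel_attention(x)
print(y.shape)  # (100, 64)
```

If all channels carry identical features, the attention weights are uniform and the fused output equals any single channel, which is a quick sanity check on the mechanism.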
