Paper Title

A Comparative Study on Multichannel Speaker-Attributed Automatic Speech Recognition in Multi-party Meetings

Authors

Mohan Shi, Jie Zhang, Zhihao Du, Fan Yu, Qian Chen, Shiliang Zhang, Li-Rong Dai

Abstract

Speaker-attributed automatic speech recognition (SA-ASR) in multi-party meeting scenarios is one of the most valuable and challenging ASR tasks. It has been shown that single-channel frame-level diarization with serialized output training (SC-FD-SOT), single-channel word-level diarization with SOT (SC-WD-SOT), and joint training of single-channel target-speaker separation and ASR (SC-TS-ASR) can be exploited to partially solve this problem. In this paper, we propose three corresponding multichannel (MC) SA-ASR approaches, namely MC-FD-SOT, MC-WD-SOT, and MC-TS-ASR. Different multichannel data fusion strategies are considered for the different tasks/models, including channel-level cross-channel attention for MC-FD-SOT, frame-level cross-channel attention for MC-WD-SOT, and neural beamforming for MC-TS-ASR. Results on the AliMeeting corpus reveal that our proposed models consistently outperform the corresponding single-channel counterparts in terms of speaker-dependent character error rate.
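To make the fusion strategies concrete, the following is a minimal NumPy sketch of frame-level cross-channel attention as described in the abstract for MC-WD-SOT: at each frame, features from a reference channel act as the query and attend over the same frame across all channels. This is an illustrative simplification, not the paper's implementation; the function name, the single-head formulation, and the choice of channel 0 as reference are assumptions for this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def frame_level_cross_channel_attention(feats, ref_channel=0):
    """Fuse multichannel features by attending across channels per frame.

    feats: array of shape (C, T, D) -- C channels, T frames, D feature dims.
    Returns a fused array of shape (T, D).
    (Hypothetical single-head sketch; real models use learned projections.)
    """
    C, T, D = feats.shape
    q = feats[ref_channel]            # (T, D): per-frame queries from ref channel
    kv = feats.transpose(1, 0, 2)     # (T, C, D): per-frame keys/values over channels
    # Scaled dot-product scores between each frame's query and every channel.
    scores = np.einsum('td,tcd->tc', q, kv) / np.sqrt(D)   # (T, C)
    weights = softmax(scores, axis=-1)                     # attention over channels
    fused = np.einsum('tc,tcd->td', weights, kv)           # weighted channel mix
    return fused

# Usage: fuse 8 channels of 100 frames with 64-dim features into one stream.
x = np.random.randn(8, 100, 64)
y = frame_level_cross_channel_attention(x)
print(y.shape)  # (100, 64)
```

If all channels carry identical features, the attention weights are uniform and the fused output equals any single channel, which is a quick sanity check on the mechanism.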
