Paper Title

Cross-Channel Attention-Based Target Speaker Voice Activity Detection: Experimental Results for M2MeT Challenge

Paper Authors

Weiqing Wang, Xiaoyi Qin, Ming Li

Paper Abstract

In this paper, we present the speaker diarization system of team DKU_DukeECE for the Multi-channel Multi-party Meeting Transcription Challenge (M2MeT). Since highly overlapped speech exists in the dataset, we employ an x-vector-based target-speaker voice activity detection (TS-VAD) model to find the overlap between speakers. For the single-channel scenario, we separately train a model for each of the 8 channels and fuse the results. We also employ cross-channel self-attention to further improve the performance, where the non-linear spatial correlations between different channels are learned and fused. Experimental results on the evaluation set show that the single-channel TS-VAD reduces the DER by over 75%, from 12.68% to 3.14%. The multi-channel TS-VAD further reduces the DER by 28% and achieves a DER of 2.26%. Our final submitted system achieves a DER of 2.98% on the AliMeeting test set, which ranks 1st in the M2MeT challenge.
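
To make the cross-channel attention idea concrete, below is a minimal PyTorch sketch of how per-frame features from the 8 microphone channels can attend to one another and then be fused. The feature dimension, number of heads, residual/layer-norm structure, and mean-based channel fusion are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn


class CrossChannelSelfAttention(nn.Module):
    """Minimal sketch: at every time frame, the 8 channel feature vectors
    attend to each other so that non-linear spatial correlations between
    channels can be learned, and the channels are then fused by averaging.
    Layer sizes and the fusion rule are illustrative assumptions."""

    def __init__(self, feat_dim: int = 256, num_heads: int = 4):
        super().__init__()
        # batch_first=True requires PyTorch >= 1.9.
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, feat_dim) -- frame-level features per channel
        b, c, t, d = x.shape
        # Fold time into the batch so the channel axis becomes the sequence
        # that self-attention operates over.
        x = x.permute(0, 2, 1, 3).reshape(b * t, c, d)
        attn_out, _ = self.attn(x, x, x)          # channels attend to each other
        x = self.norm(x + attn_out)               # residual connection + layer norm
        x = x.reshape(b, t, c, d).permute(0, 2, 1, 3)
        # Fuse channels by averaging (one simple choice among several).
        return x.mean(dim=1)                      # (batch, time, feat_dim)


if __name__ == "__main__":
    feats = torch.randn(2, 8, 100, 256)           # 2 utterances, 8 channels, 100 frames
    fused = CrossChannelSelfAttention()(feats)
    print(fused.shape)                            # torch.Size([2, 100, 256])
```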
