Paper Title

HLT-NUS Submission for NIST 2019 Multimedia Speaker Recognition Evaluation

Paper Authors

Rohan Kumar Das, Ruijie Tao, Jichen Yang, Wei Rao, Cheng Yu, Haizhou Li

Paper Abstract

This work describes the speaker verification system developed by the Human Language Technology Laboratory, National University of Singapore (HLT-NUS) for the 2019 NIST Multimedia Speaker Recognition Evaluation (SRE). Multimedia research has attracted attention across a wide range of applications, and speaker recognition is no exception. In contrast to previous NIST SREs, the latest edition focuses on a multimedia track in which speakers are recognized using both audio and visual information. We developed separate systems for the audio and visual inputs, followed by a score-level fusion of the two modalities to use their information jointly. The audio systems are based on x-vector speaker embeddings, whereas the face recognition systems are based on ResNet and InsightFace face embeddings. With post-evaluation studies and refinements, we obtain an equal error rate (EER) of 0.88% and an actual detection cost function (actDCF) of 0.026 on the evaluation set of the 2019 NIST multimedia SRE corpus.
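The abstract describes a pipeline of per-modality scoring (x-vector speaker embeddings on the audio side, ResNet/InsightFace face embeddings on the visual side), score-level fusion of the two modalities, and EER evaluation. Below is a minimal, self-contained sketch of that kind of pipeline, not the authors' implementation: the cosine-similarity back-end, the fusion weight `w_audio`, and the synthetic trial data are all illustrative assumptions.

```python
# Sketch of a two-modality verification pipeline: cosine scoring of embeddings,
# weighted score-level fusion, and EER computation. All values are illustrative.
import numpy as np


def cosine_score(enroll_emb: np.ndarray, test_emb: np.ndarray) -> float:
    """Cosine similarity between an enrollment and a test embedding."""
    return float(
        np.dot(enroll_emb, test_emb)
        / (np.linalg.norm(enroll_emb) * np.linalg.norm(test_emb))
    )


def fuse_scores(audio_scores: np.ndarray, face_scores: np.ndarray,
                w_audio: float = 0.5) -> np.ndarray:
    """Weighted-sum fusion of per-trial scores from the two modalities.
    The weight is a placeholder; in practice it is tuned on a development set."""
    return w_audio * audio_scores + (1.0 - w_audio) * face_scores


def compute_eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """Equal error rate: the threshold where the false-accept rate (impostor
    trials accepted) and false-reject rate (target trials rejected) are closest."""
    target = scores[labels == 1]
    nontarget = scores[labels == 0]
    eer, best_gap = 1.0, np.inf
    for t in np.sort(scores):
        far = np.mean(nontarget >= t)
        frr = np.mean(target < t)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic per-trial scores for 1000 trials (label 1 = target, 0 = impostor).
    labels = rng.integers(0, 2, size=1000)
    audio = rng.normal(loc=labels.astype(float), scale=0.6)
    face = rng.normal(loc=labels.astype(float), scale=0.5)
    fused = fuse_scores(audio, face, w_audio=0.5)
    print(f"audio-only EER: {compute_eer(audio, labels):.2%}")
    print(f"fused     EER: {compute_eer(fused, labels):.2%}")
```

On synthetic data like this, the fused EER is typically lower than either single-modality EER, which is the motivation for combining the audio and face systems at the score level.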
