PRISM: Pre-trained Indeterminate Speaker Representation Model for Speaker Diarization and Speaker Verification

Paper Title

PRISM: Pre-trained Indeterminate Speaker Representation Model for Speaker Diarization and Speaker Verification

Authors

Zheng, Siqi; Suo, Hongbin; Chen, Qian

Abstract

Speaker embeddings have been a fundamental feature for speaker-related tasks such as verification, clustering, and diarization. Traditionally, speaker embeddings are represented as fixed vectors in a high-dimensional space. This can lead to biased estimates, especially when handling shorter utterances. In this paper we propose to represent a speaker utterance as a "floating" vector whose state is indeterminate without knowing the context. The state of a speaker representation is jointly determined by the utterance itself, other speech from the same speaker, and the other speakers it is being compared to. The content of the speech also contributes to determining the final state of a speaker representation. We pre-train an indeterminate speaker representation model that estimates the state of an utterance based on the context. The pre-trained model can be fine-tuned for downstream tasks such as speaker verification, speaker clustering, and speaker diarization. Substantial improvements are observed across all downstream tasks.
