Paper title

High-resolution embedding extractor for speaker diarisation

Authors

Heo, Hee-Soo; Kwon, Youngki; Lee, Bong-Jin; Kim, You Jin; Jung, Jee-weon

Abstract

Speaker embedding extractors significantly influence the performance of clustering-based speaker diarisation systems. Conventionally, only one embedding is extracted from each speech segment. However, because segments are produced by a sliding window, a segment that spans a speaker change point can easily contain two or more speakers. This study proposes a novel embedding extractor architecture, referred to as a high-resolution embedding extractor (HEE), which extracts multiple high-resolution embeddings from each speech segment. HEE consists of a feature-map extractor and an enhancer, where the enhancer with its self-attention mechanism is the key to success. The enhancer of HEE replaces the aggregation process: instead of applying a global pooling layer, it uses attention over the global context to combine the relevant information for each frame. Each of the extracted dense frame-level embeddings can represent a speaker, so multiple speakers can be represented by different frame-level features within a single segment. We also propose a training framework that uses artificially generated mixture data to train the proposed HEE. In experiments on five evaluation sets, including four public datasets, the proposed HEE demonstrates at least 10% improvement on each evaluation set except one, which our analysis suggests contains few rapid speaker changes.
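The contrast between global pooling and a self-attention enhancer can be illustrated with a small toy sketch. This is not the authors' implementation: the function names `mean_pool` and `self_attention_enhance` are hypothetical, the attention uses a single head with identity query/key/value projections (an assumption made for brevity), and the two-dimensional "frames" are made-up values standing in for the output of a feature-map extractor.

```python
import math


def mean_pool(frames):
    """Conventional aggregation: collapse all frames into ONE segment embedding."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_enhance(frames):
    """Return one enhanced embedding PER FRAME instead of pooling.

    Assumption: single head, identity projections (query = key = value = frame).
    Each output frame is an attention-weighted sum over all frames in the
    segment, so it draws on global context while staying frame-level.
    """
    dim = len(frames[0])
    scale = math.sqrt(dim)
    enhanced = []
    for query in frames:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in frames]
        weights = softmax(scores)
        enhanced.append(
            [sum(w * value[d] for w, value in zip(weights, frames)) for d in range(dim)]
        )
    return enhanced


# Toy segment containing a speaker change: two frames resembling
# "speaker A" followed by two frames resembling "speaker B".
segment = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [0.0, 4.0]]

pooled = mean_pool(segment)                 # a single vector mixing both speakers
enhanced = self_attention_enhance(segment)  # four vectors, one per frame

print("pooled:", pooled)
for i, emb in enumerate(enhanced):
    print(f"frame {i}:", [round(x, 3) for x in emb])
```

In this toy case the pooled vector blends both speakers into one point, while the frame-level outputs stay close to the speaker each frame came from, which is the property a clustering-based diarisation back end would rely on.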
