Transformer-Based Speech Synthesizer Attribution in an Open Set Scenario

Paper Title

Transformer-Based Speech Synthesizer Attribution in an Open Set Scenario

Authors

Emily R. Bartusiak, Edward J. Delp

Abstract

Speech synthesis methods can create realistic-sounding speech, which may be used for fraud, spoofing, and misinformation campaigns. Forensic methods that detect synthesized speech are important for protection against such attacks. Forensic attribution methods provide even more information about the nature of synthesized speech signals because they identify the specific speech synthesis method (i.e., speech synthesizer) used to create a speech signal. Due to the increasing number of realistic-sounding speech synthesizers, we propose a speech attribution method that generalizes to new synthesizers not seen during training. To do so, we investigate speech synthesizer attribution in both a closed set scenario and an open set scenario. In other words, we consider some speech synthesizers to be "known" synthesizers (i.e., part of the closed set) and others to be "unknown" synthesizers (i.e., part of the open set). We represent speech signals as spectrograms and train our proposed method, known as compact attribution transformer (CAT), on the closed set for multi-class classification. Then, we extend our analysis to the open set to attribute synthesized speech signals to both known and unknown synthesizers. We utilize a t-distributed stochastic neighbor embedding (tSNE) on the latent space of the trained CAT to differentiate between each unknown synthesizer. Additionally, we explore poly-1 loss formulations to improve attribution results. Our proposed approach successfully attributes synthesized speech signals to their respective speech synthesizers in both closed and open set scenarios.
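The abstract mentions poly-1 loss formulations without giving details. As a rough illustration only, the sketch below assumes the commonly used Poly-1 form, which adds a term epsilon * (1 - p_t) to the cross-entropy, where p_t is the predicted probability of the true class. The function names, the epsilon value, and the example logits are hypothetical and are not taken from the paper.

```python
import math


def poly1_cross_entropy(logits, label, epsilon=1.0):
    """Poly-1 loss for a single example (assumed form, not the paper's code).

    loss = cross_entropy + epsilon * (1 - p_t),
    where p_t is the softmax probability assigned to the true class.
    """
    # Numerically stable softmax: subtract the max logit before exponentiating.
    max_logit = max(logits)
    exps = [math.exp(z - max_logit) for z in logits]
    p_t = exps[label] / sum(exps)

    cross_entropy = -math.log(p_t)
    return cross_entropy + epsilon * (1.0 - p_t)


def mean_poly1_loss(batch_logits, batch_labels, epsilon=1.0):
    """Average the per-example Poly-1 loss over a batch."""
    losses = [
        poly1_cross_entropy(logits, label, epsilon)
        for logits, label in zip(batch_logits, batch_labels)
    ]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Hypothetical logits for three known synthesizers (closed set).
    logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
    labels = [0, 2]
    print(mean_poly1_loss(logits, labels, epsilon=1.0))
```

With epsilon set to 0 the function reduces to ordinary cross-entropy, so epsilon controls how strongly low-confidence predictions on the true class are penalized beyond that baseline.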
