Deep Spectro-temporal Artifacts for Detecting Synthesized Speech

Paper Title

Deep Spectro-temporal Artifacts for Detecting Synthesized Speech

Authors

Liu, Xiaohui; Liu, Meng; Zhang, Lin; Zhang, Linjuan; Zeng, Chang; Li, Kai; Li, Nan; Lee, Kong Aik; Wang, Longbiao; Dang, Jianwu

Abstract

The Audio Deep Synthesis Detection (ADD) Challenge has been held with the goal of detecting generated human-like speech. This paper gives an overall assessment of our submitted systems for track 1 (Low-quality Fake Audio Detection) and track 2 (Partially Fake Audio Detection). We detect spectro-temporal artifacts using raw temporal signals, spectral features, and deep embedding features. For track 1, our system combines low-quality data augmentation, domain adaptation via fine-tuning, and fusion of complementary feature information. We also analyze the clustering characteristics of subsystems with different features through visualization and explain the effectiveness of our proposed greedy fusion strategy. For track 2, frame transition and smoothing are detected using a self-supervised learning structure to capture the manipulation introduced by partially fake (PF) attacks in the time domain. We ranked 4th and 5th in track 1 and track 2, respectively.
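The abstract names a greedy fusion strategy but does not spell out its steps. As a rough illustration only, the sketch below shows one common form of greedy score-level fusion: forward selection that keeps adding the subsystem whose averaged scores most reduce the equal error rate (EER) on a development set. This is an assumption about the general technique, not the authors' procedure; the function names, the subsystem names (`raw`, `spec`, `emb`), the equal-weight averaging, and the toy scores are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def equal_error_rate(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Approximate EER. Assumes a higher score means 'genuine' (label 1); label 0 is fake."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(scores)):
        far = sum(s >= t for s in neg) / len(neg)  # fake accepted as genuine
        frr = sum(s < t for s in pos) / len(pos)   # genuine rejected as fake
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


def greedy_fusion(
    subsystems: Dict[str, List[float]], labels: List[int]
) -> Tuple[List[str], float]:
    """Forward selection: add the subsystem that lowers the fused EER most, stop when none helps."""
    selected: List[str] = []
    best_eer = float("inf")
    fused_sum = [0.0] * len(labels)  # running sum of the selected subsystems' scores
    remaining = set(subsystems)
    while remaining:
        candidate, cand_eer = None, best_eer
        for name in sorted(remaining):  # sorted for deterministic tie-breaking
            trial = [
                (f + s) / (len(selected) + 1)
                for f, s in zip(fused_sum, subsystems[name])
            ]
            eer = equal_error_rate(trial, labels)
            if eer < cand_eer:
                candidate, cand_eer = name, eer
        if candidate is None:  # no remaining subsystem improves the fused EER
            break
        selected.append(candidate)
        remaining.remove(candidate)
        fused_sum = [f + s for f, s in zip(fused_sum, subsystems[candidate])]
        best_eer = cand_eer
    return selected, best_eer


# Toy development-set scores (made up for illustration).
labels = [1, 1, 1, 0, 0, 0]
subsystems = {
    "raw": [0.9, 0.4, 0.8, 0.5, 0.2, 0.1],
    "spec": [0.3, 0.9, 0.7, 0.2, 0.6, 0.1],
    "emb": [0.6, 0.5, 0.4, 0.5, 0.6, 0.3],
}
selected, eer = greedy_fusion(subsystems, labels)
print(selected, eer)
```

In this toy setup the two subsystems that make different errors complement each other once averaged, which is the intuition behind fusing complementary features. A real system would likely calibrate or weight the scores rather than average them equally.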
