Paper Title
ESSumm: Extractive Speech Summarization from Untranscribed Meeting
Paper Authors
Paper Abstract
In this paper, we propose ESSumm, a novel architecture for direct extractive speech-to-speech summarization. It is an unsupervised model that does not depend on intermediate transcribed text. Unlike previous methods that operate on text representations, we aim to generate a summary directly from speech, without transcription. First, a set of smaller speech segments is extracted based on the acoustic features of the speech signal. For each candidate speech segment, a distance-based summarization confidence score is computed over its latent speech representation. Specifically, we leverage an off-the-shelf self-supervised convolutional neural network to extract deep speech features from raw audio. Our approach automatically predicts the optimal sequence of speech segments that captures the key information within a target summary length. Extensive results on two well-known meeting datasets (the AMI and ICSI corpora) demonstrate the effectiveness of our direct speech-based method in improving summarization quality with untranscribed data. We also observe that our unsupervised speech-based method performs on par with recent transcript-based summarization approaches, which require an additional speech recognition step.
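The abstract outlines a pipeline of segment extraction, distance-based scoring of segment embeddings, and selection under a target summary length. The sketch below illustrates that general flow only; the paper does not specify its scoring formula or selection procedure here, so the centroid-distance score and greedy selection in this snippet are illustrative assumptions, not ESSumm's actual algorithm, and the random features stand in for real self-supervised encoder outputs.

```python
import numpy as np

def summarize_segments(embeddings, durations, target_len):
    """Select speech segments for an extractive summary.

    A minimal sketch, assuming a centroid-distance score and greedy
    selection; the authors' exact scoring and selection methods may differ.

    embeddings : (n_segments, dim) deep features per segment, e.g.
                 mean-pooled outputs of a self-supervised speech encoder.
    durations  : (n_segments,) segment lengths in seconds.
    target_len : desired summary duration in seconds.
    """
    # Distance-based confidence (assumed): segments whose embeddings lie
    # close to the global centroid are treated as more representative.
    centroid = embeddings.mean(axis=0)
    scores = -np.linalg.norm(embeddings - centroid, axis=1)

    # Greedily take the highest-scoring segments that still fit within
    # the target summary length, then restore temporal order.
    picked, total = [], 0.0
    for idx in np.argsort(scores)[::-1]:
        if total + durations[idx] <= target_len:
            picked.append(idx)
            total += durations[idx]
    return sorted(picked)

# Toy usage with random features in place of real encoder outputs.
rng = np.random.default_rng(0)
emb = rng.normal(size=(20, 512))       # 20 candidate segments
dur = rng.uniform(2.0, 8.0, size=20)   # each 2-8 s long
print(summarize_segments(emb, dur, target_len=30.0))
```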