Paper Title
Active Speakers in Context
Paper Authors
Paper Abstract
Current methods for active speaker detection focus on modeling short-term audiovisual information from a single speaker. Although this strategy can be enough for addressing single-speaker scenarios, it prevents accurate detection when the task is to identify which of many candidate speakers is talking. This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons. Our Active Speaker Context is designed to learn pairwise and temporal relations from a structured ensemble of audio-visual observations. Our experiments show that a structured feature ensemble alone already benefits active speaker detection performance. Moreover, we find that the proposed Active Speaker Context improves the state of the art on the AVA-ActiveSpeaker dataset, achieving a mAP of 87.1%. We present ablation studies verifying that this result is a direct consequence of our long-term multi-speaker analysis.
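To make the abstract's core idea concrete, the sketch below illustrates one plausible reading of a "structured ensemble" with pairwise and temporal relations: per-speaker audiovisual features stacked over speakers and time, from which pairwise speaker affinities and long-horizon temporal summaries are computed. This is a minimal illustration under assumed tensor shapes and operations, not the paper's actual architecture; all names and sizes here are hypothetical.

```python
import numpy as np

# Hypothetical sizes: S candidate speakers, T time steps, D feature dims.
# The paper does not specify this exact computation; this is an illustration.
rng = np.random.default_rng(0)
S, T, D = 3, 8, 16
ensemble = rng.standard_normal((S, T, D))  # structured ensemble of features

def pairwise_relations(x):
    """Scaled dot-product affinities between every pair of speakers,
    computed independently at each time step: (S, T, D) -> (T, S, S)."""
    return np.einsum('std,utd->tsu', x, x) / np.sqrt(x.shape[-1])

def temporal_context(x):
    """Summarize each speaker over the long time horizon by mean pooling:
    (S, T, D) -> (S, D)."""
    return x.mean(axis=1)

A = pairwise_relations(ensemble)   # who relates to whom, per time step
C = temporal_context(ensemble)     # long-term per-speaker summaries
print(A.shape, C.shape)            # (8, 3, 3) (3, 16)
```

In a real model, the affinity and pooling steps would typically be learned (e.g. attention layers and recurrent or convolutional temporal modules); the point here is only the data layout: multiple speakers, long time horizon, joint reasoning.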