Paper Title
Evaluating Subtitle Segmentation for End-to-end Generation Systems
Paper Authors
Paper Abstract
Subtitles appear on screen as short pieces of text, segmented based on formal constraints (length) and syntactic/semantic criteria. Subtitle segmentation can be evaluated with sequence segmentation metrics against a human reference. However, standard segmentation metrics cannot be applied when systems generate outputs that differ from the reference, e.g., with end-to-end subtitling systems. In this paper, we study ways to conduct reference-based evaluations of segmentation accuracy irrespective of the textual content. We first conduct a systematic analysis of existing metrics for evaluating subtitle segmentation. We then introduce $Sigma$, a new Subtitle Segmentation Score derived from an approximate upper bound of BLEU on segmentation boundaries, which allows us to disentangle the effect of good segmentation from text quality. To compare $Sigma$ with existing metrics, we further propose a method for projecting boundaries from imperfect hypotheses onto the true reference. Results show that all metrics are able to reward high-quality output, but for similar outputs the system ranking depends on each metric's sensitivity to error type. Our thorough analyses suggest that $Sigma$ is a promising segmentation candidate, but its reliability over other segmentation metrics remains to be validated through correlations with human judgements.
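To make the intuition behind $Sigma$ concrete, the sketch below illustrates the general idea only and is not the paper's exact definition: BLEU is computed on hypothesis/reference pairs with segmentation breaks inserted as tokens, then normalized by BLEU on the same text with breaks removed, treating the latter as an approximate upper bound so the ratio reflects segmentation rather than text quality. The `<eob>` break token, the `sigma_score` helper, and the use of `sacrebleu` are assumptions made for this illustration.

```python
# Illustrative sketch only: a BLEU-based segmentation score normalized by a
# break-free BLEU approximate upper bound. Not the paper's exact formula.
import sacrebleu

BREAK = "<eob>"  # assumed end-of-block token marking subtitle boundaries


def strip_breaks(text: str) -> str:
    """Remove break tokens, leaving only the textual content."""
    return " ".join(tok for tok in text.split() if tok != BREAK)


def sigma_score(hypotheses: list[str], references: list[str]) -> float:
    """Hypothetical helper: BLEU with breaks divided by BLEU without breaks.

    BLEU without breaks serves as an approximate upper bound, so the ratio
    isolates the contribution of segmentation from that of text quality.
    """
    bleu_with_breaks = sacrebleu.corpus_bleu(hypotheses, [references]).score
    bleu_no_breaks = sacrebleu.corpus_bleu(
        [strip_breaks(h) for h in hypotheses],
        [[strip_breaks(r) for r in references]],
    ).score
    return bleu_with_breaks / bleu_no_breaks if bleu_no_breaks > 0 else 0.0


# Example: identical text, differing only in boundary placement.
hyp = ["we will meet <eob> tomorrow at noon"]
ref = ["we will meet tomorrow <eob> at noon"]
print(round(sigma_score(hyp, ref), 3))
```

Because the text in this toy example is identical in hypothesis and reference, the score is penalized only by the misplaced boundary, which is the disentangling effect the abstract describes.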