Paper Title

Extrinsic Evaluation of Machine Translation Metrics

Paper Authors

Nikita Moghe, Tom Sherborne, Mark Steedman, Alexandra Birch

Abstract

Automatic machine translation (MT) metrics are widely used to distinguish the translation qualities of machine translation systems across relatively large test sets (system-level evaluation). However, it is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level (segment-level evaluation). In this paper, we investigate how useful MT metrics are at detecting the success of a machine translation component when placed in a larger platform with a downstream task. We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks (dialogue state tracking, question answering, and semantic parsing). For each task, we only have access to a monolingual task-specific model. We calculate the correlation between the metric's ability to predict a good/bad translation and the success/failure on the final task for the Translate-Test setup. Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes. We also find that the scores provided by neural metrics are not interpretable, mostly because of undefined ranges. We synthesise our analysis into recommendations for future MT metrics to produce labels rather than scores for more informative interaction between machine translation and multilingual language understanding.
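To make the evaluation setup concrete, the sketch below shows one way to correlate per-segment metric scores with binary downstream outcomes in a Translate-Test pipeline. It is a minimal illustration, not the authors' exact protocol: the variable names, the toy data, and the choice of correlation statistics (Kendall's tau and point-biserial) are assumptions for the example.

```python
# Minimal sketch: correlating segment-level MT metric scores with
# downstream task success in a Translate-Test setup.
# Variable names, toy data, and the correlation choices are illustrative,
# not the paper's exact evaluation protocol.
from scipy.stats import kendalltau, pointbiserialr

# Hypothetical per-segment data:
#   metric_scores[i] -- score a metric (e.g. chrF or COMET) assigns to the
#                       machine translation of segment i
#   task_success[i]  -- 1 if the downstream monolingual model (e.g. a QA or
#                       DST model) produced the correct output for the
#                       translated segment, else 0
metric_scores = [0.71, 0.42, 0.88, 0.35, 0.93, 0.50]
task_success = [1, 0, 1, 0, 1, 1]

# Rank correlation between metric scores and task outcomes.
tau, tau_p = kendalltau(metric_scores, task_success)

# Point-biserial correlation: binary outcomes vs. continuous scores.
rpb, rpb_p = pointbiserialr(task_success, metric_scores)

print(f"Kendall tau: {tau:.3f} (p={tau_p:.3f})")
print(f"Point-biserial r: {rpb:.3f} (p={rpb_p:.3f})")
```

A near-zero value for either statistic on real data would mirror the paper's finding that segment-level metric scores carry little signal about whether the downstream task succeeds.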
