Paper Title

Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis

Paper Authors

Wenda Xu, Yilin Tuan, Yujie Lu, Michael Saxon, Lei Li, William Yang Wang

Paper Abstract

Is it possible to build a general and automatic natural language generation (NLG) evaluation metric? Existing learned metrics either perform unsatisfactorily or are restricted to tasks where large human rating data is already available. We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation, by utilizing a novel, iterative error synthesis and severity scoring pipeline. This pipeline applies a series of plausible errors to raw text and assigns severity labels by simulating human judgements with entailment. We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings. SESCORE outperforms all prior unsupervised metrics on multiple diverse NLG tasks, including machine translation, image captioning, and WebNLG text generation. For WMT 20/21 En-De and Zh-En, SESCORE improves the average Kendall correlation with human judgement from 0.154 to 0.195. SESCORE even achieves comparable performance to the best supervised metric, COMET, despite receiving no human-annotated training data.
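The abstract outlines the core idea: iteratively inject plausible errors into clean text, label each error's severity by simulating human judgement, and use the accumulated penalty as a synthetic training signal. Below is a minimal, self-contained Python sketch of that loop under loose assumptions: the toy error operators (drop_word, repeat_word, swap_adjacent), the MQM-style severity weights MINOR/MAJOR, and the surface-level severity() heuristic are all illustrative stand-ins for the paper's model-based perturbations and entailment-based severity labelling, not SESCORE's actual implementation.

```python
import random

# Assumed MQM-style severity weights; the real pipeline derives
# severity from an entailment (NLI) model simulating human judgement.
MINOR, MAJOR = -1.0, -5.0

def drop_word(tokens, rng):
    """Delete a random token (toy stand-in for an omission error)."""
    if len(tokens) > 1:
        del tokens[rng.randrange(len(tokens))]
    return tokens

def repeat_word(tokens, rng):
    """Duplicate a random token (toy fluency error)."""
    i = rng.randrange(len(tokens))
    tokens.insert(i, tokens[i])
    return tokens

def swap_adjacent(tokens, rng):
    """Swap two neighbouring tokens (toy word-order error)."""
    if len(tokens) > 1:
        i = rng.randrange(len(tokens) - 1)
        tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    return tokens

def severity(original, perturbed):
    """Placeholder for entailment-based severity labelling: errors that
    break entailment with the original would be major, others minor.
    A crude length heuristic stands in here purely for illustration."""
    return MAJOR if len(perturbed) != len(original) else MINOR

def synthesize(reference, num_steps=3, seed=0):
    """Iteratively apply errors and accumulate a pseudo human score.
    The resulting (reference, perturbed, score) triples would train a
    regression metric to predict the score."""
    rng = random.Random(seed)
    tokens, score = reference.split(), 0.0
    for _ in range(num_steps):
        op = rng.choice([drop_word, repeat_word, swap_adjacent])
        before = list(tokens)
        tokens = op(tokens, rng)
        score += severity(before, tokens)
    return " ".join(tokens), score

if __name__ == "__main__":
    text, score = synthesize("the quick brown fox jumps over the lazy dog")
    print(f"perturbed: {text!r}  pseudo-score: {score}")
```

A metric trained on such pairs is then judged, as in the abstract, by how well its scores correlate with human ratings, e.g. via Kendall's tau (scipy.stats.kendalltau).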
