Paper Title
T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics
Paper Authors
Paper Abstract
Modern embedding-based metrics for evaluation of generated text generally fall into one of two paradigms: discriminative metrics that are trained to directly predict which outputs are of higher quality according to supervised human annotations, and generative metrics that are trained to evaluate text based on the probabilities of a generative model. Both have their advantages; discriminative metrics are able to directly optimize for the problem of distinguishing between good and bad outputs, while generative metrics can be trained using abundant raw text. In this paper, we present a framework that combines the best of both worlds, using both supervised and unsupervised signals from whatever data is available. We operationalize this idea by training T5Score, a metric that uses these training signals with mT5 as the backbone. We perform an extensive empirical comparison with other existing metrics on 5 datasets, 19 languages, and 280 systems, demonstrating the utility of our method. Experimental results show that T5Score achieves the best performance on all datasets against existing top-scoring metrics at the segment level. We release our code and models at https://github.com/qinyiwei/T5Score.
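To make the generative side of this framework concrete, the sketch below shows one common way to score a hypothesis with an mT5-based generative metric: the score is the average log-probability of the hypothesis tokens conditioned on a reference. This is a minimal illustration under assumed choices (the "google/mt5-small" checkpoint and the reference-to-hypothesis scoring direction are placeholders), not the authors' released T5Score implementation.

```python
# Minimal sketch of a generative (probability-based) score with an mT5 backbone.
# Assumptions: model checkpoint and scoring direction are illustrative only.
import torch
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
model.eval()

def generative_score(reference: str, hypothesis: str) -> float:
    """Average per-token log-likelihood of `hypothesis` given `reference`."""
    inputs = tokenizer(reference, return_tensors="pt")
    labels = tokenizer(hypothesis, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(**inputs, labels=labels)
    # out.loss is the mean cross-entropy over hypothesis tokens, so its
    # negation is the average log-probability per token (higher = better).
    return -out.loss.item()

print(generative_score("The cat sat on the mat.", "A cat is sitting on the mat."))
```

Because such a score requires only raw parallel or reference text, it can be computed without any human quality annotations; the discriminative signal described in the abstract would additionally fine-tune the same backbone on supervised preference judgments.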