Paper Title

How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering

Authors

Zhengbao Jiang, Jun Araki, Haibo Ding, Graham Neubig

Abstract

Recent works have shown that language models (LM) capture different types of knowledge regarding facts or common sense. However, because no model is perfect, they still fail to provide appropriate answers in many cases. In this paper, we ask the question "how can we know when language models know, with confidence, the answer to a particular query?" We examine this question from the point of view of calibration, the property of a probabilistic model's predicted probabilities actually being well correlated with the probabilities of correctness. We examine three strong generative models -- T5, BART, and GPT-2 -- and study whether their probabilities on QA tasks are well calibrated, finding the answer is a relatively emphatic no. We then examine methods to calibrate such models to make their confidence scores correlate better with the likelihood of correctness through fine-tuning, post-hoc probability modification, or adjustment of the predicted outputs or inputs. Experiments on a diverse range of datasets demonstrate the effectiveness of our methods. We also perform analysis to study the strengths and limitations of these methods, shedding light on further improvements that may be made in methods for calibrating LMs. We have released the code at https://github.com/jzbjyb/lm-calibration.
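To make the notion of calibration in the abstract concrete, here is a minimal, self-contained sketch (not the authors' released code) of two of the ideas mentioned: measuring calibration with expected calibration error (ECE), and a simple post-hoc probability adjustment via temperature scaling of a model's answer scores. The arrays `log_scores` and `correct`, and the temperature value, are made-up placeholders for illustration; in practice `log_scores` would be the LM's log-probabilities for candidate answers and `correct` whether its top answer matched the gold answer.

```python
# Illustrative sketch of calibration measurement and post-hoc adjustment,
# under the assumptions described above (not the paper's implementation).
import numpy as np

def softmax(x, temperature=1.0):
    """Convert raw log-scores into probabilities, optionally softened by a temperature."""
    z = np.asarray(x, dtype=float) / temperature
    z -= z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence and average |accuracy - confidence| over bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight each bin by its share of examples
    return ece

# Toy data: log-scores over 4 candidate answers for 3 questions (hypothetical numbers).
log_scores = np.array([[-0.2, -2.0, -3.0, -3.5],
                       [-1.0, -1.1, -1.2, -1.3],
                       [-0.1, -4.0, -4.2, -4.5]])
correct = np.array([1, 0, 1])  # whether the top-scored answer was actually right

conf_raw = softmax(log_scores).max(axis=-1)
conf_scaled = softmax(log_scores, temperature=2.0).max(axis=-1)  # post-hoc softening
print("ECE (raw):  ", expected_calibration_error(conf_raw, correct))
print("ECE (T=2.0):", expected_calibration_error(conf_scaled, correct))
```

A temperature above 1.0 flattens the predicted distribution, which can lower overconfident scores; the paper's fine-tuning and input/output adjustment methods pursue the same goal of aligning confidence with correctness by other means.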
