Paper Title
Evaluating the Factual Consistency of Large Language Models Through News Summarization
Paper Authors
Paper Abstract
While large language models (LLMs) have proven to be effective on a wide variety of tasks, they are also known to hallucinate information. To measure whether an LLM prefers factually consistent continuations of its input, we propose a new benchmark called FIB (Factual Inconsistency Benchmark) that focuses on the task of summarization. Specifically, our benchmark involves comparing the scores an LLM assigns to a factually consistent versus a factually inconsistent summary for an input news article. For factually consistent summaries, we use human-written reference summaries that we manually verify as factually consistent. To obtain factually inconsistent summaries, we generate summaries from a suite of summarization models and manually annotate them as factually inconsistent. A model's factual consistency is then measured by its accuracy, i.e., the proportion of documents for which it assigns a higher score to the factually consistent summary. To validate the usefulness of FIB, we evaluate 23 large language models ranging from 1B to 176B parameters across six different model families, including BLOOM and OPT. We find that existing LLMs generally assign higher scores to factually consistent summaries than to factually inconsistent summaries. However, if a factually inconsistent summary occurs verbatim in the document, then LLMs assign it a higher score than the factually consistent summary. We validate design choices in our benchmark, including the scoring method and the source of distractor summaries. Our code and benchmark data can be found at https://github.com/r-three/fib.
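The accuracy metric described in the abstract can be illustrated with a minimal sketch (this is illustrative only, not the official FIB evaluation code from the linked repository): for each document, the LLM under evaluation scores both the factually consistent and the factually inconsistent summary, and accuracy is the fraction of documents where the consistent summary scores higher. The names `fib_accuracy` and `score_fn` are hypothetical; `score_fn` stands in for whatever scoring method is used, e.g. a length-normalized log-likelihood of the summary conditioned on the document.

```python
from typing import Callable, Sequence, Tuple

def fib_accuracy(
    examples: Sequence[Tuple[str, str, str]],
    score_fn: Callable[[str, str], float],
) -> float:
    """Fraction of documents where the factually consistent summary
    receives a higher score than the factually inconsistent one.

    examples: (document, consistent_summary, inconsistent_summary) triples.
    score_fn: score_fn(document, summary) -> float, e.g. a length-normalized
              log-likelihood of the summary given the document under the LLM.
    """
    wins = sum(
        1
        for document, consistent, inconsistent in examples
        if score_fn(document, consistent) > score_fn(document, inconsistent)
    )
    return wins / len(examples)
```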