Paper Title
Training Language Models with Language Feedback
Paper Authors
Paper Abstract
Pretrained language models often do not perform tasks in ways that are in line with our preferences, e.g., generating offensive text or factually incorrect summaries. Recent work approaches the above issue by learning from a simple form of human evaluation: comparisons between pairs of model-generated task outputs. Comparison feedback conveys limited information about human preferences per human evaluation. Here, we propose to learn from natural language feedback, which conveys more information per human evaluation. We learn from language feedback on model outputs using a three-step learning algorithm. First, we condition the language model on the initial output and feedback to generate many refinements. Second, we choose the refinement with the highest similarity to the feedback. Third, we finetune a language model to maximize the likelihood of the chosen refinement given the input. In synthetic experiments, we first evaluate whether language models accurately incorporate feedback to produce refinements, finding that only large language models (175B parameters) do so. Using only 100 samples of human-written feedback, our learning algorithm finetunes a GPT-3 model to roughly human-level summarization ability.
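The three-step algorithm described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `generate` and `finetune` callables are hypothetical stand-ins for a language-model API, and the string similarity used to select a refinement is a placeholder (the abstract does not specify how similarity to the feedback is scored).

```python
from typing import Callable, Dict, List
from difflib import SequenceMatcher


def pick_refinement(refinements: List[str], feedback: str) -> str:
    """Step 2: choose the candidate refinement most similar to the feedback.

    SequenceMatcher is a placeholder similarity measure for illustration only.
    """
    return max(
        refinements,
        key=lambda r: SequenceMatcher(None, r, feedback).ratio(),
    )


def learn_from_language_feedback(
    examples: List[Dict[str, str]],            # each: {"input", "output", "feedback"}
    generate: Callable[[str, int], List[str]], # hypothetical: (prompt, n) -> n samples
    finetune: Callable[[List[Dict[str, str]]], None],  # hypothetical finetuning call
    n_refinements: int = 8,
) -> None:
    finetune_data = []
    for ex in examples:
        # Step 1: condition on the input, the initial output, and the human
        # feedback, then sample several candidate refinements.
        prompt = (
            f"Input: {ex['input']}\n"
            f"Initial output: {ex['output']}\n"
            f"Feedback: {ex['feedback']}\n"
            f"Refined output:"
        )
        refinements = generate(prompt, n_refinements)

        # Step 2: keep the refinement that best reflects the feedback.
        best = pick_refinement(refinements, ex["feedback"])

        # Step 3 (data preparation): train on (input -> chosen refinement) pairs.
        finetune_data.append({"prompt": ex["input"], "completion": best})

    # Step 3: finetune to maximize the likelihood of the chosen refinements
    # given the inputs.
    finetune(finetune_data)
```

The selection in step 2 acts as a filter: refinements that ignore the feedback score low and are discarded, so the finetuning data in step 3 consists only of outputs that incorporate the human evaluation.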