Paper Title

RL with KL penalties is better viewed as Bayesian inference

Paper Authors

Tomasz Korbak, Ethan Perez, Christopher L. Buckley

Paper Abstract

Reinforcement learning (RL) is frequently employed in fine-tuning large language models (LMs), such as GPT-3, to penalize them for undesirable features of generated sequences, such as offensiveness, social bias, harmfulness or falsehood. The RL formulation involves treating the LM as a policy and updating it to maximise the expected value of a reward function which captures human preferences, such as non-offensiveness. In this paper, we analyze challenges associated with treating a language model as an RL policy and show how avoiding those challenges requires moving beyond the RL paradigm. We start by observing that the standard RL approach is flawed as an objective for fine-tuning LMs because it leads to distribution collapse: turning the LM into a degenerate distribution. Then, we analyze KL-regularised RL, a widely used recipe for fine-tuning LMs, which additionally constrains the fine-tuned LM to stay close to its original distribution in terms of Kullback-Leibler (KL) divergence. We show that KL-regularised RL is equivalent to variational inference: approximating a Bayesian posterior which specifies how to update a prior LM to conform with evidence provided by the reward function. We argue that this Bayesian inference view of KL-regularised RL is more insightful than the typically employed RL perspective. The Bayesian inference view explains how KL-regularised RL avoids the distribution collapse problem and offers a first-principles derivation for its objective. While this objective happens to be equivalent to RL (with a particular choice of parametric reward), there exist other objectives for fine-tuning LMs which are no longer equivalent to RL. That observation leads to a more general point: RL is not an adequate formal framework for problems such as fine-tuning language models. These problems are best viewed as Bayesian inference: approximating a pre-defined target distribution.
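The equivalence claimed in the abstract can be made concrete with a short derivation. The following is a minimal sketch in standard notation, not quoted from the paper: π_θ denotes the fine-tuned LM, π_0 the original (prior) LM, r the reward function, and β the KL coefficient.

J(\theta) = \mathbb{E}_{x \sim \pi_\theta}\left[ r(x) \right] - \beta\, \mathrm{KL}\left( \pi_\theta \,\|\, \pi_0 \right)
          = -\beta\, \mathrm{KL}\left( \pi_\theta \,\|\, \pi^* \right) + \beta \log Z,
\qquad \pi^*(x) \propto \pi_0(x)\, \exp\left( r(x)/\beta \right),

where Z is a normalising constant independent of θ. Maximising J(θ) is therefore the same as minimising the KL divergence between π_θ and the fixed target π*, i.e. variational inference towards a Bayesian posterior with prior π_0 and likelihood term exp(r(x)/β). Because π* has full support whenever π_0 does, the optimum is not a degenerate distribution, which is the intuition behind the distribution-collapse argument above.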
