Paper Title

Sustainable Online Reinforcement Learning for Auto-bidding

Authors

Zhiyu Mou, Yusen Huo, Rongquan Bai, Mingzhou Xie, Chuan Yu, Jian Xu, Bo Zheng

Abstract

Recently, the auto-bidding technique has become an essential tool for increasing the revenue of advertisers. Facing the complex and ever-changing bidding environments of the real-world advertising system (RAS), state-of-the-art auto-bidding policies usually leverage reinforcement learning (RL) algorithms to generate real-time bids on behalf of advertisers. Due to safety concerns, it was believed that the RL training process could only be carried out in an offline virtual advertising system (VAS) built from historical data generated in the RAS. In this paper, we argue that there exist significant gaps between the VAS and the RAS, which make the RL training process suffer from the problem of inconsistency between online and offline (IBOO). First, we formally define the IBOO and systematically analyze its causes and influences. Then, to avoid the IBOO, we propose a sustainable online RL (SORL) framework that trains the auto-bidding policy by directly interacting with the RAS, instead of learning in the VAS. Specifically, based on our proof of the Lipschitz smoothness of the Q function, we design a safe and efficient online exploration (SER) policy for continuously collecting data from the RAS, and we derive a theoretical lower bound on its safety. We also develop a variance-suppressed conservative Q-learning (V-CQL) method to learn the auto-bidding policy effectively and stably from the collected data. Finally, extensive simulated and real-world experiments validate the superiority of our approach over state-of-the-art auto-bidding algorithms.
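The safety argument behind the SER policy rests on the Lipschitz smoothness of the Q function: if |Q(s, a) − Q(s, a′)| ≤ L·‖a − a′‖, then any exploratory bid within radius (Q(s, a_safe) − Q_floor)/L of a known-safe bid cannot push the Q-value below the floor. The sketch below illustrates only this bound; the radius formula, function names, and parameters are our own assumptions, not the paper's actual SER construction.

```python
# Minimal sketch (not the paper's SER policy): exploration bounded by a
# Lipschitz constant L on Q over actions. All names and the radius formula
# are illustrative assumptions.
import numpy as np

def safe_exploration_action(a_safe: np.ndarray, q_safe: float,
                            q_floor: float, L: float,
                            rng: np.random.Generator) -> np.ndarray:
    """Perturb a known-safe bid a_safe while guaranteeing, under
    |Q(s, a) - Q(s, a')| <= L * ||a - a'||, that Q stays above q_floor."""
    # Worst-case Q loss within radius r is L * r, so require L * r <= q_safe - q_floor.
    radius = max((q_safe - q_floor) / L, 0.0)
    direction = rng.standard_normal(a_safe.shape)
    direction /= np.linalg.norm(direction) + 1e-12
    return a_safe + rng.uniform(0.0, radius) * direction

# Example: a 1-d bid of 3.0 with Q = 10.0, a floor of 9.0, and L = 2.0
# yields a safe exploration radius of 0.5.
rng = np.random.default_rng(0)
print(safe_exploration_action(np.array([3.0]), 10.0, 9.0, 2.0, rng))
```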
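The abstract does not spell out the V-CQL objective. One plausible reading combines a standard TD loss, a CQL-style conservatism gap, and a penalty on the variance of an ensemble of Q estimates; everything below, including the ensemble-based variance estimate and the coefficients, is a hypothetical sketch rather than the paper's method.

```python
# Hypothetical sketch of a conservative Q-learning loss with a
# variance-suppression term. This is one possible reading of "V-CQL",
# not the paper's actual objective; all names and coefficients are assumed.
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Small MLP critic: Q(s, a) -> scalar per batch element."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

def v_cql_style_loss(q_ensemble, target_q, policy, batch,
                     gamma=0.99, alpha=1.0, beta=0.1):
    """TD loss + CQL conservatism gap + ensemble-variance penalty."""
    s, a, r, s2 = batch
    with torch.no_grad():
        td_target = r + gamma * target_q(s2, policy(s2))          # [B]
    q_data = torch.stack([q(s, a) for q in q_ensemble])           # [E, B]
    td_loss = ((q_data - td_target.unsqueeze(0)) ** 2).mean()
    # Conservatism: push Q down on policy actions, up on dataset actions.
    q_pi = torch.stack([q(s, policy(s)) for q in q_ensemble])
    cql_gap = (q_pi - q_data).mean()
    # Variance suppression: penalize disagreement across the ensemble so
    # the learned Q (and the derived bidding policy) stays stable.
    var_penalty = q_data.var(dim=0).mean()
    return td_loss + alpha * cql_gap + beta * var_penalty
```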
