论文标题
多相第二价格拍卖设计中的强化学习方法
A Reinforcement Learning Approach in Multi-Phase Second-Price Auction Design
论文作者
论文摘要
我们在多阶段的第二价格拍卖中研究储备价格优化,卖方的先前行动会通过马尔可夫决策过程(MDP)影响竞标者的后来估值。与现有作品中的强盗设置相比,我们的设置涉及三个挑战。首先,从卖方的角度来看,我们需要在有可能操纵卖方政策的潜在竞标者面前有效地探索环境。其次,当市场噪声分布未知时,我们希望最大程度地减少卖方的收入后悔。第三,卖方的每步收入是未知的,非线性的,甚至无法直接从环境中观察到。 我们提出了解决这三个挑战的机制。为了应对第一个挑战,我们使用了一种名为“缓冲时期”的新技术和较低开关成本的增强学习(RL)的灵感来限制投标人的盈余,从而激励大约真实的竞标。第二种是通过一种新颖的算法来解决的,该算法在市场噪声分布未知时消除了对纯探索的需求。第三个挑战是通过LSVI-UCB的扩展来解决的,在那里我们使用拍卖的基础结构来控制收入功能的不确定性。这三种技术在$ \下划线{\ rm c} $ ontextual-$ \ usewissline {\ rm l} $ svi-$ svi-$ \ usepline {\ rm u} $ cb - $ cb - $ \ $ \ useverline {\ rm b} \ Mathcal {o}}(H^{5/2} \ SQRT {K})$收入后悔,当市场噪音已知并且$ \ tilde {\ tilde {\ tilde {\ tillcal {o}}(h^{3} \ sqrt {k})$ revenue not nose noise noise noise'''''''''''''''''no noise''
We study reserve price optimization in multi-phase second price auctions, where seller's prior actions affect the bidders' later valuations through a Markov Decision Process (MDP). Compared to the bandit setting in existing works, the setting in ours involves three challenges. First, from the seller's perspective, we need to efficiently explore the environment in the presence of potentially nontruthful bidders who aim to manipulates seller's policy. Second, we want to minimize the seller's revenue regret when the market noise distribution is unknown. Third, the seller's per-step revenue is unknown, nonlinear, and cannot even be directly observed from the environment. We propose a mechanism addressing all three challenges. To address the first challenge, we use a combination of a new technique named "buffer periods" and inspirations from Reinforcement Learning (RL) with low switching cost to limit bidders' surplus from untruthful bidding, thereby incentivizing approximately truthful bidding. The second one is tackled by a novel algorithm that removes the need for pure exploration when the market noise distribution is unknown. The third challenge is resolved by an extension of LSVI-UCB, where we use the auction's underlying structure to control the uncertainty of the revenue function. The three techniques culminate in the $\underline{\rm C}$ontextual-$\underline{\rm L}$SVI-$\underline{\rm U}$CB-$\underline{\rm B}$uffer (CLUB) algorithm which achieves $\tilde{ \mathcal{O}}(H^{5/2}\sqrt{K})$ revenue regret when the market noise is known and $\tilde{ \mathcal{O}}(H^{3}\sqrt{K})$ revenue regret when the noise is unknown with no assumptions on bidders' truthfulness.