Paper Title

Corruption-Robust Algorithms with Uncertainty Weighting for Nonlinear Contextual Bandits and Markov Decision Processes

Authors

Chenlu Ye, Wei Xiong, Quanquan Gu, Tong Zhang

Abstract

Despite the significant interest and progress in reinforcement learning (RL) problems with adversarial corruption, current works are either confined to the linear setting or lead to an undesired $\tilde{O}(\sqrt{T}\zeta)$ regret bound, where $T$ is the number of rounds and $\zeta$ is the total amount of corruption. In this paper, we consider contextual bandits with general function approximation and propose a computationally efficient algorithm that achieves a regret of $\tilde{O}(\sqrt{T}+\zeta)$. The proposed algorithm relies on the recently developed uncertainty-weighted least-squares regression from linear contextual bandits and a new weighted estimator of uncertainty for the general function class. In contrast to existing analyses that rely heavily on the linear structure, we develop a novel technique to control the sum of weighted uncertainty, thus establishing the final regret bound. We then generalize our algorithm to the episodic MDP setting and achieve, for the first time, an additive dependence on the corruption level $\zeta$ in the general function approximation scenario. Notably, our algorithms achieve regret bounds that either nearly match the performance lower bound or improve on existing methods for all corruption levels, in both the known and unknown $\zeta$ cases.
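The core mechanism the abstract describes, down-weighting each round's contribution to the least-squares estimate in proportion to its uncertainty so that corrupted rounds cannot dominate the fit, can be sketched for the linear special case. This is an illustrative sketch only: the function name, the weighting threshold `alpha`, and the regularizer `lam` are assumptions for exposition, not the paper's exact algorithm or constants.

```python
import numpy as np

def uncertainty_weighted_regression(X, r, lam=1.0, alpha=1.0):
    """Sketch of uncertainty-weighted ridge regression for a linear
    contextual bandit. Each sample's weight is shrunk when its
    uncertainty ||x||_{Sigma^{-1}} exceeds the (hypothetical)
    threshold `alpha`, capping the influence any single, possibly
    corrupted, round can exert on the estimator.
    """
    d = X.shape[1]
    Sigma = lam * np.eye(d)   # regularized weighted covariance
    b = np.zeros(d)           # weighted sum of reward-scaled features
    for x, reward in zip(X, r):
        # Uncertainty of x under the current covariance.
        u = np.sqrt(x @ np.linalg.solve(Sigma, x))
        w = min(1.0, alpha / u)   # down-weight high-uncertainty rounds
        Sigma += w * np.outer(x, x)
        b += w * reward * x
    return np.linalg.solve(Sigma, b)  # weighted least-squares estimate
```

On clean data the weighting leaves the usual ridge estimate essentially unchanged; its effect shows up only on rounds whose features are poorly covered by past data, which is exactly where an adversary would place corruption.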
