Paper Title
Generalization Bounds for Gradient Methods via Discrete and Continuous Prior
Paper Authors
Paper Abstract
Proving algorithm-dependent generalization error bounds for gradient-type optimization methods has attracted significant attention recently in learning theory. However, most existing trajectory-based analyses require either restrictive assumptions on the learning rate (e.g., a fast decreasing learning rate) or continuously injected noise (such as the Gaussian noise in Langevin dynamics). In this paper, we introduce a new discrete data-dependent prior into the PAC-Bayesian framework and prove a high-probability generalization bound of order $O(\frac{1}{n}\cdot \sum_{t=1}^T(\gamma_t/\varepsilon_t)^2\left\|{\mathbf{g}_t}\right\|^2)$ for Floored GD (i.e., a version of gradient descent with precision level $\varepsilon_t$), where $n$ is the number of training samples, $\gamma_t$ is the learning rate at step $t$, and $\mathbf{g}_t$ is roughly the difference between the gradient computed using all samples and that computed using only the prior samples. The norm $\left\|{\mathbf{g}_t}\right\|$ is upper bounded by, and typically much smaller than, the gradient norm $\left\|{\nabla f(W_t)}\right\|$. We remark that our bound holds for nonconvex and nonsmooth scenarios. Moreover, our theoretical results provide numerically favorable upper bounds on test errors (e.g., $0.037$ on MNIST). Using a similar technique, we also obtain new generalization bounds for certain variants of SGD. Furthermore, we study the generalization bounds for Gradient Langevin Dynamics (GLD). Using the same framework with a carefully constructed continuous prior, we show a new high-probability generalization bound of order $O(\frac{1}{n} + \frac{L^2}{n^2}\sum_{t=1}^T(\gamma_t/\sigma_t)^2)$ for GLD. The new $1/n^2$ rate is due to the concentration of the difference between the gradient on the training samples and that on the prior.
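The abstract does not spell out the update rules, so the following is only a minimal Python sketch of the two iterations it refers to, under the assumption that Floored GD rounds each iterate to a grid of spacing $\varepsilon_t$ and that GLD perturbs each gradient step with isotropic Gaussian noise of scale $\sigma_t$; the helper names `floored_gd_step` and `gld_step` are illustrative and not taken from the paper.

```python
import numpy as np

def floored_gd_step(w, grad, lr, eps):
    """One hypothetical Floored GD step: take a plain gradient step and round
    the result down to a grid of spacing eps (the precision level).
    The paper's exact flooring rule may differ; this only illustrates the idea."""
    w_next = w - lr * grad
    return np.floor(w_next / eps) * eps  # quantize each coordinate to precision eps

def gld_step(w, grad, lr, sigma, rng):
    """One Gradient Langevin Dynamics (GLD) step: a gradient step plus
    isotropic Gaussian noise with standard deviation sigma."""
    return w - lr * grad + sigma * rng.standard_normal(w.shape)

# Toy usage on f(w) = 0.5 * ||w||^2, whose gradient is simply w.
rng = np.random.default_rng(0)
w = rng.standard_normal(5)
for t in range(10):
    w = floored_gd_step(w, grad=w, lr=0.1, eps=1e-3)
```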