Paper Title
Improving information retention in large scale online continual learning
Paper Authors
Paper Abstract
Given a stream of data sampled from non-stationary distributions, online continual learning (OCL) aims to adapt efficiently to new data while retaining existing knowledge. The typical approach to information retention (the ability to retain previous knowledge) is to keep a replay buffer of fixed size and compute gradients using a mixture of new data and the replay buffer. Surprisingly, recent work (Cai et al., 2021) suggests that information retention remains a problem in large-scale OCL even when the replay buffer is unlimited, i.e., when gradients are computed using all past data. This paper focuses on understanding and addressing this peculiarity of information retention. To pinpoint the source of the problem, we show theoretically that, given a limited computation budget at each time step and even without a strict storage limit, naively applying SGD with a constant or constantly decreasing learning rate fails to optimize information retention in the long term. We propose using a family of moving-average methods to improve optimization for non-stationary objectives. Specifically, we design an adaptive moving average (AMA) optimizer and a moving-average-based learning rate schedule (MALR). We demonstrate the effectiveness of AMA+MALR on large-scale benchmarks, including Continual Localization (CLOC), Google Landmarks, and ImageNet. Code will be released upon publication.
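
The abstract does not spell out the AMA update or MALR schedule. As a rough illustration of the moving-average family it refers to, the sketch below maintains an exponential moving average of the online model's weights during streaming SGD updates; this is a minimal, hypothetical PyTorch example, not the paper's method. The class name, the fixed decay value, and the `mix_with_replay` helper in the usage comments are assumptions, and the actual AMA optimizer adapts its averaging coefficient rather than fixing it.

```python
# Minimal sketch (assumed, not the paper's exact AMA/MALR): the online model is
# trained with SGD on each incoming mini-batch (typically mixed with replay data),
# while a second copy of the weights tracks an exponential moving average of the
# online weights and is the one used for evaluation / information retention.

import copy
import torch


class WeightEMA:
    """Exponential moving average of a model's parameters (illustrative)."""

    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        # Detached copy whose weights hold the running average.
        self.average_model = copy.deepcopy(model).eval()
        for p in self.average_model.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        # avg <- decay * avg + (1 - decay) * online
        for p_avg, p in zip(self.average_model.parameters(), model.parameters()):
            p_avg.mul_(self.decay).add_(p, alpha=1.0 - self.decay)


# Usage inside an online continual learning loop (hypothetical stream/replay API):
#
# model, optimizer = ..., torch.optim.SGD(model.parameters(), lr=0.05)
# ema = WeightEMA(model, decay=0.999)
# for new_batch in stream:
#     batch = mix_with_replay(new_batch)      # hypothetical replay-mixing helper
#     loss = criterion(model(batch.x), batch.y)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
#     ema.update(model)                       # evaluate with ema.average_model
```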