Paper Title
Training Personalized Recommendation Systems from (GPU) Scratch: Look Forward not Backwards
Paper Authors
Paper Abstract
Personalized recommendation models (RecSys) are one of the most popular machine learning workloads serviced by hyperscalers. A critical challenge in training RecSys is its high memory capacity requirement, reaching hundreds of GBs to TBs of model size. In RecSys, the so-called embedding layers account for the majority of memory usage, so current systems employ a hybrid CPU-GPU design in which the large CPU memory stores the memory-hungry embedding layers. Unfortunately, training embeddings involves several memory-bandwidth-intensive operations, which are at odds with the slow CPU memory, causing performance overheads. Prior work proposed caching frequently accessed embeddings inside GPU memory as a means to filter down the embedding layer traffic to CPU memory, but this paper observes several limitations with such cache designs. In this work, we present a fundamentally different approach to designing embedding caches for RecSys. Our proposed ScratchPipe architecture utilizes unique properties of RecSys training to develop an embedding cache that sees not only past but also "future" cache accesses. ScratchPipe exploits this property to guarantee that the active working set of the embedding layers is "always" captured inside our proposed cache design, enabling embedding layer training to be conducted at GPU memory speed.
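The abstract's key idea is that, because RecSys training inputs are queued ahead of time, the cache can look at *future* embedding accesses rather than relying only on past access history. A minimal Python sketch of that idea is below; it is an illustrative toy, not the paper's actual ScratchPipe design, and all class and method names here are hypothetical. It models a fixed-capacity GPU-side cache that, on a miss, evicts the entry whose next use lies furthest in the known future (Belady/MIN-style eviction).

```python
from collections import deque

class LookaheadEmbeddingCache:
    """Toy embedding cache that exploits knowledge of upcoming training
    batches to keep the active working set resident.

    Illustrative sketch only: the real ScratchPipe architecture is a
    runtime/system design, not this data structure.
    """

    def __init__(self, capacity, future_batches):
        self.capacity = capacity
        # Upcoming batches of embedding IDs, known in advance because
        # training inputs are queued ahead of time.
        self.future = deque(future_batches)
        self.cache = set()
        self.hits = 0
        self.misses = 0

    def _next_use(self, emb_id):
        # How many batches until emb_id is needed again (inf if never).
        for t, batch in enumerate(self.future):
            if emb_id in batch:
                return t
        return float("inf")

    def access_batch(self):
        batch = self.future.popleft()
        for emb_id in batch:
            if emb_id in self.cache:
                self.hits += 1
            else:
                self.misses += 1  # would trigger a fetch from CPU memory
                if len(self.cache) >= self.capacity:
                    # Evict the entry reused furthest in the future.
                    victim = max(self.cache, key=self._next_use)
                    self.cache.discard(victim)
                self.cache.add(emb_id)
        return batch
```

For example, with capacity 2 and batches `[[1, 2], [1, 3], [2, 3]]`, the lookahead policy evicts ID 1 (never used again) when ID 3 arrives, so the final batch hits entirely in cache, whereas a purely backward-looking policy such as LRU would evict an entry that is still needed.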