Paper Title

Understanding Capacity-Driven Scale-Out Neural Recommendation Inference

Paper Authors

Michael Lui, Yavuz Yetim, Özgür Özkan, Zhuoran Zhao, Shin-Yeh Tsai, Carole-Jean Wu, Mark Hempstead

Paper Abstract

Deep learning recommendation models have grown to the terabyte scale. Traditional serving schemes--that load entire models to a single server--are unable to support this scale. One approach to support this scale is with distributed serving, or distributed inference, which divides the memory requirements of a single large model across multiple servers. This work is a first step for the systems research community to develop novel model-serving solutions, given the huge system design space. Large-scale deep recommender systems are a novel workload and vital to study, as they consume up to 79% of all inference cycles in the data center. To that end, this work describes and characterizes scale-out deep learning recommendation inference using data-center serving infrastructure. This work specifically explores latency-bounded inference systems, compared to the throughput-oriented training systems of other recent works. We find that the latency and compute overheads of distributed inference are largely a result of a model's static embedding table distribution and sparsity of input inference requests. We further evaluate three embedding table mapping strategies of three DLRM-like models and specify challenging design trade-offs in terms of end-to-end latency, compute overhead, and resource efficiency. Overall, we observe only a marginal latency overhead when the data-center scale recommendation models are served in a distributed inference manner--P99 latency is increased by only 1% in the best case configuration. The latency overheads are largely a result of the commodity infrastructure used and the sparsity of embedding tables. Even more encouragingly, we also show how distributed inference can account for efficiency improvements in data-center scale recommendation serving.

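The abstract's central idea is that a terabyte-scale model's embedding tables are statically mapped across several inference servers so that no single machine has to hold the whole model. The sketch below is only an illustration of that idea, not one of the paper's three mapping strategies: it greedily packs hypothetical tables (made-up names and sizes in GB) onto servers with a fixed per-server memory budget, which is one simple capacity-driven placement policy.

```python
# Illustrative sketch (not the paper's actual strategies): a capacity-driven,
# greedy mapping of embedding tables to inference servers. Table names, sizes,
# and the server memory budget below are hypothetical.

def map_tables_to_servers(table_sizes_gb, server_capacity_gb):
    """Assign each embedding table to the server with the most free memory,
    opening a new server when no existing one can hold the table."""
    servers = []   # per-server list of assigned table names
    free_gb = []   # per-server remaining capacity
    # Place the largest tables first to reduce fragmentation.
    for name, size in sorted(table_sizes_gb.items(), key=lambda kv: -kv[1]):
        if size > server_capacity_gb:
            raise ValueError(f"table {name} exceeds a single server's capacity")
        # Pick the server with the most remaining space that still fits the table.
        candidates = [i for i, f in enumerate(free_gb) if f >= size]
        if candidates:
            i = max(candidates, key=lambda i: free_gb[i])
        else:
            servers.append([])
            free_gb.append(server_capacity_gb)
            i = len(servers) - 1
        servers[i].append(name)
        free_gb[i] -= size
    return servers

# Hypothetical model: a handful of embedding tables totaling ~1 TB.
tables = {"user_id": 400, "item_id": 300, "user_history": 150,
          "ad_id": 100, "context": 50}  # sizes in GB
print(map_tables_to_servers(tables, server_capacity_gb=512))
# -> [['user_id', 'ad_id'], ['item_id', 'user_history', 'context']]
```

A mapping like this is fixed before serving, which is why the abstract attributes much of the latency and compute overhead to the static table distribution together with the sparsity of each inference request: at runtime, a request only fans out to the servers that actually host the tables its sparse features touch.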