Paper Title

The trade-offs of model size in large recommendation models: A 10000$\times$ compressed criteo-tb DLRM model (100 GB parameters to mere 10MB)

Paper Authors

Aditya Desai, Anshumali Shrivastava

Paper Abstract

Embedding tables dominate industrial-scale recommendation model sizes, using up to terabytes of memory. A popular and the largest publicly available machine learning MLPerf benchmark on recommendation data is a Deep Learning Recommendation Model (DLRM) trained on a terabyte of click-through data. It contains 100GB of embedding memory (25+ billion parameters). DLRMs, due to their sheer size and the associated volume of data, face difficulty in training, deploying for inference, and memory bottlenecks caused by the large embedding tables. This paper analyzes and extensively evaluates a generic parameter sharing setup (PSS) for compressing DLRM models. We show theoretical upper bounds on the learnable memory required to achieve $(1 \pm \epsilon)$ approximations of the embedding table. Our bounds indicate that exponentially fewer parameters suffice for good accuracy. To this end, we demonstrate that a PSS DLRM reaches 10000$\times$ compression on criteo-tb without losing quality. Such a compression, however, comes with a caveat: it requires 4.5$\times$ more iterations to reach the same saturation quality. The paper argues that this trade-off needs more investigation, as it might be significantly favorable. Leveraging the small size of the compressed model, we show a 4.3$\times$ improvement in training latency, leading to similar overall training times. Thus, in the trade-off between the system advantages of a small DLRM model and its slower convergence, we show that the scales are tipped towards the smaller DLRM model, which yields faster inference, easier deployment, and similar training times.
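
To make the parameter-sharing idea concrete, below is a minimal PyTorch-style sketch of one common scheme (a hashing-trick embedding, where every coordinate of every embedding row is hashed into a small shared weight pool, so memory is independent of vocabulary size). The class name `SharedEmbedding` and its parameters (`n_shared`, `seed`) are illustrative assumptions; this is not the paper's exact PSS implementation, only a sketch of the general technique.

```python
# Minimal sketch of a parameter-shared embedding table (hashing-trick style).
# Illustration only: SharedEmbedding, n_shared, and seed are assumed names,
# not the authors' implementation.
import torch
import torch.nn as nn


class SharedEmbedding(nn.Module):
    """Maps a huge categorical vocabulary onto a small shared parameter pool.

    Instead of one row per category (vocab_size x dim parameters), every
    weight is drawn from a pool of n_shared scalars via a cheap universal
    hash, so learnable memory is O(n_shared) regardless of vocab_size.
    """

    def __init__(self, vocab_size: int, dim: int, n_shared: int, seed: int = 0):
        super().__init__()
        self.vocab_size = vocab_size  # informational only; memory does not depend on it
        self.dim = dim
        self.n_shared = n_shared
        self.pool = nn.Parameter(torch.randn(n_shared) * 0.01)  # shared weights
        g = torch.Generator().manual_seed(seed)
        # Fixed random hash parameters, one (a, b) pair per embedding dimension:
        # index = ((a * id + b) mod p) mod n_shared, with p a large prime.
        self.register_buffer("a", torch.randint(1, 2**31 - 1, (dim,), generator=g))
        self.register_buffer("b", torch.randint(0, 2**31 - 1, (dim,), generator=g))
        self.prime = 2**31 - 1

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # ids: (batch,) category indices in [0, vocab_size)
        x = ids.long().unsqueeze(-1)                               # (batch, 1)
        idx = (x * self.a + self.b) % self.prime % self.n_shared   # (batch, dim)
        return self.pool[idx]                                      # (batch, dim)


# Example: a vocabulary in the hundreds of millions served by a pool that is
# roughly 10000x smaller than a full (vocab_size * dim) table would be.
emb = SharedEmbedding(vocab_size=250_000_000, dim=128, n_shared=2_500_000)
vecs = emb(torch.tensor([3, 17, 123_456_789]))
print(vecs.shape)  # torch.Size([3, 128])
```

Because gradients only touch the small shared pool, the same sketch also hints at why training latency per iteration can drop sharply, even though (as the abstract notes) more iterations may be needed to reach the same quality.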
