Paper Title
Spatial Sharing of GPU for Autotuning DNN models
Paper Authors
Paper Abstract
GPUs are used for training, inference, and tuning of machine learning models. However, Deep Neural Networks (DNNs) vary widely in their ability to exploit the full power of high-performance GPUs. Spatial sharing of a GPU enables multiplexing several DNNs on the same GPU and can improve GPU utilization, thus improving throughput and lowering latency. A DNN model given just the right amount of GPU resources can still deliver low inference latency, matching what it achieves when the entire GPU is dedicated to its inference task. One approach to improving DNN inference is to tune the DNN model. Autotuning frameworks find the optimal low-level implementation of a trained machine learning model for a specific target device, thus reducing the DNN's inference latency and increasing its inference throughput. We observe an interdependency between the tuned model and its inference latency: a DNN model tuned with a specific amount of GPU resources provides the best inference latency when inference runs with close to that same amount of GPU resources. A model tuned with the maximum amount of the GPU's resources suffers poorer inference latency once GPU resources for inference are limited, whereas a model tuned with an appropriate amount of GPU resources still achieves good inference latency across a wide range of GPU resource availability. We explore the causes that affect the tuning of a model under different amounts of GPU resources. We present several techniques to maximize resource utilization and improve tuning performance. We enable controlled spatial sharing of the GPU to multiplex several tuning applications on the GPU. We scale the tuning server instances and shard the tuning model across multiple client instances for concurrent tuning of different operators of a model, achieving better GPU multiplexing. With our improvements, we decrease DNN autotuning time by up to 75 percent and increase throughput by a factor of 5.
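The combination of controlled spatial sharing and sharding of tuning work can be pictured with a minimal sketch. It assumes (the abstract does not name the mechanism) that spatial sharing is enforced through NVIDIA MPS by capping each client's SM share via the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE environment variable, and that a hypothetical tune_client.py script runs the autotuner over one shard of the model's operators listed in a task file:

    # Minimal sketch, assuming NVIDIA MPS is running and tune_client.py /
    # the tasks_*.json files exist; both are hypothetical placeholders.
    import os
    import subprocess

    # Each shard: a subset of the model's operators (tuning tasks) plus the
    # percentage of GPU SMs that client may use under MPS.
    shards = [
        {"tasks": "tasks_0.json", "sm_pct": 40},
        {"tasks": "tasks_1.json", "sm_pct": 40},
        {"tasks": "tasks_2.json", "sm_pct": 20},
    ]

    procs = []
    for shard in shards:
        env = os.environ.copy()
        # Cap this client's share of GPU SMs; MPS enforces the limit, so the
        # tuning clients can run concurrently on a single GPU.
        env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(shard["sm_pct"])
        procs.append(subprocess.Popen(
            ["python", "tune_client.py", "--tasks", shard["tasks"]],
            env=env,
        ))

    # Wait for all tuning clients to finish their shards.
    for p in procs:
        p.wait()

The SM percentages and the three-way split are illustrative only; the point is that each tuning client occupies a bounded slice of the GPU, so several clients multiplex onto one device instead of serializing.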