Paper Title

BB-ML: Basic Block Performance Prediction using Machine Learning Techniques

Paper Authors

Hamdy Abdelkhalik, Shamminuj Aktar, Yehia Arafa, Atanu Barai, Gopinath Chennupati, Nandakishore Santhi, Nishant Panda, Nirmal Prajapati, Nazmul Haque Turja, Stephan Eidenbenz, Abdel-Hameed Badawy

Paper Abstract

Recent years have seen the adoption of Machine Learning (ML) techniques to predict the performance of large-scale applications, mostly at a coarse level. In contrast, we propose to use ML techniques for performance prediction at a much finer granularity, namely at the Basic Block (BB) level. Basic blocks are single-entry, single-exit code blocks that compilers use to break a large program into manageable pieces for analysis. We extrapolate the basic block execution counts of GPU applications and use them to predict the performance at large input sizes from the counts observed at smaller input sizes. We train a Poisson Neural Network (PNN) model using random input values, as well as the lowest input values of the application, to learn the relationship between inputs and basic block counts. Experimental results show that the model can accurately predict the basic block execution counts of 16 GPU benchmarks. We achieve an accuracy of 93.5% in extrapolating the basic block counts for large input sets when trained on smaller input sets, and an accuracy of 97.7% in predicting basic block counts on random instances. In a case study, we apply the ML model to CUDA GPU benchmarks for performance prediction across a spectrum of applications. We use a variety of metrics for evaluation, including global memory requests and the active cycles of tensor cores, ALU, and FMA units. Results demonstrate the model's capability to predict the performance of large datasets with average error rates of 0.85% and 0.17% for global and shared memory requests, respectively. Additionally, to assess the utilization of the main functional units in Ampere-architecture GPUs, we calculate the active cycles for tensor cores, ALU, FMA, and FP64 units and achieve average errors of 2.3% and 10.66% for the ALU and FMA units, respectively, while the maximum observed error across all tested applications and units reaches 18.5%.
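The core modeling idea is to fit a neural network that maps an application's input parameters to a Poisson rate for each basic block's execution count. The sketch below illustrates this idea, assuming a PyTorch implementation; the layer sizes, two-dimensional feature vector, and synthetic training data are assumptions for illustration, not the authors' code.

# Minimal sketch of a Poisson Neural Network (PNN) for basic block (BB)
# count prediction. Architecture, feature dimension, and synthetic data
# are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class PoissonNN(nn.Module):
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        # The network outputs log(lambda): the log of the Poisson rate,
        # i.e., the expected execution count of one basic block.
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x)  # log-rate; exponentiate to recover the count

# Synthetic stand-in for profiled data: application input parameters
# (e.g., problem sizes) paired with observed BB execution counts.
params = torch.rand(256, 2) * 100.0
counts = torch.poisson(params.sum(dim=1, keepdim=True))

model = PoissonNN(n_features=2)
loss_fn = nn.PoissonNLLLoss(log_input=True)   # expects log(lambda) as input
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(params), counts)
    loss.backward()
    optimizer.step()

# Extrapolation: query the trained model at a larger, unseen input size.
with torch.no_grad():
    predicted_count = model(torch.tensor([[500.0, 500.0]])).exp()

The Poisson likelihood is a natural fit here because execution counts are non-negative integers whose variance tends to grow with their mean, which an ordinary squared-error regression does not capture.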
