Paper Title

Feature Selection for Learning to Predict Outcomes of Compute Cluster Jobs with Application to Decision Support

Authors

Adedolapo Okanlawon, Huichen Yang, Avishek Bose, William Hsu, Dan Andresen, Mohammed Tanash

Abstract

We present a machine learning framework and a new test bed for data mining from the Slurm Workload Manager for high-performance computing (HPC) clusters. The focus was to find a method for selecting features to support decisions: helping users decide whether to resubmit failed jobs with boosted CPU and memory allocations or migrate them to a computing cloud. This task was cast as both supervised classification and regression learning, and specifically as a sequential problem-solving task suitable for reinforcement learning. Selecting relevant features can improve training accuracy, reduce training time, and produce a more comprehensible model, with an intelligent system that can explain predictions and inferences. We present a supervised learning model trained on a Simple Linux Utility for Resource Management (Slurm) data set of HPC jobs using three different techniques for selecting features: linear regression, lasso, and ridge regression. Our data set represented both HPC jobs that failed and those that succeeded, so our model was reliable, less likely to overfit, and generalizable. Our model achieved an R^2 of 95% with 99% accuracy. We identified five predictors for both CPU and memory properties.
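
To make the idea of regression-based feature selection concrete, the following is a minimal sketch of lasso selection only (one of the three techniques named in the abstract), written in plain Python with cyclic coordinate descent. It is not the authors' pipeline: the data are synthetic, the feature names (`req_cpus`, `req_mem`, `timelimit`, `queue_wait`, `noise`) are hypothetical stand-ins for job attributes, and the penalty `lam=0.1` is an arbitrary assumption.

```python
import random


def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iters=50):
    """Minimize (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iters):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, lam) / (z / n)
    return w


# Synthetic data (an assumption for illustration, not Slurm accounting data):
# only the first two hypothetical features influence the target.
random.seed(0)
feature_names = ["req_cpus", "req_mem", "timelimit", "queue_wait", "noise"]
n_jobs = 200
X = [[random.gauss(0.0, 1.0) for _ in feature_names] for _ in range(n_jobs)]
y = [3.0 * row[0] - 2.0 * row[1] + random.gauss(0.0, 0.1) for row in X]

weights = lasso_coordinate_descent(X, y, lam=0.1)

# Features whose weight survives the L1 penalty are the "selected" ones.
selected = [name for name, w in zip(feature_names, weights) if w != 0.0]
print(selected)
```

The L1 penalty sets the weights of uninformative features to exactly zero, which is what makes lasso usable as a selector. Ridge regression only shrinks weights without zeroing them, so ranking features with ridge or ordinary linear regression would instead rely on coefficient magnitudes.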
