Title
A Predictive Autoscaler for Elastic Batch Jobs
Authors
Abstract
Large batch jobs such as Deep Learning, HPC, and Spark require far more computational resources and incur higher costs than conventional online services. Like other time series data, these workloads exhibit a variety of characteristics such as trend, burst, and seasonality. Cloud providers offer short-term instances to achieve scalability, stability, and cost-efficiency. Given the time lag caused by joining the cluster and initialization, crowded workloads may lead to violations in the scheduling system. Based on the assumption that infinite resources and ideal placements are available for users to request in the cloud environment, we propose a predictive autoscaler that provides an elastic interface for customers and overprovisions instances based on a trained regression model. We contribute a method to embed heterogeneous resource requirements from continuous space into discrete resource buckets, and an autoscaler that makes predictive scaling plans on the time series of resource bucket counts. Our experimental evaluation on production resource usage data validates the solution; the results show that the predictive autoscaler relieves the burden of making scaling plans, avoids long launch times at lower cost, and outperforms other prediction methods with fine-tuned settings.
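To make the two contributions concrete, the following is a minimal illustrative sketch (not the paper's actual algorithm): heterogeneous (CPU, memory) requests are mapped into discrete resource buckets by rounding each dimension up to a power of two, and the per-bucket count series is forecast with a naive moving average standing in for the trained regression model. The bucketing rule, window size, and function names are all assumptions for illustration.

```python
from collections import Counter
import math

def to_bucket(cpu_cores: float, mem_gb: float) -> tuple:
    """Map a continuous (CPU, memory) request onto a discrete bucket
    by rounding each dimension up to the next power of two.
    (Hypothetical bucketing rule; the paper's embedding may differ.)"""
    return (2 ** math.ceil(math.log2(max(cpu_cores, 1))),
            2 ** math.ceil(math.log2(max(mem_gb, 1))))

def bucket_counts(requests):
    """Count how many job requests fall into each resource bucket."""
    return Counter(to_bucket(c, m) for c, m in requests)

def forecast_next(series, window=3):
    """Naive moving-average forecast of the next bucket count;
    a stand-in for the trained regression model described above."""
    tail = series[-window:]
    return sum(tail) / len(tail)

# Example: (3 cores, 10 GB) and (4 cores, 16 GB) land in the same bucket.
requests = [(3, 10), (1, 2), (4, 16), (2, 6)]
counts = bucket_counts(requests)
print(counts[(4, 16)])          # 2

# Forecast how many instances of that bucket to overprovision next step.
history = [5, 7, 9, 11]
print(forecast_next(history))   # 9.0
```

Overprovisioning per bucket count, rather than per raw request, is what lets the autoscaler treat heterogeneous demand as a small set of univariate time series.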