SOL: Safe On-Node Learning in Cloud Platforms

Paper Title

SOL: Safe On-Node Learning in Cloud Platforms

Authors

Yawen Wang, Daniel Crankshaw, Neeraja J. Yadwadkar, Daniel Berger, Christos Kozyrakis, Ricardo Bianchini

Abstract

Cloud platforms run many software agents on each server node. These agents manage all aspects of node operation, and in some cases frequently collect data and make decisions. Unfortunately, their behavior is typically based on pre-defined static heuristics or offline analysis; they do not leverage on-node machine learning (ML). In this paper, we first characterize the spectrum of node agents in Azure, and identify the classes of agents that are most likely to benefit from on-node ML. We then propose SOL, an extensible framework for designing ML-based agents that are safe and robust to the range of failure conditions that occur in production. SOL provides a simple API to agent developers and manages the scheduling and running of the agent-specific functions they write. We illustrate the use of SOL by implementing three ML-based agents that manage CPU cores, node power, and memory placement. Our experiments show that (1) ML substantially improves our agents, and (2) SOL ensures that agents operate safely under a variety of failure conditions. We conclude that ML-based agents show significant potential and that SOL can help build them.
