Paper Title

Scaling-up Distributed Processing of Data Streams for Machine Learning

Authors

Matthew Nokleby, Haroon Raja, Waheed U. Bajwa

Abstract

Emerging applications of machine learning in numerous areas involve continuous gathering of and learning from streams of data. Real-time incorporation of streaming data into the learned models is essential for improved inference in these applications. Further, these applications often involve data that are either inherently gathered at geographically distributed entities or that are intentionally distributed across multiple machines for memory, computational, and/or privacy reasons. Training of models in this distributed, streaming setting requires solving stochastic optimization problems in a collaborative manner over communication links between the physical entities. When the streaming data rate is high compared to the processing capabilities of compute nodes and/or the rate of the communications links, this poses a challenging question: how can one best leverage the incoming data for distributed training under constraints on computing capabilities and/or communications rate? A large body of research has emerged in recent decades to tackle this and related problems. This paper reviews recently developed methods that focus on large-scale distributed stochastic optimization in the compute- and bandwidth-limited regime, with an emphasis on convergence analysis that explicitly accounts for the mismatch between computation, communication and streaming rates. In particular, it focuses on methods that solve: (i) distributed stochastic convex problems, and (ii) distributed principal component analysis, which is a nonconvex problem with geometric structure that permits global convergence. For such methods, the paper discusses recent advances in terms of distributed algorithmic designs when faced with high-rate streaming data. Further, it reviews guarantees underlying these methods, which show there exist regimes in which systems can learn from distributed, streaming data at order-optimal rates.
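To make the problem setup concrete, the following is a minimal illustrative sketch (not any specific algorithm from the paper) of the distributed, streaming regime the abstract describes: several simulated nodes each receive a stream of samples for a convex least-squares problem, take a local stochastic gradient step, and then collaborate over communication links by averaging their iterates with neighbors through a doubly stochastic mixing matrix. The problem dimensions, ring topology, and step-size schedule are all assumptions chosen for the demonstration.

```python
import numpy as np

# Illustrative sketch only: decentralized streaming SGD on a convex
# least-squares problem. This is a generic consensus+SGD scheme, not the
# paper's specific methods or analysis.

rng = np.random.default_rng(0)
d, n_nodes, n_rounds = 5, 4, 2000
w_true = rng.normal(size=d)              # ground-truth model to be learned

# Ring-topology mixing matrix (doubly stochastic): each node averages its
# own iterate with its two neighbors; this models the communication links.
W = np.zeros((n_nodes, n_nodes))
for i in range(n_nodes):
    W[i, i] = 0.5
    W[i, (i - 1) % n_nodes] = 0.25
    W[i, (i + 1) % n_nodes] = 0.25

x = np.zeros((n_nodes, d))               # one local iterate per node
for t in range(1, n_rounds + 1):
    step = 1.0 / t                       # diminishing step size
    for i in range(n_nodes):
        # Each node observes one fresh streaming sample (a, b).
        a = rng.normal(size=d)
        b = a @ w_true + 0.1 * rng.normal()
        grad = (x[i] @ a - b) * a        # stochastic gradient of 0.5*(a@x - b)^2
        x[i] = x[i] - step * grad
    x = W @ x                            # one consensus (gossip) round

err = np.linalg.norm(x.mean(axis=0) - w_true)
print(f"error of network-average iterate: {err:.3f}")
```

In the compute- and bandwidth-limited regime the abstract emphasizes, a node may receive several samples per consensus round (or per gradient step); the reviewed methods analyze how to batch or discard data, and how often to communicate, so that convergence remains order-optimal despite that mismatch.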
