Paper title
Sketch and Scale: Geo-distributed tSNE and UMAP
Paper authors
Paper abstract
Running machine learning analytics over geographically distributed datasets is a rapidly emerging problem in a world where data management policies must ensure privacy and data security. Visualizing high-dimensional data with tools such as t-distributed Stochastic Neighbor Embedding (tSNE) and Uniform Manifold Approximation and Projection (UMAP) has become common practice for data scientists. Both tools scale poorly in time and memory. While recent optimizations have successfully handled 10,000 data points, scaling beyond a million points remains challenging. We introduce a novel framework: Sketch and Scale (SnS). It leverages a Count Sketch data structure to compress the data on the edge nodes, aggregates the reduced-size sketches on the master node, and runs vanilla tSNE or UMAP on a summary that represents the densest areas extracted from the aggregated sketch. We show that this technique is fully parallel and scales linearly in time and logarithmically in memory and communication, making it possible to analyze datasets with many millions, potentially billions, of data points spread across several data centers around the globe. We demonstrate the power of our method on two mid-size datasets: cancer data with 52 million 35-band pixels from multiple images of tumor biopsies, and astrophysics data of 100 million stars with multi-color photometry from the Sloan Digital Sky Survey (SDSS).
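The compress, aggregate, and extract steps described in the abstract can be illustrated with a minimal Count Sketch in Python. This is a sketch under stated assumptions, not the authors' implementation: the class name `CountSketch`, its parameters (`depth`, `width`, `seed`), the simple linear hash family, and the idea that each data point has already been quantized to an integer cell id are all hypothetical choices made for illustration. What it does show is the property the framework relies on: sketches built independently on edge nodes with shared hash functions can be summed on the master node, and frequent (dense) cells can then be estimated from the merged table.

```python
import numpy as np

class CountSketch:
    """Illustrative Count Sketch over non-negative integer keys below 2**31."""

    def __init__(self, depth=5, width=1024, seed=0):
        # All nodes must use the same seed so their hash functions agree.
        rng = np.random.default_rng(seed)
        self.p = 2**31 - 1  # Mersenne prime used by the linear hashes
        self.a = rng.integers(1, self.p, size=(depth, 1))
        self.b = rng.integers(0, self.p, size=(depth, 1))
        self.c = rng.integers(1, self.p, size=(depth, 1))
        self.d = rng.integers(0, self.p, size=(depth, 1))
        self.width = width
        self.table = np.zeros((depth, width), dtype=np.int64)

    def _hash(self, keys):
        keys = np.asarray(keys, dtype=np.int64).reshape(1, -1)
        # One bucket index and one +/-1 sign per (row, key) pair.
        buckets = ((self.a * keys + self.b) % self.p) % self.width
        signs = 1 - 2 * (((self.c * keys + self.d) % self.p) % 2)
        return buckets, signs

    def update(self, keys):
        """Add one occurrence of every key (run on an edge node)."""
        buckets, signs = self._hash(keys)
        for r in range(self.table.shape[0]):
            # np.add.at accumulates correctly when bucket indices repeat.
            np.add.at(self.table[r], buckets[r], signs[r])

    def merge(self, other):
        """Sum another node's sketch into this one (run on the master)."""
        self.table += other.table

    def estimate(self, keys):
        """Median-of-rows estimate of each key's total count."""
        buckets, signs = self._hash(keys)
        rows = np.arange(self.table.shape[0])[:, None]
        return np.median(signs * self.table[rows, buckets], axis=0)


# Two hypothetical edge nodes holding quantized cell ids.
rng = np.random.default_rng(42)
node_a = np.concatenate([np.full(5000, 7), rng.integers(0, 10**6, 1000)])
node_b = np.concatenate([np.full(3000, 7), np.full(4000, 99),
                         rng.integers(0, 10**6, 1000)])

sk_a, sk_b = CountSketch(seed=1), CountSketch(seed=1)
sk_a.update(node_a)  # compression happens locally on each node
sk_b.update(node_b)
sk_a.merge(sk_b)     # only the fixed-size tables are communicated

# Approximate counts for two dense cells and one sparse cell.
est = sk_a.estimate([7, 99, 123456])
print(est)
```

In this toy setup the cells with the largest estimated counts would form the summary of densest areas. That summary could then be handed to an off-the-shelf embedding routine such as `sklearn.manifold.TSNE` or the `umap-learn` package. How the paper enumerates candidate cells and sizes the sketch is not stated in the abstract, so those details are left open here.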