Paper Title

Scalable Hybrid Learning Techniques for Scientific Data Compression

Authors

Banerjee, Tania, Choi, Jong, Lee, Jaemoon, Gong, Qian, Chen, Jieyang, Klasky, Scott, Rangarajan, Anand, Ranka, Sanjay

Abstract

Data compression is becoming critical for storing scientific data because many scientific applications need to store large amounts of data and post-process this data for scientific discovery. Unlike image and video compression algorithms that limit errors to primary data, scientists require compression techniques that accurately preserve derived quantities of interest (QoIs). This paper presents a physics-informed compression technique implemented as an end-to-end, scalable, GPU-based pipeline for data compression that addresses this requirement. Our hybrid compression technique combines machine learning techniques and standard compression methods. Specifically, we combine an autoencoder, an error-bounded lossy compressor to provide guarantees on raw data error, and a constraint satisfaction post-processing step to preserve the QoIs within a minimal error (generally less than floating-point error). The effectiveness of the data compression pipeline is demonstrated by compressing nuclear fusion simulation data generated by a large-scale fusion code, XGC, which produces hundreds of terabytes of data in a single day. Our approach works within the ADIOS framework and results in compression by a factor of more than 150 while requiring only a few percent of the computational resources necessary for generating the data, making the overall approach highly effective for practical scenarios.
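The three-stage hybrid scheme the abstract describes (learned lossy stage, error-bounded residual stage, QoI constraint satisfaction) can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: a rank-4 PCA projection stands in for the autoencoder, uniform residual quantization stands in for the error-bounded compressor, and the assumed QoI is each row's sum, restored by a uniform per-row shift. All names and parameters here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 16))  # toy stand-in for simulation data

# 1) Learned lossy stage: a rank-4 PCA projection stands in for the autoencoder.
mean = x.mean(axis=0)
_, _, vt = np.linalg.svd(x - mean, full_matrices=False)
basis = vt[:4]
x_hat = ((x - mean) @ basis.T) @ basis + mean

# 2) Error-bounded stage: quantize the residual to multiples of 2*eps so every
#    element of the reconstruction is within eps of the original. The integer
#    codes q are what a real pipeline would entropy-code.
eps = 1e-2
residual = x - x_hat
q = np.round(residual / (2 * eps)).astype(np.int32)
x_rec = x_hat + q * (2 * eps)

# 3) Constraint satisfaction: restore a derived quantity of interest (here,
#    each row's sum) exactly, via a uniform per-row shift. The shift is itself
#    bounded by eps, so the pointwise error stays within 2*eps.
qoi_gap = (x.sum(axis=1) - x_rec.sum(axis=1)) / x.shape[1]
x_final = x_rec + qoi_gap[:, None]
```

The key property illustrated: after stage 2 the pointwise error is provably at most `eps`, and after stage 3 the QoI matches the original to floating-point precision, mirroring the guarantees claimed in the abstract.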
