Paper Title
Studying Scientific Data Lifecycle in On-demand Distributed Storage Caches
Paper Authors
Paper Abstract
The XRootD system is used to transfer, store, and cache large datasets in high-energy physics (HEP). In this study, we focus on its capability as a distributed on-demand storage cache. By exploring a large set of daily log files collected between 2020 and 2021, we seek to understand the data access patterns that might inform future cache design. Our study begins with a set of summary statistics on file read operations, file lifetimes, and file transfers. We observe that the number of read operations per file remains nearly constant, while the average size of a read operation grows over time. Furthermore, files tend to remain open and in use for a consistent length of time. Based on this comprehensive study of the cache access statistics, we developed a cache simulator to explore the behavior of caches of different sizes. Within a certain size range, we find that increasing the XRootD cache size improves the cache hit rate, yielding faster overall file access. In particular, we find that increasing the cache size from 40 TB to 56 TB could raise the hit rate from 0.62 to 0.89, a significant gain in cache effectiveness at modest cost.
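To illustrate how a trace-driven cache simulator can relate cache size to hit rate, the following is a minimal sketch, not the simulator developed in the paper. The abstract does not state the eviction policy, so least-recently-used (LRU) eviction is assumed here. The function name simulate_lru, the (file_id, size) trace format, and the toy trace are all hypothetical.

```python
from collections import OrderedDict


def simulate_lru(accesses, capacity_bytes):
    """Replay (file_id, size_bytes) accesses through an LRU cache.

    Returns the fraction of accesses served from the cache.
    LRU eviction is an assumption made for this sketch.
    """
    cache = OrderedDict()  # file_id -> size_bytes, least recently used first
    used = 0               # bytes currently held in the cache
    hits = 0
    total = 0

    for file_id, size in accesses:
        total += 1
        if file_id in cache:
            hits += 1
            cache.move_to_end(file_id)  # mark as most recently used
            continue
        if size > capacity_bytes:
            continue  # file can never fit; count the miss, do not cache it
        # Evict least recently used files until the new file fits.
        while used + size > capacity_bytes:
            _, evicted_size = cache.popitem(last=False)
            used -= evicted_size
        cache[file_id] = size
        used += size

    return hits / total if total else 0.0


# Hypothetical toy trace: three unit-size files read in a repeating cycle.
trace = [("a", 1), ("b", 1), ("c", 1), ("a", 1), ("b", 1), ("c", 1)]

print(simulate_lru(trace, 2))  # 0.0: each file is evicted before its reuse
print(simulate_lru(trace, 3))  # 0.5: the whole working set fits
```

In this toy trace the hit rate jumps once the cache holds the full working set, which is the same kind of size sensitivity the abstract reports for the 40 TB to 56 TB range. A realistic study would replay the actual access logs and could compare other eviction policies.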