Paper title
ReStore: In-Memory REplicated STORagE for Rapid Recovery in Fault-Tolerant Algorithms
Paper authors
Paper abstract
Fault-tolerant distributed applications require mechanisms to recover data lost via a process failure. On modern cluster systems it is typically impractical to request replacement resources after such a failure. Therefore, applications have to continue working with the remaining resources. This requires redistributing the workload and having the non-failed processes reload the lost data. We present an algorithmic framework and its C++ library implementation ReStore for MPI programs that enables recovery of data after process failures. By storing all required data in memory via an appropriate data distribution and replication, recovery is substantially faster than with standard checkpointing schemes that rely on a parallel file system. As the application developer can specify which data to load, we also support shrinking recovery instead of recovery using spare compute nodes. We evaluate ReStore in both controlled, isolated environments and real applications. Our experiments show loading times of lost input data in the range of milliseconds on up to 24 576 processors and a substantial speedup of the recovery time for the fault-tolerant version of a widely used bioinformatics application.