Soft Errors Detection and Automatic Recovery based on Replication combined with different Levels of Checkpointing

Paper title

Soft Errors Detection and Automatic Recovery based on Replication combined with different Levels of Checkpointing

Authors

Diego Montezanti, Enzo Rucci, Armando De Giusti, Marcelo Naiouf, Dolores Rexachs, Emilio Luque

Abstract

Handling faults is a growing concern in HPC. In future exascale systems, it is projected that silent undetected errors will occur several times a day, increasing the occurrence of corrupted results. In this article, we propose SEDAR, which is a methodology that improves system reliability against transient faults when running parallel message-passing applications. Our approach, based on process replication for detection, combined with different levels of checkpointing for automatic recovery, has the goal of helping users of scientific applications to obtain executions with correct results. SEDAR is structured in three levels: (1) only detection and safe-stop with notification; (2) recovery based on multiple system-level checkpoints; and (3) recovery based on a single valid user-level checkpoint. As each of these variants supplies a particular coverage but involves limitations and implementation costs, SEDAR can be adapted to the needs of the system. In this work, a description of the methodology is presented and the temporal behavior of employing each SEDAR strategy is mathematically described, both in the absence and presence of faults. A model that considers all the fault scenarios on a test application is introduced to show the validity of the detection and recovery mechanisms. An overhead evaluation of each variant is performed with applications involving different communication patterns; this is also used to extract guidelines about when it is beneficial to employ each SEDAR protection level. As a result, we show its efficacy and viability to tolerate transient faults in target HPC environments.
