Section:
New Results
A backward/forward recovery approach for the preconditioned conjugate gradient method
Participants:
Massimiliano Fasi [Univ. Manchester, UK], Julien Langou [UC Denver, USA], Yves Robert, Bora Uçar.
Several recent papers have introduced a periodic verification
mechanism to detect silent errors in iterative solvers. Chen
[PPoPP’13, pp. 167-176] has shown how to combine such a verification
mechanism (a stability test checking the orthogonality of two vectors
and recomputing the residual) with checkpointing: the idea is to
verify every d iterations, and to checkpoint every c × d
iterations. When a silent error is detected by the verification
mechanism, one can roll back to, and re-execute from, the last
checkpoint. In this work, we also propose to combine checkpointing and
verification, but we use algorithm-based fault tolerance (ABFT) rather
than stability tests. ABFT can be used for error detection alone, or
for both detection and correction; the latter allows forward recovery
(with neither rollback nor re-execution) when a single error is detected. We
introduce an abstract performance model to compute the performance of
all schemes, and we instantiate it using the preconditioned conjugate
gradient algorithm. Finally, we validate our new approach through a
set of simulations.
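The backward-recovery baseline described above (periodic verification combined with checkpointing and rollback) can be illustrated with a minimal sketch. It is not the implementation used in this work: it runs plain, unpreconditioned conjugate gradient on a small dense matrix, the verification only recomputes the residual (the orthogonality check is omitted), and the names `cg_with_rollback`, `d`, `c`, `inject_at` and the detection threshold are assumptions made for the example. The ABFT-based forward recovery proposed here is not shown.

```python
def matvec(A, x):
    """Dense matrix-vector product A @ x on plain Python lists."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def axpy(alpha, x, y):
    """Return alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def cg_with_rollback(A, b, d=2, c=2, tol=1e-10, max_iter=200, inject_at=None):
    """Conjugate gradient that verifies every d iterations and
    checkpoints every c * d iterations (illustrative parameters).

    inject_at (hypothetical) is the iteration index at which one silent
    error is simulated by corrupting the matrix-vector product.
    Returns (x, iteration count, number of rollbacks).
    """
    x = [0.0] * len(b)
    r = b[:]                      # residual for the initial guess x = 0
    p = r[:]
    rs = dot(r, r)
    norm_b = dot(b, b) ** 0.5
    checkpoint = (0, x[:], r[:], p[:], rs)
    rollbacks = 0
    it = 0
    while it < max_iter and rs ** 0.5 > tol:
        q = matvec(A, p)
        if inject_at is not None and it == inject_at:
            q[0] += 1.0           # simulated silent error, strikes once
            inject_at = None
        alpha = rs / dot(p, q)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, q, r)
        rs_new = dot(r, r)
        p = axpy(rs_new / rs, p, r)
        rs = rs_new
        it += 1
        if it % d == 0:
            # Verification: compare the recurrence residual r with the
            # recomputed residual b - A x.
            true_r = axpy(-1.0, matvec(A, x), b)
            diff = axpy(-1.0, r, true_r)
            if dot(diff, diff) ** 0.5 > 1e-8 * norm_b:
                # Error detected: roll back to the last checkpoint.
                it, x, r, p, rs = (checkpoint[0], checkpoint[1][:],
                                   checkpoint[2][:], checkpoint[3][:],
                                   checkpoint[4])
                rollbacks += 1
                continue
            if it % (c * d) == 0:
                # State has just been verified, so it is safe to save.
                checkpoint = (it, x[:], r[:], p[:], rs)
    return x, it, rollbacks
```

In this sketch a checkpoint is taken only immediately after a successful verification, so the saved state is never known to be corrupted. An error striking after the last verification would go unnoticed; a final check before returning would close that gap.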
This work has been accepted for publication in the Journal of Computational Science [13].