Application-Based Fault Tolerance for Numerical Linear Algebra at Large Scale

Title: Application-Based Fault Tolerance for Numerical Linear Algebra at Large Scale

Authors: Camille Coti, Laure Petrucci and Daniel Alberto Torres Gonzalez

Conference: Euro-Par 2021

Tags: fault tolerance, high performance computing, linear algebra, LU and matrix factorizations

Abstract:

Large-scale architectures provide high computing power, but as systems grow, their computation units become more likely to fail. Fault-tolerance mechanisms have emerged in parallel computing to deal with the errors that may occur at any moment during the execution of parallel programs. Algorithms used by fault-tolerant programs must scale and be resilient to software and hardware failures. Recent parallel algorithms have properties that can be exploited to make them fault-tolerant. In my thesis, I design, implement and evaluate parallel and distributed fault-tolerant numerical computation kernels for dense linear algebra. I take advantage of intrinsic algebraic and algorithmic properties of communication-avoiding algorithms to make them fault-tolerant. I focus on dense matrix factorization kernels: I have results on LU and preliminary results on QR. Using performance evaluation and formal methods, I am showing that these kernels can tolerate crash-type failures, either by re-spawning new processes on the fly or by ignoring the error.