Exploring the Impact of Node Failures on the Resource Allocation for Parallel Jobs

Title: Exploring the Impact of Node Failures on the Resource Allocation for Parallel Jobs

Authors: Ioannis Vardas, Manolis Ploumidis and Manolis Marazakis

Conference: Euro-Par 2021

Tags: failure aware resource allocation, Fault-aware resource allocation, high performance computing, Impact of node failures on MPI parallel jobs, Synthetic node failure trace generation and Weibull distribution

Abstract:

Increasing the size and complexity of modern HPC systems also increases the probability of various types of failures. Failures may disrupt application execution and waste valuable system resources due to failed executions. In this work, we explore the effect of node failures on the completion times of MPI parallel jobs. We introduce a simulation environment that generates synthetic traces of node failures, assuming that the times between failures for each node are independently distributed, with each node following the same distribution but with different parameters. We also present a resource allocation approach that considers node failure probabilities for various system partitions before assigning resources to a job. We compare the proposed approach with Slurm's resource allocation and with a failure-oblivious heuristic that randomly selects the partition for a job. We present results for a case study that assumes a 4D-torus topology and a Weibull distribution for each node's time between failures. This case study considers several different traces of node failures, capturing different failure patterns. Our results show little benefit for jobs of relatively short duration. For longer jobs, though, the decrease in the time needed to complete a batch of identical jobs is significant when compared with Slurm or the failure-oblivious heuristic: up to 82%, depending on the parameters of the simulated trace.
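The two ideas in the abstract, synthetic failure traces and failure-aware partition selection, can be illustrated with a minimal Python sketch. This is not the authors' simulator: the function names, the per-node `(scale, shape)` parameter layout, and the selection rule are all assumptions made for illustration. The sketch also assumes instantaneous repair and nodes that start at age zero, neither of which the abstract states.

```python
import math
import random


def generate_failure_trace(node_params, horizon, seed=0):
    """Return a time-sorted list of (time, node_id) failure events.

    node_params maps node_id -> (scale, shape) of that node's Weibull
    time-between-failures distribution (hypothetical layout).
    Assumption: a failed node is repaired instantly, so failure times
    are the running sum of independent Weibull draws.
    """
    rng = random.Random(seed)
    events = []
    for node_id, (scale, shape) in node_params.items():
        t = 0.0
        while True:
            # random.weibullvariate(alpha, beta): alpha is scale, beta is shape
            t += rng.weibullvariate(scale, shape)
            if t > horizon:
                break
            events.append((t, node_id))
    events.sort()
    return events


def partition_survival_probability(nodes, node_params, duration):
    """Probability that no node in `nodes` fails within `duration`.

    Assumption: nodes fail independently and each starts at age zero,
    so the Weibull survival function exp(-(t/scale)**shape) applies
    per node and the per-node probabilities multiply.
    """
    log_p = 0.0
    for node_id in nodes:
        scale, shape = node_params[node_id]
        log_p -= (duration / scale) ** shape
    return math.exp(log_p)


def pick_partition(partitions, node_params, duration):
    """Pick the partition name with the highest survival probability
    for a job of the given duration (an illustrative selection rule)."""
    return max(
        partitions,
        key=lambda name: partition_survival_probability(
            partitions[name], node_params, duration
        ),
    )
```

Under these assumptions, the survival probability of every partition approaches 1 as the job duration shrinks, so the choice of partition matters little for short jobs. This is consistent with the abstract's observation that the benefit appears mainly for longer jobs.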