ILU smoothers can be effective in the algebraic multigrid $V$-cycle. However, direct triangular solves are comparatively slow on GPUs. Previous work by Chow and Patel (2015) and Anzt et al. (2015) proposed an iterative approach to solving these systems. Unfortunately, when the Jacobi iteration is applied to highly non-normal upper or lower triangular factors, the iterations will diverge. We introduce an ILU smoother for classical Ruge-Stüben C-AMG that applies row and/or column scaling to mitigate the non-normality of the upper triangular factor. This approach facilitates the use of Jacobi iteration in place of the inherently sequential triangular solve. Because the scaling is applied to the upper triangular factor rather than to the global matrix, it can be done locally on an MPI rank for a diagonal block of the global matrix. Numerical results and parallel performance are presented for the Nalu-Wind pressure continuity and PeleLM nodal projection pressure solvers. An ILUT Schur complement smoother, based on Xu et al. (2021), which uses GMRES to iteratively solve the Schur system along subdomain (MPI rank) boundaries, is also introduced to maintain a constant iteration count and improve strong scaling. For large problem sizes, GMRES+AMG with iterative triangular solves executes at least five times faster on the ORNL Summit supercomputer than the same solver with direct triangular solves.
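The idea of replacing a sequential triangular solve with Jacobi sweeps can be illustrated with a minimal NumPy sketch. It is not the implementation described here: the function names `jacobi_upper_solve` and `commutator_norm`, the dense storage, and the small test matrix are all assumptions chosen for illustration. Writing $U = D + F$ with $D$ the diagonal and $F$ the strictly upper part, each sweep computes $x_{k+1} = D^{-1}(b - F x_k)$, which is a matrix-vector product and therefore parallelizes well. The second function computes $\|UU^T - U^TU\|_F$, one simple indicator of non-normality (zero exactly when a real matrix is normal).

```python
import numpy as np


def jacobi_upper_solve(U, b, num_iters):
    """Approximate the solution of U x = b using Jacobi sweeps.

    U is assumed to be upper triangular with a nonzero diagonal.
    Each sweep needs only a matrix-vector product with the strictly
    upper part of U, so no sequential back-substitution is required.
    """
    d = np.diag(U)           # diagonal entries of U
    F = np.triu(U, k=1)      # strictly upper triangular part of U
    x = b / d                # initial guess x_0 = D^{-1} b
    for _ in range(num_iters):
        x = (b - F @ x) / d  # x_{k+1} = D^{-1} (b - F x_k)
    return x


def commutator_norm(A):
    """Frobenius norm of A A^T - A^T A (zero iff the real matrix A is normal)."""
    return np.linalg.norm(A @ A.T - A.T @ A, "fro")


# Small illustrative system (hypothetical data, not from the paper).
U = np.array([[4.0, 1.0, 0.5],
              [0.0, 3.0, 1.0],
              [0.0, 0.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

x_jacobi = jacobi_upper_solve(U, b, num_iters=3)
x_direct = np.linalg.solve(U, b)
```

For this mildly non-normal $3 \times 3$ example, the Jacobi result agrees with the direct solve because the iteration matrix $-D^{-1}F$ is strictly upper triangular and hence nilpotent. For large, highly non-normal factors the intermediate iterates can grow substantially before that point is reached, which is the situation the scaling is meant to mitigate.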
ILU Smoothers with Iterative Triangular Solvers for the GPU