CCIMM26: COLORADO CONFERENCE ON ITERATIVE AND MULTIGRID METHODS
PROGRAM FOR MONDAY, JUNE 22ND

08:00-09:40 Session 4: Krylov methods for linear and nonlinear problems
08:00
Newton--Leja GMRES and the Polar Filter
PRESENTER: Stephen Thomas

ABSTRACT. The $s$-step Krylov framework reduces global synchronizations from $O(N)$ to $O(m/s)$ over $N$ iterations by computing $s$ basis vectors per communication round. The central question is whether truncated orthogonalization to depth $m$ can be performed without numerical instability. This talk answers affirmatively through a single organizing principle: off-diagonal decay of the Krylov block Gram matrix.

For the Newton--Leja basis the decay is geometric, at a rate controlled by the logarithmic capacity of the spectral set, giving storage $O(msn)$ and $O(m/s)$ global reductions per iteration, independent of the total iteration count.

The deeper result concerns the Arnoldi Hessenberg matrix itself. The Newton--Leja basis induces a similarity \[ H_k \;=\; R\,N_k\,R^{-1}, \] where $N_k$ is the Newton-basis Hessenberg matrix and $R$ is the change-of-basis matrix. Under polar preconditioning $\kappa(R) = 1$, so the similarity becomes unitary: every spectral property of $H_k$ is preserved exactly in the Newton basis, and $\kappa(H_k) \leq \kappa(A)$. MGS-GMRES with a Newton--Leja basis does not square the condition number.

When eigenvector non-orthogonality persists, the polar filter applies Newton--Schulz iteration directly to the Krylov block, driving the block Gram matrix to the identity quadratically using only intra-block matrix products and no global reductions. Numerical experiments on strongly non-normal operators confirm recovery of convergence where block Gauss--Seidel stagnates, at equal or lower cost.
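
As an illustration of the filtering step, here is a minimal serial NumPy sketch of a Newton--Schulz iteration on a Krylov block (the function name, tolerance, and pre-scaling advice are illustrative, not taken from the talk):

```python
import numpy as np

def polar_filter(V, tol=1e-12, max_iter=25):
    """Newton--Schulz iteration driving the block Gram matrix V^T V to I.

    Illustrative sketch: convergence is quadratic when the singular
    values of V lie in (0, sqrt(3)), so V may need pre-scaling,
    e.g. V / ||V||_2, before filtering.
    """
    I = np.eye(V.shape[1])
    for _ in range(max_iter):
        G = V.T @ V                       # block Gram matrix (intra-block product)
        if np.linalg.norm(I - G) < tol:   # quadratic decay of ||I - G||
            break
        V = V @ (1.5 * I - 0.5 * G)       # V <- V (3I - V^T V) / 2
    return V
```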

08:25
Restarted IDR($s$)-QMR with Residual-Polynomial Preconditioning for Large Nonsymmetric Linear Systems
PRESENTER: Qianqian Xue

ABSTRACT. We consider the iterative solution of large, sparse, nonsymmetric linear systems $Ax=b$. Exploiting the Hessenberg relations inherent in the IDR($s$) process, we revisit and structurally reformulate the IDR($s$)-QMR method, resulting in a restarted framework that is convenient for implementation and analysis while preserving short recurrences and low memory consumption. Building on this reconstruction, we develop two polynomial-preconditioning strategies rooted in the IDR($s$)-QMR mechanism. We derive a spectral-transformation operator $\phi(\cdot)$ from the IDR($s$) residual polynomial, using harmonic Ritz information together with Leja ordering to enhance robustness; IDR($s$) is then executed with $\phi(A)$ as the effective operator, and the solution to the original system is recovered through the associated reconstruction formula. Numerical experiments demonstrate that the method improves convergence and reduces the overall computational cost on representative test problems.
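
For readers unfamiliar with residual-polynomial preconditioning, a generic NumPy sketch of applying an operator $\phi(A) = I - r(A)$ built from Leja-ordered Ritz values follows; the talk's $\phi$, derived from the IDR($s$) residual polynomial, may differ in detail:

```python
import numpy as np

def apply_phi(A, v, thetas):
    """Apply phi(A) v = v - r(A) v, where r(z) = prod_j (1 - z / theta_j).

    Generic construction only: thetas are (harmonic) Ritz values, assumed
    to be supplied already in Leja order for numerical stability.
    """
    w = v.copy()
    for theta in thetas:          # w <- (I - A/theta) w, one root at a time
        w = w - (A @ w) / theta
    return v - w                  # phi(A) v = (I - r(A)) v
```

The iteration then runs with `apply_phi` in place of the plain matvec, and the solution to the original system is recovered afterwards through the reconstruction formula mentioned above.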

08:50
A Unified Two-Grid Framework for Anderson Acceleration and Nonlinear GMRES
PRESENTER: Satchel Lefebvre

ABSTRACT. In this work, we recast two widely used acceleration methods, Anderson Acceleration (AA) and Nonlinear GMRES (NGMRES), as a two-grid scheme. Within this reformulation, we propose new variations of each method and establish a general two-grid accelerator. This includes expanding the dimension of the coarse space, as well as utilizing both pre- and post-smoothing in each approach. We briefly discuss connections to existing theory and why an expanded coarse space suggests better convergence rates. Through numerical experiments, we highlight the advantages of the new variations, while also presenting counterexamples that suggest the need for further analysis.
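
For context, a minimal NumPy sketch of windowed Anderson acceleration, the baseline method being recast here (the two-grid variants and smoothing steps from the talk are not shown):

```python
import numpy as np

def anderson(g, x0, m=5, tol=1e-10, max_iter=200):
    """Windowed Anderson acceleration AA(m) for the fixed point x = g(x)."""
    x = x0
    G, F = [], []                            # histories of g-values and residuals
    for _ in range(max_iter):
        gx = g(x)
        f = gx - x                           # fixed-point residual
        if np.linalg.norm(f) < tol:
            return x
        G.append(gx); F.append(f)
        G, F = G[-(m + 1):], F[-(m + 1):]    # keep a window of m + 1 entries
        if len(F) > 1:
            dF = np.column_stack([F[j + 1] - F[j] for j in range(len(F) - 1)])
            dG = np.column_stack([G[j + 1] - G[j] for j in range(len(G) - 1)])
            gamma = np.linalg.lstsq(dF, f, rcond=None)[0]  # least-squares step
            x = gx - dG @ gamma              # accelerated update
        else:
            x = gx                           # plain fixed-point step
    return x
```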

09:15
New Variants of Anderson Acceleration and Nonlinear GMRES Based on Reformulated Least-Squares Problems

ABSTRACT. In this talk, we propose three variants of Anderson acceleration and nonlinear GMRES for general fixed-point iterations, based on modified least-squares problems underlying both methods. To solve linear systems, we apply these new approaches to accelerate the preconditioned Richardson iteration. We establish connections between the proposed variants and both left- and right-preconditioned GMRES methods. In particular, we show that full NGMRES applied to the preconditioned Richardson iteration is equivalent to right-preconditioned GMRES, while full NGMRES equipped with the new least-squares formulation is equivalent to left-preconditioned GMRES. Furthermore, under certain conditions on the preconditioned coefficient matrix, we demonstrate an equivalence between windowed NGMRES with any depth and preconditioned GMRES. These theoretical results deepen our understanding of NGMRES for solving linear systems and clarify its relationship to classical GMRES. Finally, we present numerical results demonstrating the effectiveness of these new variants in solving both linear and nonlinear problems.
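
For orientation, the setting assumed here (standard usage, not quoted from the talk): preconditioned Richardson for $Ax=b$ with preconditioner $M$ is the fixed-point iteration $x_{k+1} = g(x_k)$ with \[ g(x) = x + M^{-1}(b - Ax), \qquad f(x) = g(x) - x = M^{-1}(b - Ax), \] so the least-squares problems in AA and NGMRES are posed over preconditioned residuals $f(x_k)$, and the choice of least-squares formulation is what distinguishes the right- versus left-preconditioned GMRES equivalences.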

09:40-10:20 Coffee and Tea Break
10:20-12:00 Session 5: Randomized Iterative Solvers
10:20
FlexTrace: Exchangeable Randomized Trace Estimation for Matrix Functions

ABSTRACT. We consider the task of estimating the trace of a matrix function, ${\rm tr}(f({\bf A}))$, of a large symmetric positive semi-definite matrix ${\bf A}$. This problem arises in multiple applications, including kernel methods and inverse problems. A key challenge with existing trace estimation methods is the need for matrix-vector products (matvecs) with $f({\bf A})$, which can be significantly more expensive to approximate than matvecs with ${\bf A}$. In this paper, we introduce a novel trace estimator, FlexTrace, an exchangeable single-pass method that estimates ${\rm tr}(f({\bf A}))$ solely using matvecs with ${\bf A}$. We consider the case where $f$ is an operator monotone matrix function with $f(0)=0$, which includes functions such as $\log(1+x)$ and $x^{1/2}$, and derive probabilistic bounds showcasing the theoretical advantages of FlexTrace. Numerical experiments across synthetic examples and applications demonstrate that FlexTrace provides more accurate estimates of the trace of $f({\bf A})$ compared to existing methods.
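
For context, the standard stochastic Lanczos quadrature baseline, which likewise uses only matvecs with ${\bf A}$ (this is not the FlexTrace estimator; parameter names are illustrative):

```python
import numpy as np

def slq_trace(matvec, n, f, num_probes=30, steps=40, rng=None):
    """Girard--Hutchinson + Lanczos quadrature estimate of tr(f(A)).

    Context sketch only: each probe z approximates z^T f(A) z from the
    Lanczos tridiagonal matrix. Assumes A is symmetric PSD and f is
    vectorized (e.g., np.log1p or np.sqrt). No reorthogonalization.
    """
    rng = rng or np.random.default_rng()
    total = 0.0
    for _ in range(num_probes):
        z = rng.standard_normal(n)
        beta0 = np.linalg.norm(z)
        q, q_prev = z / beta0, np.zeros(n)
        alphas, betas = [], []
        for _ in range(steps):                # Lanczos three-term recurrence
            w = matvec(q) - (betas[-1] * q_prev if betas else 0.0)
            a = q @ w
            w = w - a * q
            alphas.append(a)
            b = np.linalg.norm(w)
            if b < 1e-12:
                break
            betas.append(b)
            q_prev, q = q, w / b
        off = betas[:len(alphas) - 1]
        T = np.diag(alphas) + np.diag(off, 1) + np.diag(off, -1)
        evals, evecs = np.linalg.eigh(T)      # quadrature nodes and weights
        total += beta0**2 * np.sum(f(evals) * evecs[0, :] ** 2)
    return total / num_probes
```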

10:45
Markov chains, graph Laplacians, and column subset selection
PRESENTER: Mark Fornace

ABSTRACT. The column subset selection problem (CSSP) is a central problem in matrix low-rank approximation. I will describe how our particular motivation in reduced order models of Markov chains has led to new theory and algorithms for this long-standing problem. Our problem setting, relevant to ubiquitous dynamical systems in chemistry, physics, and computer science, maps nicely to performing column subset selection on the (suitably regularized) inverse of the respective graph Laplacian. Using contour integral and probabilistic arguments, I will first show how the dynamical error of reversible Markov chain compression can be formulated and simply bounded in terms of Nyström approximation. Then I will describe how nuclear maximization is an especially attractive method for the CSSP in this setting, presenting new matrix-free algorithms based on randomized diagonal estimation via approximate Cholesky factorization. I will demonstrate our algorithms' empirical performance and theoretical guarantees, detailing a novel combination of ideas from submodularity, probability theory, and determinantal point processes. Time permitting, I will describe our success in extending our iterative methods to the general approximation of matrices and tensors via interpolative decomposition.
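
The Nyström approximation referenced above, given a selected column subset, is a one-liner; a minimal NumPy sketch (the CSSP/nuclear-maximization routines from the talk are not shown):

```python
import numpy as np

def nystrom(A, cols):
    """Nystrom approximation of a PSD matrix A from a column subset `cols`."""
    C = A[:, cols]                        # selected columns
    W = A[np.ix_(cols, cols)]             # core block indexed by the subset
    return C @ np.linalg.pinv(W) @ C.T    # A ~ C W^+ C^T
```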

11:10
Randomized Fast Subspace Descent Methods
PRESENTER: Zixiao Yang

ABSTRACT. We study randomized fast subspace descent (RFASD) methods for an unconstrained convex minimization problem. The method is built on a space decomposition that is stable in the $A$-norm, where $A$ is a symmetric positive definite preconditioner. At each iteration, a subspace is selected using weights related to the local Lipschitz constant. We prove that RFASD converges linearly for strongly convex problems. Compared with coordinate descent-type methods, our numerical experiments on a logistic regression problem indicate that using a multilevel subspace can reduce the number of iterations per subspace. However, RFASD can be more sensitive to randomness, and the performance under permutation sampling shows outliers compared with a cyclic FASD approach where the subspaces are ordered sequentially.
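
A generic sketch of such an iteration (illustrative only; not the exact RFASD update or sampling rule from the talk):

```python
import numpy as np

def subspace_descent(grad, x, subspaces, lipschitz, num_iters=200, rng=None):
    """Randomized subspace descent: sample a subspace, step along its gradient.

    subspaces[i] is a matrix whose columns span the i-th subspace; the
    local Lipschitz constants lipschitz[i] serve both as sampling
    weights and as step sizes 1/L_i.
    """
    rng = rng or np.random.default_rng()
    L = np.asarray(lipschitz, dtype=float)
    p = L / L.sum()                           # sample proportional to L_i
    for _ in range(num_iters):
        i = rng.choice(len(subspaces), p=p)
        V = subspaces[i]
        g = V.T @ grad(x)                     # gradient restricted to the subspace
        x = x - V @ (g / L[i])                # descent step with step size 1/L_i
    return x
```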

11:35
Randomized Sketch-and-Project Methods for Quadratic Minimax Problems
PRESENTER: Ruhui Jin

ABSTRACT. We develop a unified randomized sketch-and-project framework for two-player zero-sum quadratic minimax optimization. The method applies sketch-and-project to player-wise first-order information, so that each player projects onto a randomly sketched condition from its own local view under a user-chosen metric. The framework is driven by two ingredients: the geometry assigned to each player and the sketching distribution. By varying them, we obtain a family of practical algorithms for quadratic minimax problems, including Kaczmarz gradient descent-ascent (GDA), along with player-wise randomized coordinate and Newton-type methods. These variants can be viewed as the minimax or player-centric extensions of classical sketch-and-project schemes for solving large-scale linear systems. We also clarify the connections with stochastic GDA and primal-dual coordinate methods. Our framework accommodates block and importance sampling, as well as general sketching designs, and does not require step-size tuning. Focusing on quadratic objectives, we establish linear convergence in expectation to a first-order stationary point under explicit structured coupling conditions, without assuming convex-concavity. The general convergence theorem further yields simplified rates for the key algorithmic variants. Numerical experiments on synthetic and imaging problems corroborate the theory.
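
Because the objective is quadratic, the stacked player-wise first-order conditions form a linear system, and each step is a projection onto a sketched block of it; a generic Euclidean-metric sketch (the player-wise metrics and sketching designs from the talk are not reproduced):

```python
import numpy as np

def sketch_and_project(M, c, z, sample_rows, num_iters=500, rng=None):
    """Sketch-and-project on a linear system M z = c.

    sample_rows(rng) returns the row indices of the sketched condition,
    e.g. a random row block belonging to one player.
    """
    rng = rng or np.random.default_rng()
    for _ in range(num_iters):
        rows = sample_rows(rng)
        Ms, cs = M[rows], c[rows]
        r = Ms @ z - cs                       # sketched residual
        # project z onto {w : Ms w = cs} (minimum-norm correction)
        z = z - Ms.T @ np.linalg.lstsq(Ms @ Ms.T, r, rcond=None)[0]
    return z
```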

12:00-14:40 Lunch Break
14:40-15:20 Coffee and Tea Break
15:20-17:00 Session 6: Mixed precision iterative solvers
15:20
Using Half Precision for Matrix Splitting Iterations
PRESENTER: Neil Lindquist

ABSTRACT. Matrix splitting methods are used to solve the block sparse linear equations stemming from certain scientific and engineering applications. To accelerate these solves, we experiment with storing off-diagonal elements in half precision, thus reducing memory motion while still achieving surprisingly robust convergence. Our approach is based on iterative refinement and its relation to stationary iterations. We present both theoretical and experimental results showing that the proposed technique can achieve double-precision accuracy faster than a fully double-precision solver, even when incorporating half precision.
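
A minimal NumPy sketch of the idea (Jacobi splitting for concreteness; the talk's splitting and blocking are not reproduced): store the off-diagonal part in float16, run the splitting iteration as the inner solver, and refine the fp64 residual:

```python
import numpy as np

def split_refine(A, b, inner=10, outer=20):
    """Iterative refinement with a half-precision-storage splitting solver."""
    D = np.diag(A).astype(np.float64)              # diagonal kept in fp64
    R16 = (A - np.diag(D)).astype(np.float16)      # off-diagonals stored in fp16
    x = np.zeros_like(b, dtype=np.float64)
    for _ in range(outer):
        r = b - A @ x                              # fp64 residual
        e = np.zeros_like(r)
        for _ in range(inner):                     # Jacobi splitting solve A e ~ r
            e = (r - R16 @ e) / D                  # fp16 @ fp64 promotes to fp64
        x = x + e                                  # fp64 correction
    return x
```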

15:45
Mixed-precision LOBPCG on GPUs

ABSTRACT. Many scientific applications are implemented in double precision (64 bits), almost by default. But recent concerns over energy budgets, and trends in hardware architecture driven by Artificial Intelligence (AI), are leading to research into mixed-precision or lower-precision algorithms. Our research focus is large-scale iterative eigenvalue solvers, specifically for the important case of large sparse symmetric matrices. We are investigating mixed-precision algorithms to solve large-scale eigenvalue problems on modern HPC hardware, in particular the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) method. Using a C++ Kokkos-based implementation, we obtain portability and low-precision support on GPUs, enabling investigations into algorithm stability, performance, and energy use. Numerical precision requirements can be highly dependent on the specific application. To demonstrate the practical benefits of mixed precision, we target an application where large-scale eigensolvers are key computational components: Density Functional Theory (DFT), used in first-principles simulations for modeling in materials science and chemistry.

Research sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the US Department of Energy.

16:10
Massively Parallel Domain Decomposition Preconditioner on a GPU: Efficient Implementation and Fine-Tuning

ABSTRACT. Exploiting GPU processing power for numerical computation requires regular data structures and algorithms with a high degree of parallelism. In the context of PDE solvers, multigrid methods have been successfully adapted to meet these demands, whereas domain decomposition methods remain relatively unexplored on GPUs. In this paper, we develop and analyze a PyTorch-based GPU implementation of a massively parallel domain decomposition preconditioner based on [1].

To overcome bottlenecks in PyTorch (e.g., in block sparse matrix-vector products), we design custom CUDA kernels that attain 70-90% of the measured peak memory bandwidth of an NVIDIA A100 GPU. To further enhance performance, we leverage mixed-precision computation and fine-tune the preconditioner’s parameters, such as subdomain and coarse problem sizes. The resulting solver takes a sparse matrix as input, enabling straightforward handling of variable or high-contrast coefficients and non-uniform meshes. The memory footprint of the solver is relatively low: systems with up to 100M DoFs in 2D and 50M DoFs in 3D can be solved on a single A100 with 80 GB of memory. Moreover, the preconditioner outperforms NVIDIA’s AmgX in all tested scenarios, with a maximum speedup of 4.92× observed in 3D for higher-order elements.

[1] M. Dryja and P. Krzyżanowski, “A massively parallel nonoverlapping additive Schwarz method for discontinuous Galerkin discretization of elliptic problems,” Numer. Math., vol. 132, no. 2, pp. 347–367, 2016.

16:35
Block Jacobi Preconditioning on Analog Hardware
PRESENTER: Shikhar Shah

ABSTRACT. As scientific computing and machine learning push the limits of digital hardware, analog in-memory computing has emerged as a promising path toward energy-efficient and high-throughput linear algebra. Analog computing devices leverage non-volatile resistive memory to perform linear algebra operations directly via physical laws, such as Ohm’s and Kirchhoff’s laws, enabling matrix-vector products to be executed in time independent of the matrix sparsity. Although these operations are inherently approximate and stochastic, robust numerical algorithms make analog hardware particularly attractive for large-scale problems, especially the solution of large, sparse linear systems.

In this talk, we present theoretical and experimental convergence results for a block Jacobi preconditioner implemented on analog hardware. We systematically compare construction methods, including the sparse approximate inverse and the Monte Carlo approximate inverse, and benchmark performance on the Poisson and convection-diffusion-reaction equations. A key finding is that optimal preconditioner parameters differ substantially from those established for digital architectures, suggesting that porting existing algorithms to analog platforms is insufficient and that new theory is needed. Based on these insights, we introduce a hierarchical two-level preconditioner scalable to very large systems and well-suited to the constraints of analog hardware.
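
A baseline digital block Jacobi construction, for reference (the analog variants, the sparse and Monte Carlo approximate inverses, and the two-level preconditioner from the talk are not reproduced):

```python
import numpy as np

def block_jacobi(A, block_size):
    """Explicit block Jacobi approximate inverse M ~ A^{-1} (dense sketch)."""
    n = A.shape[0]
    M = np.zeros_like(A, dtype=float)
    for start in range(0, n, block_size):
        idx = slice(start, min(start + block_size, n))
        M[idx, idx] = np.linalg.inv(A[idx, idx])   # invert each diagonal block
    return M
```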