View: session overviewtalk overview
| 08:00 | The Christoffel-Darboux Basis for Krylov Methods PRESENTER: Stephen Thomas ABSTRACT. We introduce a Christoffel-Darboux polynomial basis for Krylov methods. The construction is closest in spirit to Newton-Leja GMRES: a polynomial panel is generated first, and the Arnoldi relation is recovered afterward. The difference is that Newton-Leja uses a two-term recurrence based on interpolation nodes, whereas the Christoffel-Darboux construction uses the three-term Jacobi recurrence associated with the spectral measure induced by the matrix and starting residual. For Hermitian problems, this recurrence generates a measure-adapted orthonormal polynomial basis. The three-term recurrence gives constant live memory during basis generation, because each new vector depends only on the two previous recurrence vectors. Gauss quadrature exactness gives an identity local Gram matrix in exact arithmetic. The Christoffel-Darboux kernel gives the corresponding cardinal basis, which indexes the same Krylov subspace by Gauss nodes rather than by polynomial degree. After the recurrence panel is generated, synchronization is deferred to the panel boundary, where the basis is orthogonalized and the Arnoldi Hessenberg matrix is recovered by a similarity transformation. Numerical examples compare the Christoffel-Darboux basis with Newton-Leja and related polynomial bases on Hermitian, nonnormal, and wave-propagation problems, including GPU experiments on AMD MI300X hardware. |
| 08:25 | A Unified Two-Grid Framework for Anderson Acceleration and Nonlinear GMRES PRESENTER: Satchel Lefebvre ABSTRACT. In this work, we recast two widely used accelerating methods, Anderson Acceleration (AA) and Nonlinear GMRES (NGMRES) as a two-grid scheme. Within this reformulation, we propose new variations to each method and establish a general two-grid accelerator. This includes expanding the dimension of the coarse space, as well as utilizing both pre- and post-smoothing in each approach. We offer some brief discussion on the connections to existing theory, and why an expanded coarse space suggests better convergence rates. Through numerical experiments, we highlight the advantages of the new variations, while also demonstrating counter examples that suggest the need for further analysis. |
| 10:20 | FlexTrace: Exchangeable Randomized Trace Estimation for Matrix Functions PRESENTER: Madhusudan Madhavan ABSTRACT. We consider the task of estimating the trace of a matrix function, ${\rm tr}(f({\bf A}))$, of a large symmetric positive semi-definite matrix ${\bf A}$. This problem arises in multiple applications, including kernel methods and inverse problems. A key challenge with existing trace estimation methods is the need for matrix-vector products (matvecs) with $f({\bf A})$, which can be significantly more expensive to approximate than matvecs with ${\bf A}$. In this paper, we introduce a novel trace estimator, FlexTrace, an exchangeable single-pass method that estimates ${\rm tr}(f({\bf A}))$ solely using matvecs with ${\bf A}$. We consider the case where $f$ is an operator monotone matrix function with $f(0)=0$, which includes functions such as $\log(1+x)$ and $x^{1/2}$, and derive probabilistic bounds showcasing the theoretical advantages of FlexTrace. Numerical experiments across synthetic examples and applications demonstrate that FlexTrace provides more accurate estimates of the trace of $f({\bf A})$ compared to existing methods. |
| 10:45 | Markov chains, graph Laplacians, and column subset selection PRESENTER: Mark Fornace ABSTRACT. The column subset selection problem (CSSP) is an essential target of research in matrix low-rank approximation. I will describe how our particular motivation in reduced order models of Markov chains has led to new theory and algorithms for this long-standing problem. Our problem setting, relevant to ubiquitous dynamical systems in chemistry, physics, and computer science, is nicely mappable to performing column subset selection on the (suitably regularized) inverse of the respective graph Laplacian. Using contour integral and probabilistic arguments, I will first show how the dynamical error of reversible Markov chain compression can be formulated and simply bounded in terms of Nyström approximation. Then I will describe how nuclear maximization is an especially attractive method for the CSSP in this setting, showing new matrix-free algorithms based on randomized diagonal estimation via approximate Cholesky factorization. I will demonstrate our algorithms' empirical performance and theoretical guarantees, detailing a novel combination of ideas from submodularity, probability theory, and determinantal point processes. Time permitting, I will describe our success in extending our iterative methods to the general approximation of matrices and tensors via interpolative decomposition. |
| 11:10 | Randomized Fast Subspace Descent Methods PRESENTER: Zixiao Yang ABSTRACT. We study randomized fast subspace descent (RFASD) methods for an unconstrained convex minimization problem. The method is built on a space decomposition that is stable in the $A-$norm, where A is a symmetric positive definite preconditioner. At each iteration, a subspace is selected using weights related to the local Lipschitz constant. We prove that RFASD converges linearly for strongly convex problems. Comparing to coordinate descent type methods, our numerical experiments on a logistic regression problem indicate that using a multilevel subspace can reduce the number of iterations per subspace. However, RFASD can be more sensitive to randomness, and the performance when using permutation sampling shows outliers compared with a cyclic FASD approach where the subspaces are ordered sequentially. |
| 11:35 | Randomized Sketch-and-Project Methods for Quadratic Minimax Problems PRESENTER: Ruhui Jin ABSTRACT. We develop a unified randomized sketch-and-project framework for two-player zero-sum quadratic minimax optimization. The method applies sketch-and-project to player-wise first-order information, so that each player projects onto a randomly sketched condition from its own local view under a user-chosen metric. The framework is driven by two ingredients: the geometry assigned to each player and the sketching distribution. By varying them, we obtain a family of practical algorithms for quadratic minimax problems, including Kaczmarz gradient descent-ascent (GDA), along with player-wise randomized coordinate and Newton-type methods. These variants can be viewed as the minimax or player-centric extensions of classical sketch-and-project schemes for solving large-scale linear systems. We also clarify the connections with stochastic GDA and primal-dual coordinate methods. Our framework accommodates block and importance samplings, as well as general sketching designs, and does not require step-size tuning. Focusing on quadratic objectives, we establish linear convergence in expectation to a first-order stationary point under explicit structured coupling conditions, without assuming convex-concavity. The general convergence theorem further yields simplified rates for the key algorithmic variants. Numerical experiments on synthetic and imaging problems corroborate the theory. |
| 15:20 | Using Half Precision for Matrix Splitting Iterations PRESENTER: Neil Lindquist ABSTRACT. Matrix splitting methods are used to solve the block sparse linear equations stemming from certain scientific and engineering applications. To accelerate these solves, we experiment with storing off diagonal elements in half-precision, thus reducing the memory motion while still achieving surprisingly robust convergence. Our approach is based on iterative refinement and its relation to stationary iterations. We present both theoretical and experimental results showing that the proposed technique can achieve double-precision accuracy faster than a fully double-precision solver, even when incorporating half-precision. |
| 15:45 | Mixed-precision LOBPCG on GPUs PRESENTER: Jean-Luc Fattebert ABSTRACT. Many scientific applications are implemented in double precision (64 bits), almost by default. But recent concerns over energy budgets and trends in hardware architecture driven by Artificial Intelligence (AI) are leading to research into using mixed precision or lower precision algorithms. Our research focus is on large-scale iterative eigenvalue solvers, specifically for the important case of large sparse symmetric matrices. We are investigating mixed-precision algorithms to solve large-scale eigenvalue problems on modern HPC hardware, in particular the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG). Using a C++ Kokkos-based implementation, we get portability and low precision support on GPUs, enabling investigations into algorithm stability, performance and energy use. Numerical precision requirements can be highly dependent on specific applications. To demonstrate practical benefits of mixed-precision, we target an application where large-scale eigensolvers are key computational components: Density Functional Theory (DFT), used in first-principles simulations for modeling in materials science and chemistry. Research sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the US Department of Energy. |
| 16:10 | Massively Parallel Domain Decomposition Preconditioner on a GPU: Efficient Implementation and Fine-Tuning PRESENTER: Mieszko Grodzicki ABSTRACT. Exploiting GPU processing power for numerical computation requires regular data structures and algorithms with a high degree of parallelism. In the context of PDE solvers, multigrid methods have been successfully adapted to meet these demands, whereas domain decomposition methods remain relatively unexplored on GPUs. In this paper, we develop and analyze a PyTorch-based GPU implementation of a massively parallel domain decomposition preconditioner based on [1]. To overcome bottlenecks in PyTorch (e.g., in block sparse matrix-vector products), we design custom CUDA kernels that attain 70 − 90% of the measured peak memory bandwidth of an NVIDIA A100 GPU. To further enhance performance, we leverage mixed-precision computation and fine-tune the preconditioner’s parameters, such as subdomain and coarse problem sizes. The resulting solver takes a sparse matrix as input, enabling straightforward handling of variable or high-contrast coefficients and non-uniform meshes. The memory footprint of the solver is relatively low: systems with up to 100M DoFs in 2D and 50M DoFs in 3D can be solved on a single A100 with 80GB of memory. Moreover, the preconditioner outperforms NVIDIA’s AmgX in all tested scenarios, with a maximum speedup of 4.92× observed in 3D for higher-order elements. [1] M. Dryja and P. Krzyżanowski, “A massively parallel nonoverlapping additive Schwarz method for discontinuous Galerkin discretization of elliptic problems,” Numer. Math., vol. 132, no. 2, pp. 347–367, 2016. |
| 16:35 | Block Jacobi Preconditioning on Analog Hardware PRESENTER: Shikhar Shah ABSTRACT. As scientific computing and machine learning push the limits of digital hardware, analog in-memory computing has emerged as a promising path toward energy-efficient and high-throughput linear algebra. Analog computing devices leverage non-volatile resistive memory to perform linear algebra operations directly via physical laws, such as Ohm’s and Kirchhoff’s laws, enabling matrix-vector products to be executed in time independent of the matrix sparsity. Although these operations are inherently approximate and stochastic, robust numerical algorithms make analog hardware particularly attractive for large-scale problems, especially the solution of large, sparse linear systems. In this talk, we present theoretical and experimental convergence results for a block Jacobi preconditioner implemented on analog hardware. We systematically compare construction methods, including the sparse approximate inverse and the Monte Carlo approximate inverse, and benchmark performance on the Poisson and convection-diffusion-reaction equations. A key finding is that optimal preconditioner parameters differ substantially from those established for digital architectures, suggesting that porting existing algorithms to analog platforms is insufficient and that new theory is needed. Based on these insights, we introduce a hierarchical two-level preconditioner scalable to very large systems and well-suited to the constraints of analog hardware. |