View: session overviewtalk overview
| 08:00 | Cheetah: Optimizing Execution Pipelines for Matrix-Free Finite Element Operators on GPUs PRESENTER: Jie Ren ABSTRACT. Matrix-free finite element methods are widely used in large-scale scientific simulations due to their reduced memory footprint and favorable arithmetic intensity, making them particularly attractive for modern GPU architectures. However, achieving high performance for fully fused, matrix-free GPU kernels remains challenging, as execution is often limited by pipeline inefficiencies rather than floating-point throughput. In this work, we present \textsc{Cheetah}, a set of kernel-level optimizations that systematically redesign the execution pipeline of just-in-time compiled matrix-free finite element operators without modifying the mathematical formulation or user-facing programming model. \textsc{Cheetah} improves register reuse, reduces shared-memory traffic and CTA-level (cooperative thread array) synchronization, overlaps indirect memory accesses with computation using asynchronous copy, and exploits constant memory for invariant operator data. These optimizations are implemented transparently within the \texttt{libCEED} JIT framework and require no changes to user-provided quadrature functions. We evaluate \textsc{Cheetah} on mass-like and diffusion-like operators on an NVIDIA A100 GPU. The results demonstrate substantial performance improvements, particularly at higher polynomial orders, with up to 30\% higher throughput for mass operators and up to 50\% for diffusion operators compared to the baseline \texttt{libCEED} GPU backend. Detailed profiling shows that these gains arise from improved compute utilization, reduced synchronization overhead, increased arithmetic intensity, and more effective latency hiding. Our results highlight the importance of execution pipeline design in fully fused matrix-free GPU kernels and demonstrate that low-level kernel restructuring is essential for unlocking the full performance potential of matrix-free methods on modern GPUs. |
| 08:25 | Accelerating Implicit Contact via Matrix-Free Operators on Non-Conforming Interfaces ABSTRACT. In continuum solid mechanics, robust and accurate simulation of mesh-to-mesh contact remains a challenging problem to scale to modern GPU-based architectures. In this work, we present a novel matrix-free strategy for evaluating finite element integrals of nonlinear functions and their linearizations consistently across non-conforming and solution-dependent interfaces. To avoid assembling complex coupling matrices for segmented face intersections, we establish a single layout of dynamically determined quadrature points shared by both contacting meshes. By utilizing performance-portable libCEED operators to extract the necessary points for each surface cell, we can map quantities directly across the interface without copying the underlying vector of values. We demonstrate the efficiency of this zero-copy, matrix-free approach for mesh-to-mesh contact simulation in Ratel, our high-performance implicit solid mechanics library. |
| 08:50 | A Matrix-Free Algebraic HP-Multigrid Method for Computational Fluid Dynamics Applications PRESENTER: Peter Ohm ABSTRACT. With the advent of advanced computing architectures such as GPUs, many algorithms have required changes in order to properly utilize them. One such approach that is well-suited for GPUs is matrix-free methods. Matrix-free methods can be advantageous over traditional methods due to their memory efficiency and lower computational complexity, but with the cost of needing specialized software frameworks and solvers. Geometric multigrid methods have seen great success on GPU architecture with the regular geometric stencils naturally lending themselves to matrix-free implementations. Algebraic multigrid (AMG) methods are problematic because they usually require information about matrix entries, which may not be easily obtainable in the matrix-free framework. In this talk we discuss the implementation of a matrix-free AMG method in Neko, a portable framework for high-order spectral element framework for computational fluid dynamics simulations. We utilize an hp-multigrid approach, where the problem is first coarsened from high-order polynomials to low-order polynomials, and the resulting low-order system is further coarsened spatially in a matrix-free AMG fashion. Leveraging only mesh adjacency information, this algorithm constructs an algebraic multigrid hierarchy without requiring geometric coarsening or explicit matrix assembly, making it well-suited for GPU-accelerated architectures. Numerical results demonstrate the performance and scalability of our method and applicability to real-world unstructured-mesh applications. We highlight the advantages and shortcoming of our matrix-free AMG approach though comparison against a more traditional matrix-assembled AMG method. |
| 09:15 | Matrix free PDE-constrained optimization with basis preconditioning for gas flows in networks PRESENTER: Rowan Turner ABSTRACT. We present a matrix free interior point method (IPM) to solve the PDE-constrained optimization problem that arises from gas transport in hydrogen gas networks. The problem is highly structured, arising from both the discretization of the PDE and the gas network structures themselves, making it suitable for a matrix free setting. The heart of the method relies on a preconditioner, which is used in the inner iterations of the interior point method. To precondition the problem, we derive a basis for the constraint matrix, which corresponds to solving a problem with fixed boundary conditions/controls. We apply this preconditioner on two problems -- one which considers the governing isothermal Euler equations along with compressor equations for controlling pressure in the network, and the second which also considers integer valued valve equations for controlling flow. We present results on scalability for time horizons and mesh refinement, both in terms of the iterative solve and the outer IPM iteration count. |
| 08:00 | High-order Iterative and Multilevel Elliptic Solver for Node Sets Not Aligned with Boundaries PRESENTER: Andrew Lawrence ABSTRACT. This work combines several computational methods to produce a robust meshfree multilevel PDE solver for unfitted node sets. Building off of previous work combining meshfree RBF-FD methods with a multilevel solver, the present work incorporates Lagrange multiplier enforced boundary conditions. The Lagrange multiplier enforced boundary conditions allow for the interior node sets to remain independent of the boundary. This is particularly advantageous in systems where the boundary might be evolving in time, as the interior node set may then be defined as a subset of a much larger background node set. Points from the background node set may then be included or excluded according to the evolving boundary without having to fit the nodes at each time step. Similarly, multilevel operators along moving boundaries must change. To minimize the number of multilevel operators to be recomputed, a domain decomposition method is introduced. The multilevel method operates on an unchanging core node set while the near-boundary domain is solved by a direct method. The results of applying this method to time-independent PDEs demonstrate that the overall method preserves the linear cost of the multilevel method while maintaining the expected RBF-FD high convergence rates. Although the case of moving boundaries is not discussed, the examples which are included serve to demonstrate the viability of this combined method for future applications to time-dependent problems. |
| 08:25 | Computing the ground state and dynamics of rotational dipolar Bose-Einstein condensates PRESENTER: Fei Xue ABSTRACT. We present high-efficiency numerical algorithms for the study of rotational dipolar Bose-Einstein condensates (BECs) using Fourier pseudospectral spatial discretization. For ground state computation, we introduce a preconditioned conjugate gradient method featuring a sparse approximate Hessian preconditioner and an exact line search facilitated by fast energy evaluation. For the dynamics, we develop an axial shear method that accurately implements discrete linear flows within second-order Strang and fourth-order Yoshida time-stepping frameworks. By leveraging fast energy evaluation, these methods preserve mass and energy at a substantially reduced computational cost. Numerical comparisons demonstrate that the proposed algorithms offer significant performance gains and reliability over existing state-of-the-art methods. |
| 08:50 | Stability Analysis of Inexact Solves in Interpolatory One-Sided Projection Methods for Model Order Reduction of Linear Dynamical Systems PRESENTER: Kapil Ahuja ABSTRACT. Simulation of large linear dynamical systems can be unmanageable due to high demands on computational resources. These large systems can be reduced into a smaller dimension by using Model Order Reduction (MOR) techniques. The reduced system has approximately the same characteristics as the original system, but it requires significantly less computational effort in simulation. MOR can be done in many ways. Interpolatory projection methods are a popular class of methods to do MOR. In these MOR algorithms, sequences of very large and sparse linear systems arise during the model reduction process. Solving such linear systems is the main computational bottleneck in efficient scaling of these MOR algorithms for reducing extremely large dynamical systems. Preconditioned iterative methods are often used for solving such linear systems. These iterative methods introduce errors because they solve the linear systems up to a certain tolerance. Hence, our focus is to analyze the stability of the underlying MOR algorithms when using inexact linear solves. In the under-consideration MOR algorithms, one can do a two-sided projection or a one-sided one. Earlier works have carefully done stability analysis for the former category of algorithms; however, they have not developed methods that satisfy the conditions coming out of such analysis. On the contrary, we perform stability analysis for the algorithms belonging to the latter category, where besides performing the mirroring analysis, we also develop methods that satisfy the underlying stability conditions. We specifically work with two popular algorithms on this category. One for non-parametric systems, that is, the Adaptive Iterative Rational Global Arnoldi algorithm or the AIRGA algorithm. Another for parametric systems, that is, the Robust Parametric Model Order Reduction algorithm or the RPMOR algorithm. We prove that, under four mild conditions, these algorithms are backward stable with respect to the errors introduced by these inexact linear solves. The analysis that we do in this work can be easily extended to other non-parametric and parametric interpolatory one-sided projection algorithms. Our first condition enforces the use of a Ritz-Galerkin based linear solver, where the residual of a linear system is made orthogonal to the corresponding Krylov subspace. Since Conjugate Gradient (CG) is the most popular method based upon the Ritz-Galerkin theory, we use it. Our second condition requires satisfying few extra orthogonalities. We show how to modify CG to achieve these extra orthogonalities and further demonstrate that using Recycling CG (RCG) helps us achieve these orthogonalities with no code changes and cheaply. Our third condition involves existence and invertibility of a matrix mostly dependent upon the input dynamical system, with the norm of this matrix bounded by one. Our fourth and final condition involves being able to compute a perturbation from the derived expression and bounding its norm by one as well. The last two conditions are easily satisfied by all our test dynamical systems. |
| 09:15 | Learning-enhanced preconditioning techniques for linear and nonlinear iterative methods ABSTRACT. In traditional preconditioned iterative methods (such as CG, GMRES, Newton) for solving linear or nonlinear algebraic systems, the preconditioner is usually constructed using the information of the "system operator", but not the solution. This is reasonable since the solution is not available when the preconditioner is initially constructed. However, once the iteration starts, more and more information about the solution becomes available, and some of the information can be used to enhance the preconditioner so that the rest of the computation can be made faster. Some algorithms and applications will be discussed in the talk. |
| 10:20 | Nodal Coarsening and Sparse Ideal Interpolation for H(curl) Problems in Algebraic Multigrid PRESENTER: Taoli Shen ABSTRACT. We propose a sparse interpolation construction and a practical coarsening algorithm for the algebraic multigrid (AMG) method, tailored towards H(curl). Building on the generalized AMG framework, we introduce an interior/exterior splitting that yields both a refinement-based and a fully algebraic construction of the interpolation. The refinement-based approach follows geometric hierarchy, while the purely algebraic interpolation is constructed through a coarsening process that first coarsens a nodal dual problem and then builds coarse and fine variables using a matching algorithm. We establish the weak approximation property and the commuting relation under certain assumptions. Combined with matching block smoothers, the proposed interpolation yields an effective algebraic multilevel method. Numerical experiments show robustness under strong coefficient jumps, where the proposed methods substantially outperform standard geometric multigrid. |
| 10:45 | On spectral clustering in algebraic multigrid methods PRESENTER: Jose Pablo Lucero Lorca ABSTRACT. We introduce a new 2 × 2-block solver, obtained by clustering the spectrum of a two-level method into two a priori known points and using its minimal polynomial for the direct solution of a general complex-valued square linear system. The algorithm features a coarse space given by the Schur complement, ideal restrictions and prolongations and block-Jacobi smoothers applied before and after the coarse correction. The smoothers are nonsingular and the algorithm symmetric; consequently, when the underlying system is Hermitian, the method preserves self-adjointness. The recursive application of the algorithm to the coarse problem yields an algebraic multigrid (AMG) scheme in which the coarse solve at each level may be carried out either explicitly, through the formula induced by the known minimal polynomial and requiring two recursive applications of the algorithm, or by a Krylov method, which converges in two iterations since the spectrum consists of two points. In either case, the resulting recursion defines a Krylov W-cycle. We provide an exhaustive analysis of the algorithm, calculating its invariant subspaces and showing that the error propagation operator can be studied separately for pairs of modes, which form bases of such invariant subspaces. This allows us to give an explicit choice of smoothing parameters that makes all the pairs of modes respond identically, forcing the nonzero eigenvalues of the error propagation operator to collapse to a single, a priori known value. We illustrate the theory with direct applications to general matrices and with analyses of representative matrices arising in numerical methods. |
| 11:10 | Sparse Matrix–Vector Multiplication for Algebraic Multigrid on the Cerebras Wafer-Scale Engine PRESENTER: Maya Taylor ABSTRACT. The sparsity structure in algebraic multigrid (AMG) presents challenges for accelerators and distributed architectures, particularly at the coarse levels of the multigrid hierarchy where opportunities for parallelism are limited. Emerging spatial dataflow accelerators, such as the Cerebras Wafer-Scale Engine (WSE), achieve high bandwidth and low memory latency, making them promising hardware targets for algorithms traditionally limited by communication bottlenecks. We investigate the WSE as a platform for sustaining high throughput across AMG grid levels. Specifically, we implement a sparse matrix–vector multiplication (SpMV) algorithm for unstructured matrices on the WSE and study how spatial dataflow architectures respond to the communication patterns characteristic of AMG. We examine how matrices from different coarsening levels map onto the hardware and present observations on how sparsity structure and problem size influence performance. These results provide early insight into the suitability of wafer-scale dataflow architectures for large-scale AMG workloads, where conventional implementations are often limited by diminishing parallelism. |
| 11:35 | Structure Preserving AMG Strategies for PDE Systems, with a focus on Stokes Equations PRESENTER: Jacob Schroder ABSTRACT. Algebraic multigrid (AMG) is a popular and effective solver for sparse linear systems arising from discretized partial differential equations (PDEs). While AMG is well developed for many scalar PDE problems, AMG is less developed for PDE systems, with state-of-the-art AMG performance for even “simple” systems such as Stokes, less robust than AMG performance for classic scalar problems such as Poisson’s equation. In this talk, we discuss using structure preserving techniques to apply AMG to Stokes equations, where two variable types exist (velocity and pressure). The structure preserving approach maintains consistent coarsening relationships between variable types on coarse levels. This also guides the construction of interpolation operators and helps explain the choice of Vanka relaxation. |
| 10:20 | Can Local Data Reveal Global Fluid Dynamics? ABSTRACT. Data assimilation aims to reconstruct the state of a dynamical system by combining partial observations with a mathematical model. For fluid flows governed by the incompressible Navier-Stokes equations, classical results show that coarse observations distributed across the entire spatial domain can recover the full flow through continuous nudging, known as the Azouani-Olson-Titi 2014 algorithm. In practice, however, sensor placement is often limited, and efficient reconstruction of turbulent flows requires strategic positioning of available measurements. In this talk, we challenge the standard framework by showing that it is possible to recover the full system dynamics using only local observations from a subregion of the domain. In particular, we demonstrate that achieving global accuracy does not necessarily require global data: carefully chosen localized observations can be sufficient to synchronize the model with the true flow. This naturally raises a fundamental question: given a physical domain, should the observational region be placed near the boundary or away from it? We discuss recent theoretical results and numerical experiments that aim to shed light on this question. |
| 10:45 | Do Well-Balancing and Second-Order Accuracy Matter for River and Compound Flood Modeling? PRESENTER: Wisang Sugiarta ABSTRACT. RDycore is a finite volume solver for the shallow water equations developed within the E3SM climate modeling framework to run efficiently across CPUs and GPUs via PETSc/DMPlex and libCEED. It implements two properties standard in high-fidelity SWE solvers: well-balancing with hydrostatic reconstruction at cell interfaces, which exactly preserves the lake-at-rest state over complex bathymetry, and second-order spatial accuracy with MUSCL reconstruction with slope limiting, which reduces numerical diffusion and preserves sharp wave fronts. Both were developed and validated primarily in ocean and tsunami modeling, where they are clearly necessary. Large bathymetric gradients, fast-moving waves, and propagation over hundreds of kilometers mean that even small per-step errors accumulate into globally significant artifacts. Applying RDycore to river and compound flooding raises the natural question of whether the same justification holds in this very different regime. River and compound flooding is shallow, subcritical and friction-dominated. Rather than fast wave propagation, the dynamics are governed by slow moving propagation resulting in the persistent wetting and drying of complex floodplain terrain and compound forcing from simultaneous overflow. These physical differences raise a concrete algorithmic question: do the added complexity and cost of well-balancing and second-order reconstruction translate to meaningfully better flood predictions in this regime, or does a simpler first-order scheme without special source term treatment suffice? Because both scheme variants in RDycore are expressed as Q-functions acting on the same element data structures, the additional cost of second-order reconstruction — gradient estimation, slope limiting, and extra interface states — is cleanly isolated from mesh and communication overhead, giving a precise measure of the per-timestep algorithmic cost difference. This is particularly relevant on GPU architectures, where the cost profile of these additional operations differs from CPU, and where the long and large E3SM time-scales and operational forecasting makes performance comparisons important. We present experiments spanning three levels of complexity: an idealized channel-floodplain system with gentle bed slope, rainfall intensity, and roughness contrast and the 2013 Boulder Creek compound flood, validated against real-world USGS gauge data and more. Scheme variants are compared on flood prediction metrics — peak depth, inundation extent, arrival time — and on wall-clock cost per timestep. An performance comparison illustrates the trade-off of the methods. |
| 11:10 | Riemann Solvers and Well-Balancing in Shallow Water Modeling with RDycore PRESENTER: Mohadeseh Khoshnevisan ABSTRACT. Shallow water equations over variable topography arise in applications such as inland flooding and coastal flows like storm surge. In these settings, numerical methods must balance accuracy with computational efficiency, particularly when targeting GPU-based implementations. In this work, we investigate the impact of approximate Riemann solvers and well-balancing on both accuracy and GPU execution within RDycore, a finite-volume package for compound flooding with GPU support through libCEED. We implement HLL and HLLC solvers and compare them with Roe on a set of test problems. We focus on how different solver formulations affect both numerical behavior and implementation complexity, including the role of branching in GPU execution. In addition, we consider configurations relevant to compound flooding, including cases with high Manning coefficient and narrow channel structures, where both accuracy and numerical stability are challenging. To examine steady-state behavior, we include lake-at-rest configurations as a baseline test for well-balanced discretizations. These tests are used to evaluate how solver choices and implementation details influence both performance and accuracy in practical settings. |
| 11:35 | A new performance landscape for implicit finite-element analysis of turbulence PRESENTER: Jed Brown ABSTRACT. Many rules of thumb in computational fluid dynamics are tacitly dependent on performance models associated with implicit solvers and resolution properties of low-order discretizations. We re-examine these in the context of HONEE, a GPU-based unstructured-mesh stabilized finite-element solver for scale-resolving simulation of low-Mach and compressible turbulence. HONEE is based on matrix-free Newton-Krylov methods, for which there is minimal cost of preconditioner assembly. We revise and contrast the performance model, considering the mesh resolution, aspect ratio, order of accuracy, and time step sizes associated with scale-resolving simulation regimes, and explore stiffness, tolerances, and implicit vs IMEX time integration. |