08:00 | A tensor POD-ROM for parameter-dependent dynamical systems ABSTRACT. The space-time-parameter structure of parametric dynamical systems encourages the application of tensor methods for defining reduced-order models. Within a tensor-based ROM framework, traditional dimension-reduction techniques such as the matrix SVD give way to a low-rank tensor decomposition (LRTD). This tensor extension of Galerkin proper orthogonal decomposition (POD) ROMs enhances both the practical efficiency of the ROM and its suitability for rigorous error analysis when applied to parametric PDEs. I intend to present the new LRTD-ROM framework, formulate an error estimate for an abstract linear parabolic problem, and explain related open problems in multi-linear algebra. |
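For orientation, the following minimal sketch (an illustration assumed for this listing, not the speaker's LRTD algorithm) contrasts the matrix-POD baseline, which flattens time and parameter into a single snapshot axis, with an HOSVD-style factorization that keeps the space, time, and parameter modes separate; the snapshot tensor and truncation ranks are made up.

```python
import numpy as np

# Hypothetical snapshot tensor: solution values at n_x spatial dofs,
# n_t time steps, and n_mu parameter samples.
rng = np.random.default_rng(0)
S = rng.standard_normal((200, 40, 15))          # (space, time, parameter)

# Matrix POD: flatten time and parameter into one snapshot axis, truncate the SVD by energy.
U, s, _ = np.linalg.svd(S.reshape(200, -1), full_matrices=False)
r = int(np.searchsorted(np.cumsum(s**2) / np.sum(s**2), 0.9999)) + 1
pod_basis = U[:, :r]                            # spatial reduced basis

# HOSVD-style low-rank tensor factorization: one truncated basis per mode plus a core.
def mode_basis(T, mode, rank):
    M = np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)
    Um, _, _ = np.linalg.svd(M, full_matrices=False)
    return Um[:, :rank]

U_x  = mode_basis(S, 0, r)      # in-space basis
U_t  = mode_basis(S, 1, 10)     # in-time basis
U_mu = mode_basis(S, 2, 5)      # in-parameter basis
core = np.einsum('xtm,xa,tb,mc->abc', S, U_x, U_t, U_mu)   # Tucker-style core
```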
08:25 | Constrained Tensor Decomposition for Low-rank Modeling of Multiphysics Simulation Data PRESENTER: Eric Phipps ABSTRACT. High-dimensional data is ubiquitous across scientific computing disciplines, including plasma physics, fluids, earth systems, and mechanics. Such data is naturally represented as a tensor, consisting of the value of each simulation variable at each point in space and time. Tensor decomposition methods, akin to matrix factorization methods for two-dimensional data, facilitate powerful analysis/reduction of such data, including data compression, surrogate modeling, pattern identification, and anomaly detection. However, existing tensor decomposition methods target simple statistical error metrics, most commonly least-squares loss, resulting in low-rank models of the data that fail to faithfully represent important physics quantities of interest or invariants arising from conservation principles. In this work, we explore new formulations of two common tensor decomposition methods, the Canonical Polyadic (CP) and Tucker decompositions, that attempt to better preserve these quantities by incorporating them directly in the optimization problems that define the resulting low-rank models. We then explore solving these optimization problems through penalty methods and investigate their ability to preserve these quantities compared to their overall reconstruction accuracy. Computational results of applying this approach to CP and Tucker decomposition of data arising from simulation of plasma physics and combustion will be presented. |
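As a point of reference, one plausible way to write this kind of constrained formulation (the exact objective used in the talk is not specified here) is to augment the standard CP least-squares loss with a quadratic penalty on a conserved quantity $q(\cdot)$, such as total mass or energy:
\[
\min_{A,B,C}\;\bigl\|\mathcal{X}-[\![A,B,C]\!]\bigr\|_F^2\;+\;\mu\,\bigl(q([\![A,B,C]\!])-q(\mathcal{X})\bigr)^2 ,
\]
where $[\![A,B,C]\!]$ denotes the rank-$R$ CP reconstruction and $\mu>0$ is a penalty weight; increasing $\mu$ enforces the invariant more strongly at the cost of a harder optimization problem.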
08:50 | Efficient incremental Tucker decomposition for streaming scientific data PRESENTER: Saibal De ABSTRACT. The Tucker decomposition has emerged as a popular format for compressing large datasets generated from high-fidelity scientific simulations. Several software packages (Tensor Toolbox, TuckerMPI) enable computing the Tucker decomposition of static data, but relatively few works address compressing a streaming tensor. In this work, we develop a streaming Tucker algorithm, tailored to scientific simulations where the data tensor grows along a single time-like dimension. At any point, we seek to update the existing factorization – the Tucker core and the factor matrices – with a new tensor slice, without accessing the already incorporated tensor slices in their raw, uncompressed form. Throughout this process, we ensure that a user-specified relative error tolerance is met. We present an implementation within the TuckerMPI framework, and apply it to both synthetic and simulated (combustion) datasets. By comparing against the standard (batch) algorithm, we show that our proposed approach provides significant improvements in terms of memory usage. If the Tucker ranks stop growing along the streaming tensor mode, our approach also requires less wall time than the batch version. |
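A toy illustration of the streaming idea (assumed for exposition; this is not the TuckerMPI algorithm and makes no attempt at rank optimality): keep orthonormal bases for the non-streaming modes and, for each incoming slice, append only the directions the current bases cannot represent to the requested tolerance. A real implementation would also initialize the bases from the first slice and periodically re-truncate the factors and core.

```python
import numpy as np

def grow_basis(U, T, tol, ref_norm):
    """Append directions of T that the orthonormal basis U cannot capture to tolerance."""
    R = T - U @ (U.T @ T)                         # component of T outside span(U)
    Q, s, _ = np.linalg.svd(R, full_matrices=False)
    keep = int(np.sum(s > tol * ref_norm))        # new directions above the tolerance
    return np.hstack([U, Q[:, :keep]]) if keep else U

def stream_update(U1, U2, cores, Y, tol):
    """Incorporate one new (I1 x I2) slice Y into a Tucker-like streaming factorization."""
    nrm = np.linalg.norm(Y)
    U1 = grow_basis(U1, Y,   tol, nrm)
    U2 = grow_basis(U2, Y.T, tol, nrm)
    # Zero-pad previously stored coefficient slices to the enlarged ranks.
    cores = [np.pad(C, ((0, U1.shape[1] - C.shape[0]), (0, U2.shape[1] - C.shape[1])))
             for C in cores]
    cores.append(U1.T @ Y @ U2)                   # compressed representation of the new slice
    return U1, U2, cores
```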
09:15 | Tensor Completion with BMD Factor Nuclear Norm Minimization PRESENTER: Fan Tian ABSTRACT. This paper is concerned with the problem of recovering third-order tensor data from limited samples. A recently proposed tensor BM-decomposition (BMD) method has been shown to efficiently compress third-order spatiotemporal data. Using the BMD, we formulate a slicewise nuclear norm penalized algorithm to recover a third-order tensor from limited observed samples. We develop an efficient alternating direction method of multipliers (ADMM) scheme to solve the resulting minimization problem. Experimental results on real data show our method to give reconstructions comparable to those of HaLRTC (Liu et al. 2012), a well-known tensor completion method, in about the same number of iterations. However, our method has the advantage of smaller subproblems and higher parallelizability per iteration. |
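In ADMM schemes of this type, the slicewise nuclear-norm subproblem typically reduces to singular value thresholding (the proximal operator of the nuclear norm) applied to each slice. The sketch below shows that generic building block only; it is not the paper's BMD-specific algorithm, and the function names are illustrative.

```python
import numpy as np

def svt(M, tau):
    """Proximal operator of tau*||.||_* : soft-threshold the singular values of M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def slicewise_svt(T, tau):
    """Apply SVT independently to each frontal slice of a third-order tensor T (I x J x K)."""
    return np.stack([svt(T[:, :, k], tau) for k in range(T.shape[2])], axis=2)
```

Because each slice is handled independently, this step parallelizes trivially across slices, which is in the spirit of the "smaller subproblems and higher parallelizability" claim above.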
09:40 | Fast iterative solvers in tensor butterfly format ABSTRACT. Iterative methods for large dense linear systems arising from the discretization of high-frequency wave equations and oscillatory integral operators can typically be accelerated via analytical and algebraic compressions such as fast multipole methods, hierarchical matrices, and butterfly algorithms. These algorithms can attain O(N log N) CPU complexity per iteration. Despite these successes, higher-dimensional problems can be handled more efficiently using tensor algorithms. This work builds upon a recently developed tensor extension of the matrix butterfly algorithm, called the tensor butterfly algorithm, to construct a fast iterative solver for oscillatory systems. The proposed algorithm integrates the tensor butterfly algorithm into a hierarchical tensor setting, where each iteration can perform the contraction in O(N) time. |
08:00 | Communication-Computation Overlapping for Parallel Multigrid Methods ABSTRACT. Preconditioned iterative methods based on the Krylov subspace technique are widely employed in various areas of scientific and technical computing. When utilizing large-scale parallel computing systems, the communication overhead tends to increase with the growth in the number of nodes, making its reduction a crucial challenge. In parallel finite element methods (FEM) and finite volume methods (FVM), halo communication and computation overlapping (CC-Overlapping) is commonly employed, often in conjunction with the dynamic loop scheduling feature of OpenMP. This approach has been primarily applied to sparse matrix-vector products (SpMV) and explicit solvers. Previous studies by the author have proposed reordering techniques for applying CC-Overlapping to processes involving global data dependencies, such as the Conjugate Gradient method preconditioned by Incomplete Cholesky Factorization (ICCG). Successful implementations on massively parallel supercomputers demonstrated high parallel performance, but the application of CC-Overlapping was limited to SpMV. In the present work, the author proposes a method to apply CC-Overlapping to the forward and backward substitutions of the IC(0) smoother of the parallel Conjugate Gradient method preconditioned by Multigrid (MGCG). Using up to 4,096 nodes on Wisteria/BDEC-01 (Odyssey) with A64FX, performance improvements of approximately 40% or more were achieved compared to the original implementation, while the improvement was 20% or more on 1,024 nodes of the Oakbridge-CX system with Intel Xeon CPUs. |
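The basic CC-Overlapping pattern can be illustrated with a minimal mpi4py sketch for a 1-D stencil update (an assumed toy example; the talk applies the idea to the forward/backward substitutions of an IC(0) smoother, which additionally requires the reordering techniques mentioned above): post non-blocking halo exchanges, update the interior that needs no ghost data, then finish communication and update the boundary points.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_local = 1000
u = np.random.rand(n_local + 2)              # owned values plus one ghost cell per side
u_new = np.empty_like(u)
left  = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

# 1) Post non-blocking halo exchange for the boundary values.
reqs = [comm.Irecv(u[0:1],   source=left,  tag=0),
        comm.Irecv(u[-1:],   source=right, tag=1),
        comm.Isend(u[1:2],   dest=left,    tag=1),
        comm.Isend(u[-2:-1], dest=right,   tag=0)]

# 2) Overlap: update interior points that do not depend on ghost data.
u_new[2:-2] = 0.5 * (u[1:-3] + u[3:-1])

# 3) Finish communication, then update the points adjacent to the ghost cells.
MPI.Request.Waitall(reqs)
u_new[1]  = 0.5 * (u[0] + u[2])
u_new[-2] = 0.5 * (u[-3] + u[-1])
```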
08:25 | Adaptive CSR Krylov Solvers on GPU PRESENTER: Stephen Thomas ABSTRACT. Our study introduces an optimization for Sparse Matrix-Vector Multiplication (SpMV) on GPU architectures, essential for high-performance computing (HPC) environments. This method revolves around a data packing strategy that necessitates doubling the storage for non-zero elements of the sparse matrix (2x nnz(A)), achieved by replicating the vector 'x'. This approach optimizes data alignment with the GPU's L1 cache, significantly reducing the indirect addressing overhead and facilitating a 2X improvement in Vector-CSR SpMV operations. Integrated with the Message Passing Interface (MPI), our technique enhances scalability and efficiency, making it well-suited for distributed computing in large-scale HPC systems. In particular, this optimization aids inexact Krylov subspace methods, such as Iterated Gauss-Seidel GMRES (IGS-GMRES). By dynamically adapting to the evolving sparsity patterns of Krylov vectors, our approach effectively reduces both computational and memory access costs without compromising solver convergence. Achieved through the replication of 'x', this strategy serves as a form of dimensional lifting based on the mathematics of arrays (MoA) paradigm due to Prof. L. Mullin. |
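The data-packing idea can be sketched in plain numpy (illustrative function names; numpy cannot reproduce the GPU L1-cache behavior that motivates the optimization, and the packed copy of 'x' must be rebuilt whenever 'x' changes, as it does for successive Krylov vectors): the baseline kernel gathers x through the column-index array, while the packed variant stores the needed x-values contiguously alongside the nonzeros, doubling the nonzero storage.

```python
import numpy as np
import scipy.sparse as sp

def csr_spmv(A, x):
    """Baseline CSR SpMV: gathers x through the column indices (indirect addressing)."""
    y = np.zeros(A.shape[0])
    for i in range(A.shape[0]):
        lo, hi = A.indptr[i], A.indptr[i + 1]
        y[i] = np.dot(A.data[lo:hi], x[A.indices[lo:hi]])
    return y

def pack_x(A, x):
    """Replicate x into a second nnz-length array aligned with A.data (2x nnz(A) storage)."""
    return x[A.indices]

def packed_spmv(A, xpack):
    """SpMV reading two contiguous nnz-length streams; no indirect gather in the inner loop."""
    prod = A.data * xpack
    y = np.zeros(A.shape[0])
    for i in range(A.shape[0]):
        y[i] = prod[A.indptr[i]:A.indptr[i + 1]].sum()
    return y

A = sp.random(500, 500, density=0.01, format="csr", random_state=0)
x = np.random.rand(500)
assert np.allclose(csr_spmv(A, x), packed_spmv(A, pack_x(A, x)))
```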
08:50 | Jacobian matrix construction for fluid flow problems on GPUs using automatic differentiation PRESENTER: Steven Hamilton ABSTRACT. Generating Jacobian matrices is often a challenging aspect of solving large-scale nonlinear systems of equations using Newton's method. Analytic derivation is complicated for many systems, and such derivations must be repeated whenever new physics models are added. Methods based on approximate Jacobian-vector products, such as the Jacobian-free Newton-Krylov method, are sensitive to the scaling of equations. In this talk, we discuss the use of automatic differentiation to construct Jacobian matrices for finite element discretization of compressible computational fluid dynamics equations using the Panzer and Sacado packages from the Trilinos library. Using a forward automatic differentiation approach with dynamically allocated derivative dimensions provides tremendous flexibility but results in potential performance issues for GPU architectures. Several strategies for mitigating such performance issues will be presented, with numerical results for both CPU and Nvidia GPU computing architectures. A brief overview of experience with GPU-enabled linear solvers will also be provided. |
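The talk targets the C++ Sacado and Panzer packages; as a language-agnostic illustration of the underlying idea, the sketch below uses Python/JAX forward-mode AD to assemble a dense Jacobian of a made-up two-equation residual and take one Newton step. The residual is purely hypothetical and stands in for a discretized PDE system.

```python
import jax
import jax.numpy as jnp

def residual(u):
    """Toy nonlinear residual standing in for a discretized PDE system R(u) = 0."""
    return jnp.array([u[0] ** 2 + u[1] - 3.0,
                      u[0] + jnp.sin(u[1]) - 1.0])

u = jnp.array([1.0, 1.0])
J = jax.jacfwd(residual)(u)              # forward-mode AD assembles the full Jacobian
du = jnp.linalg.solve(J, -residual(u))   # one Newton step
u = u + du
```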
09:15 | FleCSI: A Task-Based Framework for HPC ABSTRACT. FleCSI is a task-based, compile-time configurable framework designed to support multi-physics application development. As such, FleCSI provides a very general set of infrastructure design patterns that can be specialized and extended to suit the needs of a broad variety of solver and data requirements. FleCSI currently supports multi-dimensional mesh topology, geometry, and adjacency information, as well as n-dimensional hashed-tree data structures, graph partitioning interfaces, and dependency closures. This presentation will provide a basic introduction to the FleCSI programming model with multigrid and hydrodynamics example applications. |
09:40 | Accelerating model-based iterative reconstructions via Non-Uniform Fast Fourier Transforms and GPUs PRESENTER: Dinesh Kumar ABSTRACT. X-ray-based computed tomography is a well-established technique for determining the three-dimensional structure of an object from its two-dimensional projections. A major challenge in many experiments is that samples can be rapidly evolving or the field of view is restricted, limiting the number of projections that can be collected. In such cases, commonly used filtered back-projection methods, which directly approximate the inverse Radon transform, often result in poor-quality reconstructions. In comparison, iterative reconstruction algorithms, such as model-based iterative reconstructions (MBIR), have demonstrated considerable success in producing high-quality reconstructions under such restrictions by incorporating a priori knowledge as a model constraint. However, MBIR implementations have typically required high-performance computing resources with hundreds of compute nodes to solve the problem in a reasonable time, which may not be readily available to experimentalists. Here, we introduce TomoCAM, which reformulates the problem in Fourier space via the central slice theorem and leverages GPUs to speed up MBIR significantly. More specifically, TomoCAM computes the Radon transform and its adjoint through non-uniform fast Fourier transforms (NUFFTs), which have lower computational complexity than the ray-tracing methods traditionally used in MBIR. This is coupled with a q-generalized Gaussian Markov random field formulation of the total variation constraint, regularizing the problem with limited projection data. We employ a method based on Nesterov's accelerated gradient method to solve the resulting optimization problem. Furthermore, we exploit asynchronous memory transfers to hide PCIe latency and maximize throughput to the GPU and on-chip shared memory to minimize the expensive readouts from global memory. Using publicly available phantoms and experimental data, we demonstrate that TomoCAM produces results similar to previous MBIR implementations in reconstruction quality, and at the same time, it is 10-15 times faster. Thus, TomoCAM allows experimentalists to obtain high-quality reconstructions from limited projection data in a reasonable time using commonly available computing resources. |
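For readers unfamiliar with the optimizer mentioned above, a minimal numpy sketch of Nesterov's accelerated gradient method for an L-smooth convex objective follows (a generic textbook version applied to a toy least-squares problem, not TomoCAM's GPU implementation or its regularized objective).

```python
import numpy as np

def nesterov(grad, x0, L, iters):
    """Nesterov's accelerated gradient descent for an L-smooth convex objective."""
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(iters):
        x_new = y - grad(y) / L                              # gradient step at the look-ahead point
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)        # momentum extrapolation
        x, t = x_new, t_new
    return x

# Toy usage: least-squares problem min ||A z - b||^2 / 2.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((80, 30)), rng.standard_normal(80)
grad = lambda z: A.T @ (A @ z - b)
L = np.linalg.norm(A, 2) ** 2                                # Lipschitz constant of the gradient
z = nesterov(grad, np.zeros(30), L, 500)
```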
10:25 | Direct interpolative construction of quantized tensor trains ABSTRACT. Quantized tensor trains (QTTs) have recently emerged as a framework for the numerical discretization of continuous functions, with the potential for widespread applications in numerical analysis, including rank-structured solvers and preconditioners based on "quantum-inspired" algorithms such as DMRG. However, the theory of QTT approximation is not fully understood. We advance this theory from the point of view of multiscale polynomial interpolation. This perspective clarifies why QTT ranks decay with increasing depth, quantitatively controls QTT rank in terms of smoothness of the target function, and explains why certain functions with sharp features and poor quantitative smoothness can still be well approximated by QTTs. The perspective also motivates new practical algorithms for the construction of QTTs. Finally, we leverage the perspective of multiscale interpolation to offer the first direct construction of the fast Fourier transform (FFT) as a QTT operator, equipped with a priori compression guarantees. |
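For orientation, the conventional SVD-based construction of a QTT from samples of a univariate function on $2^d$ points looks as follows (a dense TT-SVD sketch with a simplistic per-step truncation rule; the talk's interpolative construction specifically avoids forming and sweeping over the full sample vector).

```python
import numpy as np

def qtt_svd(f_vals, tol):
    """Quantize a length-2**d vector into d binary modes and compress by sequential SVDs."""
    d = int(np.log2(f_vals.size))
    cores, r = [], 1
    C = f_vals.reshape(r, -1)
    for _ in range(d - 1):
        C = C.reshape(r * 2, -1)
        U, s, Vt = np.linalg.svd(C, full_matrices=False)
        keep = max(1, int(np.sum(s > tol * s[0])))      # simple relative truncation per step
        cores.append(U[:, :keep].reshape(r, 2, keep))
        C = (s[:keep, None] * Vt[:keep]).reshape(keep, -1)
        r = keep
    cores.append(C.reshape(r, 2, 1))
    return cores                                        # list of d TT cores, ranks = QTT ranks

# Usage: a smooth function sampled on 2**10 points compresses to small QTT ranks.
x = np.linspace(0.0, 1.0, 2 ** 10)
cores = qtt_svd(np.cos(8.0 * np.pi * x), 1e-10)
print([c.shape for c in cores])
```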
10:50 | Quantized tensor trains for solving Vlasov-Maxwell equations ABSTRACT. Tensor networks have been used with great success in the quantum chemistry community, solving a wide range of numerical simulation problems such as finding extremal eigenvalues and performing time integration, with reasonable accuracy while significantly reducing computational costs. For quantized tensor networks (QTNs) of rank $D$ representing problems of size $N$, computational costs typically scale like $\mathcal{O}(\text{poly}(D) \log(N))$. Here, we investigate analogous algorithms in the context of classical PDEs using the QTN format, a multi-scale low-rank framework which enables efficient calculation and manipulation of multi-dimensional datasets. In particular, we demonstrate solving the Vlasov-Maxwell equations using both explicit and semi-implicit time integration, showing results for common test problems in up to five dimensions. We find that QTNs of rank $D=64$ appear to be sufficient for capturing the expected physics, despite the simulations using a total of $2^{36}$ grid points and thus requiring $D=2^{18}$ for exact calculations. |
11:15 | Tensor Networks Space-Time Spectral Method for Solving the Heat Equation PRESENTER: Dibyendu Adak ABSTRACT. In this paper we present a novel approach combining a space-time spectral collocation method with the tensor-train format [1, 2] for solving the heat equation. Instead of using traditional time-stepping schemes, we use a space-time formulation, which produces a single linear system for the solution at all time steps that needs to be solved only once. The time dimension is treated in the same way as a spatial dimension and is discretized with Chebyshev collocation points. This space-time discretization achieves much higher accuracy with a small number of grid points. The disadvantage of solving a large four-dimensional linear system is overcome by using the tensor-train format, as the complexity in the tensor-train format grows only linearly with the number of dimensions. Finally, we also show how to incorporate inhomogeneous boundary conditions as a low-rank tensor structure, which is crucial for conversion to the tensor-train format. The superior performance of the method is demonstrated on problems with smooth and non-smooth solutions. |
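To make the space-time idea concrete, the sketch below builds a dense Kronecker-structured space-time operator for the 1-D heat equation $u_t - \nu u_{xx} = f$ on Chebyshev collocation points (the differentiation matrix follows Trefethen's cheb.m). This is an assumed illustration: the talk works in 3-D plus time and keeps the operator in tensor-train form rather than forming it densely, and boundary/initial conditions would still have to be imposed, e.g. by replacing rows.

```python
import numpy as np

def cheb(n):
    """Chebyshev differentiation matrix and points on [-1, 1] (Trefethen, Spectral Methods in MATLAB)."""
    if n == 0:
        return np.zeros((1, 1)), np.array([1.0])
    x = np.cos(np.pi * np.arange(n + 1) / n)
    c = np.hstack([2.0, np.ones(n - 1), 2.0]) * (-1.0) ** np.arange(n + 1)
    X = np.tile(x, (n + 1, 1)).T
    dX = X - X.T
    D = np.outer(c, 1.0 / c) / (dX + np.eye(n + 1))
    D -= np.diag(D.sum(axis=1))
    return D, x

Dt, t = cheb(16)          # time collocation
Dx, x = cheb(32)          # space collocation
nu = 0.1
# Unknowns ordered as vec of the (space x time) solution matrix, space index fastest:
A = np.kron(Dt, np.eye(Dx.shape[0])) - nu * np.kron(np.eye(Dt.shape[0]), Dx @ Dx)
```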
11:40 | Rank reduction of graph Laplacians via column selection PRESENTER: Mark Fornace ABSTRACT. Graph Laplacians appear commonly in machine learning, network systems, and the modeling of chemical kinetics. Each such Laplacian corresponds to a random walk on the respective graph. We first explain how model reduction of the associated Markov chain motivates the task of column selection for Nystrom approximation of the inverse Laplacian. We then develop a methodology for matrix-free column selection leveraging fast Laplacian solvers and randomized trace estimation. Finally, we discuss the framework’s theoretical guarantees, empirical performance, and applicability to the general class of positive semidefinite matrices. |
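The Nyström approximation at the heart of this task is compact enough to sketch directly (illustrative names; the talk selects the columns matrix-free via fast Laplacian solvers and randomized trace estimation, and applies the approximation to the inverse Laplacian rather than to an explicitly stored matrix).

```python
import numpy as np

def nystrom(K, cols):
    """Nystrom approximation of a PSD matrix K from a subset of its columns: K ~ C W^+ C^T."""
    C = K[:, cols]                      # selected columns
    W = K[np.ix_(cols, cols)]           # principal submatrix on the selected indices
    return C @ np.linalg.pinv(W) @ C.T

# Toy usage with a random PSD matrix and an arbitrary column subset.
rng = np.random.default_rng(0)
B = rng.standard_normal((100, 100))
K = B @ B.T
K_approx = nystrom(K, cols=[3, 17, 42, 58, 91])
```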
12:05 | Data-parallel tensor decomposition algorithms in the tensor-train format PRESENTER: Tianyi Shi ABSTRACT. The tensor-train (TT) format is a low-rank tensor representation frequently used for high-order tensors. Traditionally, the TT format is computed directly with all the elements in the tensor. In this talk, we propose a parallel algorithm framework that partitions the tensor into subtensors and performs decompositions individually before merging the results back together. This type of factorization routine is ideal for distributed-memory parallelism, and we have developed a couple of algorithms suitable for this framework. In particular, we focus on subtensor TT cross, a data-centric iterative method for data-parallel TT decomposition. We provide theoretical guarantees, and scaling and communication analysis. Strong scaling results on synthetic and real-world datasets suggest that this algorithm scales well with the number of computing cores, with respect to both storage and timing. |
10:25 | Parallel Space-time Implicit Runge-Kutta Methods and Tensor-structure-preserving Schwarz Preconditioning for Parabolic Problems PRESENTER: Jing-Yuan Wang ABSTRACT. The classical implicit Runge-Kutta (IRK) method is a powerful method with desirable accuracy and stability properties, but it is rarely used in practical applications because of its high computational cost and difficulties in preconditioning, especially on 3D unstructured grids. In this paper, we introduce a space-time coupled IRK scheme that offers high-order accuracy and stability, and also has a high degree of parallelism in both space and time based on one- and two-level tensor-structure-preserving overlapping Schwarz methods. The proposed method is studied numerically for large 3D problems, and we show that the convergence rate depends only mildly on the mesh size, the time step size, the number of processors, and the window size, with the right choice of the coarse mesh size and the subdomain solver. We compare the proposed method with the classical method, and the parallel performance, including the strong/weak scalability and computational time, indicates that the new method outperforms the classical method when the number of processors is large and the space-only parallelization of the classical method becomes a limiting factor. |
10:50 | Performance-portable p-multigrid block preconditioning for mixed FE discretizations of nearly incompressible hyperelasticity PRESENTER: Rezgar Shakeri ABSTRACT. The constitutive equation for two well-known mixed linear elasticity formulations is written in terms of either the symmetric strain tensor $\bm \varepsilon(\bm u)$ and a pressure-like variable $p$, or the deviatoric strain tensor $\bm\varepsilon_{\text{dev}}(\bm u)$ and the hydrostatic pressure $p$. We introduce a continuous family of $\bm u-p$ mixed formulations for nearly incompressible linear elasticity that includes both of these formulations. Then, we extend it to hyperelastic incompressible materials. High-order mixed finite elements with continuous and discontinuous pressure in three dimensions are used to discretize the mixed formulation, resulting in a saddle point system that is solved using block preconditioners with matrix-free $p$-multigrid using Ratel. Ratel is a portable solid mechanics library that uses operators from libCEED and solvers from PETSc to provide fast, efficient, and accurate simulations on next-generation architectures. We study the efficiency of block preconditioners within the family of mixed discretizations, with and without augmented Lagrangian techniques, using matrix-free $p$-multigrid. Solver efficiency, robustness, and parallel scalability on CPU and GPU systems are assessed on multiscale lattice structures for a range of problem parameters and finite element orders. |
11:15 | Low-order Preconditioner for Steady State Advection Diffusion Equation Under Spectral Element Discretization ABSTRACT. Simulating steady-state solutions of advection-diffusion problems has been of great interest to engineers in order to bypass long transients in thermal-fluids applications in complicated domains. The steady advection-diffusion equation is the main system for conjugate heat transfer and for the linearized incompressible Navier-Stokes equations. One challenge in developing a preconditioner for such problems is the boundary layer caused by small diffusivity, which raises extra stability issues. In this work, we deploy a low-order preconditioner for the steady advection-diffusion operator based on the existing collocation points of the spectral element method (SEM). It is feasible to construct the full preconditioner since the number of nonzeros of the low-order matrix in $d$ space dimensions is only $O(N^d)$ rather than $O(N^{2d})$ for the $N$th-order SEM matrix (which is not formed). In the low-order formulation, the convective term is dealiased with a 3-point Gauss rule in order to ensure skew-symmetry. Several local reordering strategies are discussed to reduce the fill-in of the LU or ILU solver. We provide numerical results for parameter studies and compare our solver with traditional time-advancing schemes for benchmark cases from nuclear engineering applications. |
11:40 | A parameter-robust preconditioner for Stokes-Darcy coupling problems ABSTRACT. In this talk, we study parameter-robust preconditioners for the coupled Stokes-Darcy system. Existing preconditioners usually involve a fractional operator due to the introduction of Lagrange multipliers or the reduction to a system including only interface flux variables. We aim to develop a parameter-robust preconditioner that does not involve a fractional operator. To this end, we strongly impose the normal flux continuity at the interface in the weak formulation. Following the well-known operator preconditioning framework, we prove the well-posedness of the coupled problems with respect to an appropriately chosen weighted norm and then develop the parameter-robust preconditioner based on the Riesz map. Moreover, we develop an inexact version of the preconditioner by utilizing multilevel techniques. Finally, numerical tests confirm the robustness of the block preconditioners with respect to the physical and discretization parameters without the need for any fractional operators. |
12:05 | Transformed primal-dual methods with variable-preconditioners for nonlinear partial differential equations PRESENTER: Jingrong Wei ABSTRACT. This paper introduces a novel Transformed Primal-Dual with variable-metric/preconditioner (TPDv) algorithm, designed to efficiently solve affine constrained optimization problems common in nonlinear partial differential equations (PDEs). Diverging from traditional methods, TPDv iteratively updates time-evolving preconditioning operators, enhancing adaptability. The algorithm is derived and analyzed, demonstrating global linear convergence rates under mild assumptions. Numerical experiments on challenging nonlinear PDEs, including the Darcy-Forchheimer model and a nonlinear electromagnetic problem, showcase the algorithm's superiority over existing methods in terms of iteration numbers and computational efficiency. The paper concludes with a comprehensive convergence analysis. |
16:30 | Stagnation-shortening inexact Newton method based on unsupervised learning for highly nonlinear hyperelasticity problems on three-dimensional unstructured meshes PRESENTER: Yujie Gong ABSTRACT. The inexact Newton method is widely used in many scientific and engineering applications, but in some situations its convergence involves a long stagnation period in which the nonlinear residual does not decrease much and a large percentage of the total compute time is wasted. The reasons for the stagnation are quite difficult to quantify. In this paper, we propose a stagnation-shortening inexact Newton algorithm with a stagnation analysis phase during the Newton iterations, in which the norms of the residuals are used as a guiding curve and the vector values of the residuals are centralized and decomposed into a slow subspace and a regular subspace using an unsupervised learning method based on principal component analysis. A subspace Newton iteration is then performed in the slow subspace, and its solution is projected back to the global space. We show numerically that with such an embedded learning phase the inexact Newton method converges almost quadratically. As an application, we consider the modeling of the human artery with stenosis using the hyperelasticity equation with multiple material parameters. Due to the significant difference in the material coefficients between the plaques and the healthy parts of the blood vessels, the problem is highly nonlinear and difficult to solve. Numerical experiments demonstrate that the proposed method significantly reduces the number of nonlinear iterations and is robust for this rather tough hyperelasticity problem. |
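The PCA step described above can be sketched in a few lines (an assumed illustration with hypothetical names, not the authors' implementation): stack the stored residual vectors, center them, and take the leading right singular vectors as the "slow" subspace basis.

```python
import numpy as np

def slow_subspace(residual_history, k):
    """Centralize stored Newton residual vectors and extract the k dominant directions via PCA."""
    R = np.asarray(residual_history)          # shape (m, n): m Newton iterates, n unknowns
    Rc = R - R.mean(axis=0)                   # centralization
    _, _, Vt = np.linalg.svd(Rc, full_matrices=False)
    return Vt[:k].T                           # n x k orthonormal basis of the slow subspace

# A subspace Newton step restricted to V would then solve the reduced system
#   (V.T @ J @ V) dz = -V.T @ r,   and update   u += V @ dz,
# where J and r are the current Jacobian and residual.
```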
16:55 | Efficient Solution for Fully Implicit Runge-Kutta Methods for Time-dependent Hyperbolic PDEs PRESENTER: Aman Rani ABSTRACT. We focus on implicit Runge-Kutta (IRK) time stepping of time-dependent hyperbolic PDEs. IRK time stepping of PDEs leads to linear systems with a Kronecker product structure involving the Butcher table and the mass and stiffness matrices. The system arising from IRK time stepping applied to the wave equation is highly ill-conditioned but has a block structure we exploit. If $N$ is the number of degrees of freedom for the discretization of the PDE, then the resulting matrix after IRK time stepping is a block $s\times s$ matrix where each block is of size $2N \times 2N$. Traditionally, preconditioning techniques are used for solving these systems, but we reformulate the large $2Ns \times 2Ns$ block-structured linear system as a Sylvester matrix equation. This leads to $s$ block systems of order $2N \times 2N$, which are solved using $2$ AMG V-cycles. As we increase the Runge-Kutta stages and refine the mesh, our method is twice as fast compared to other existing methods and takes fewer AMG V-cycles. |
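The Kronecker-to-Sylvester connection rests on the identity $\mathrm{vec}(AXB) = (B^{T}\otimes A)\,\mathrm{vec}(X)$. In a schematic form (the exact scaling and block structure depend on the discretization and are not reproduced here), a stage system of the type
\[
\bigl(I_s \otimes K \;+\; A_{\mathrm{B}}^{T} \otimes M\bigr)\,\mathrm{vec}(U) \;=\; \mathrm{vec}(F)
\quad\Longleftrightarrow\quad
K\,U \;+\; M\,U\,A_{\mathrm{B}} \;=\; F,
\]
where $A_{\mathrm{B}}$ is the (time-step-scaled) Butcher matrix, $M$ and $K$ are the mass and stiffness matrices, and the columns of $U$ collect the stage vectors. Diagonalizing or block-triangularizing $A_{\mathrm{B}}$ on the right then decouples the matrix equation into $s$ smaller systems of the size quoted above.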
17:20 | Fast coarse grid solvers for the solution of exascale Poisson equations PRESENTER: Thilina Ratnayaka ABSTRACT. p-multigrid (pMG) is a widely used multilevel preconditioner for high-order spectral element and finite element methods (SEM/FEM). pMG consists of a series of nested meshes with decreasing polynomial orders, starting from a fine mesh with a higher order (usually p=7--8) down to a coarse mesh with a lower order (usually p=1). In the SEM, which is essentially a nodal-based matrix-free finite element method, finer-level relaxations are performed with fast tensor-product operations that are largely local and require only nearest-neighbor communication. However, solving the coarse grid problem, which is essential for fast convergence, requires global all-to-all communication since the inverse of the coarse operator is completely dense. The size of the coarse problem, which comprises p=1 elements on an unstructured mesh, is approximately equal to the number of elements in the SEM mesh. With exascale platforms, it is not uncommon to have more than a billion elements, so the communication-intensive coarse-grid problem presents a formidable computational challenge. Fast coarse solvers for pMG are essential for high-performance SEM. The author presents a scalable two-level Schwarz method for the solution of the coarse problem that targets GPU-based exascale systems. The levels consist of a fully local problem in the p=1 space that is solved in parallel and which requires only a pair of near-neighbor data exchanges pre- and post-solve. This local solve is followed (either additively or multiplicatively) by a reduced-space solve on a very coarse system having approximately 4P unknowns, where P is the number of MPI ranks (i.e., GPUs). This reduced-space problem is solved cooperatively by all the processes. A novel contribution of this work is the introduction of a structured, non-nested space for the reduced-space problem, which enables communication-free interpolation between the p=1 and reduced spaces. We design the reduced space such that it is small enough to be solved using a fast, communication-minimal, sparse direct method while still retaining reasonable approximation properties compared to algebraic multigrid (AMG). Our sparse direct solver completes in a time that is comparable to a single all-reduce. We demonstrate the effectiveness of the proposed two-level Schwarz method by comparing it to the state-of-the-art AMG solver BoomerAMG through a series of experiments using NekRS, a highly scalable incompressible Navier-Stokes solver for GPUs. Results are presented on the Frontier exascale supercomputer at the Oak Ridge Leadership Computing Facility. |