ABSTRACT. Multigrid solvers are well suited to large massively parallel computer architectures, because they are mathematically optimal and display excellent parallelization properties. Since current architecture trends such as GPUs are favoring regular compute patterns to achieve high performance, the ability to exploit problem's structure has become much more important. In this context, an alternative to standard sparse matrix representations that is under development in Hypre is the semi-structured matrix class, which is essentially described in terms of stencils, logically rectangular grids and unstructured connections. In this talk, we will discuss our efforts on extending the semi-structured matrix class to deal with rectangular matrices as well as on implementing a new semi-structured multigrid solver based on these new data structures.

Prepared by LLNL under Contract DE-AC52-07NA27344.

xSDK: Toward a Community Software Development Kit (SDK) for High-Performance Computational Science

ABSTRACT. Numerical software libraries have proven effective in providing widely reusable software that is robust, efficient, and scalable – delivering advanced algorithms and data structures that enable scientific discovery for a broad range of applications. However, software complexity is increasing due to multiphysics and multiscale modeling, the coupling of simulations and data analytics, and the development of increasingly complex architectures. Applications increasingly require the combined use of independently developed software packages, which have diverse sponsors, priorities, and processes for development and release.

These challenges create the unique opportunity to fundamentally change how scientific software is designed, developed, and sustained---with explicit work toward scientific software ecosystems. This presentation will introduce the xSDK, or Extreme-scale Scientific Software Development Kit, where community-defined policies are increasing the quality and interoperability across numerical libraries as needed by the DOE Exascale Computing Project. The community policies developed through the xSDK project, their realized and expected benefits, and lessons learned from adoption of these policies into existing packages will be presented, as well as new features including multi-library examples to demonstrate interoperability among member packages. In particular, this talk will highlight current progress and plans for increasing interoperability among member iterative solver libraries.

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

Finite Element Tools for Performance Portability of Implicit and IMEX Simulations on Next Generation Architectures

ABSTRACT. Supporting scalable and performant finite element simulations across next generation heterogeneous architectures can add significant code complexity. This presentation will discuss the design of general finite element assembly tools applied to CFD, magnetohydrodynamics, and multi-fluid plasma simulations. Performance portability is achieved via the Kokkos programming model. The assembly library uses a directed acyclic graph for composable physics kernels in a multiphysics setting. Mixed basis finite element formulations are used to enforce physics constraints for Maxwell’s equations. Embedded automatic differentiation, applied via templates and operator overloading, is used for generating machine precision sensitivities for implicit and IMEX solvers. Performance results will be shown for NVIDIA GPU, and Intel Haswell architectures.

Hyper-Differential Sensitivity Analysis for Inverse Problems.

ABSTRACT. High fidelity models used in many science and engineering applications couple multiple physical states and parameters. Inverse problems arise when model parameters cannot be determined directly, but rather are estimated using (typically sparse and noisy) measurements of the states. The data are usually insufficient to simultaneously inform all of the parameter and consequently, the governing model typically contains parameters which are uncertain but must be specified for a complete model characterization necessary to invert for parameters of interest. We refer to the combination of the additional model parameters (those which are not inverted for) and the measured data states as the “complementary parameters”. We seek to quantify the relative importance of these complementary parameters to the solution of the inverse problem. To address this, we present a framework based on hyper-differential sensitivity analysis (HDSA), which computes the derivative of the solution of an inverse problem with respect to complementary parameters. We present a mathematical framework for HDSA in large-scale PDE-constrained inverse problems and show how HDSA can be interpreted to give insight about the inverse problem. We demonstrate the effectiveness of the method on a numerical example by evaluating uncertainty in boundary conditions, source terms, and model parameters.

ABSTRACT. Deformable image registration is essential to both diagnostics and therapeutics in precision and personalized medicine. Existing software packages for deformable registration either provide only a forward deformation vector field (DVF) or render both forward and backward DVFs with time-intensive computation. The latter has multiple advantages in medical image analysis; however, the transition of its benefits to clinical applications is hindered by its latency, which is substantially longer than clinical time windows. We introduce a novel DVF inversion method to facilitate the transition with algorithmic innovation. In the new approach, we utilize efficient existing software packages for forward DVF estimation. We introduce an efficient algorithm to invert the DVF to obtain the backward DVF at a high computation speed comparable to the forward DVF generation, and at high accuracy in inverse consistency. The key conceptual and algorithmic innovation is adaptive use of forward and backward inverse consistency (IC) residuals as feedbacks for inversion. Our previous work introduced the adaptive use of forward IC residuals to achieve global convergence. Building upon our previous work, we introduce a novel convergence acceleration stage in our algorithm employing the backward IC residual. The use of backward IC residuals is original. The iteration with backward IC residuals can be seen as an implicit Newton iteration, by convergence rate analysis. Experiment results with patient DVFs obtained by registering patient thoracic CT images show many order reductions in the magnitude of inverse-consistency
residuals in comparison to the state-of-the-art method. A fast and accurate DVF inversion algorithm has the potential to improve many other medical image analysis tasks such as 4D CT reconstruction, population atlas construction, and dose accumulation calculation, to name a few. The potential impact of DVF inversion in medical image analysis is yet to be fully explored.

Dynamic Non-Uniform Randomization in Asynchronous Linear Solvers

ABSTRACT. One approach to improving the performance of stationary linear solvers is to utilize asynchronous communication
among the processors. In order to establish bounds on their convergence rate, randomized variants of asynchronous
linear solvers have been studied. This approach has been examined previously for the case where the random selection
is done uniformly [5, 1]. A non-uniform random selection has been also used, albeit for the case of synchronous
algorithms [3, 2], such that the probability of selecting any given component remains fixed through the algorithm and
does not dynamically change. The idea behind the solvers considered here is for each processor to select the next
component to update randomly, using a distribution that more heavily weights selection of components that may be
more important to the solution in a certain sense. In particular, we evaluate the viability of changing the component
ranking dynamically and investigate the effect of various distributions in selecting from the component list.
This updating procedure is motivated by the Southwell iteration, which selects the component with the largest
contribution to the residual at each iteration and which can converge in fewer iterations compared with the traditional
relaxation schemes. Previously, it has been also shown that the Southwell-type iterations may converge faster than
than those using the uniform random selection [4]. There is a balance between the extra computational time required
to intelligently select components, and the savings in total iterations. Updating all the residuals every iteration likely
introduces too much computational overhead to be of practical use, and techniques are explored for lessening this
computational burden.

References
[1] Haim Avron, Alex Druinsky, and Anshul Gupta. Revisiting asynchronous linear solvers: Provable convergence
rate through randomization. Journal of the ACM (JACM), 62(6):51, 2015.
[2] Michael Griebel and Peter Oswald. Greedy and randomized versions of the multiplicative schwarz method. Linear
Algebra and its Applications, 437(7):1596–1610, 2012.
[3] Dennis Leventhal and Adrian S Lewis. Randomized methods for linear constraints: convergence rates and condi-
tioning. Mathematics of Operations Research, 35(3):641–654, 2010.
[4] Julie Nutini, Mark Schmidt, Issam Laradji, Michael Friedlander, and Hoyt Koepke. Coordinate descent converges
faster with the gauss-southwell rule than random selection. In International Conference on Machine Learning,
pages 1632–1641, 2015.
[5] John C Strikwerda. A probabilistic analysis of asynchronous iteration. Linear algebra and its applications,
349(1-3):125–154, 2002.

Smoothing by asynchronous iterations and sparse approximate inverses for fluid dynamics on graphics processing units

ABSTRACT. This work considers the problem of fine-grain parallel smoothing in multigrid solvers for compressible turbulent flow problems. The methods considered include asynchronous point-block incomplete LU factorization (ABILU), asynchronous block triangular solvers and incomplete sparse approximate inverse (ISAI) triangular solvers. Previous work in this area includes chaotic relaxation [Chazan and Miranker, "Chaotic relaxation",Linear Algebra Appl. 2, 1969] and asynchronous ILU factorization [Chow and Patel, "Fine-grained parallel incomplete LU factorization", SIAM J. Sci. Comput. 37:2, 2015] for scalar elliptic problems and for segregated momentum and pressure equations of the incompressible Navier-Stokes equations [Hawkes et al., "Chaotic linear-system solvers for unsteady CFD", MARINE, 2015].

The authors have demonstrated experimentally that ABILU factorization combined with ISAI can be effective for the highly coupled compressible Reynolds-averaged Navier-Stokes equations on many-core central processing units [Kashi and Nadarajah, "Fine-grain parallel smoothing by asynchronous iterations and incomplete sparse approximate inverses for computational fluid dynamics", AIAA Scitech 2020 Forum]. The current work extends this to graphics processing units, presents some numerical results and analyzes the smoothing property of the iteration in this context. Numerical results on three-dimensional benchmark cases of external aerodynamics on multi-block structured grids will be presented.

Redundancy as a Remedy for Delayed Communication in Asynchronous Solvers

ABSTRACT. Recent years have seen the proliferation of smart network-connected devices that exist on the ”edge” of large control systems that are capable of distributed calculations, e.g., the power grid. Additionally, these devices communicate unreliably with the network, meaning that changes in communication should not halt the entire distributed calculation. In order to remove these kinds of vulnerabilities, we need resilient algorithms to implement on decentralized infrastructure networks. This motivates the study of decentralized, distributed algorithms that are robust to the underlying unreliable computing environment. In this talk, we present a parallel asynchronous Jacobi method for solving linear systems of equations where each process is responsible for updating and distributing several components of the solution vector. This introduces redundant computations to the asynchronous algorithm, adding increased resilience against communication delay, device failure, and possible changes in the communication network topology.

Block preconditioners for the the coupled Darcy-Stokes problem

ABSTRACT. The coupled Darcy-Stokes problem is an important problem in fluid flow
modeling. Upon finite element discretization, the resulting linear system
is of double saddle point type. We investigate different preconditioners
including block triangular and augmented Lagrangian-based ones. We
present a spectral analysis and field-of-value analysis of the exact versions
of these preconditioners, and present the results of numerical experiments
illustrating the performance of inexact variants of these methods on 3D
problems with large jumps in the permeability coefficients.

This is joint work with Fatemeh Beik (Vali-e-Asr University of Rafsanjan, Iran).

Block Preconditioning for Implicit Runge-Kutta Methods for Time-Dependent

ABSTRACT. Many important engineering and scientific systems require the solution of
time-dependent PDE systems. Many of these systems have specific stability
needs, such as needing A-stable or L-stable methods, to compute realistic solutions. For example, the Eddy Currents Equation is a stiff parabolic
PDE with a possible rapid transient, and benefits from an L-stable method.
Similarly, the incompressible Navier–Stokes equations for fluid flow are differen-
tial algebraic equations (DAE), and can be viewed as infinitely stiff.
L-stable time stepping methods can be beneficial here as well.

Certain classes of implicit Runge-Kutta (IRK) time-stepping methods, such
as the Radau I and Radau II methods provide L-stability, but one price of IRK methods is needing to solve large linear systems at each time step.
Suppose, for example, our PDE has been linearized and discretized with N
degrees of freedom. Using an s-stage IRK method leads to an sN by sN linear
system that must be solved at each time step. These systems are block-(s-by-s)
systems, where each block is N by N.

In this talk, we investigate preconditioners for such systems. In some cases,
we can take advantage of known structure in the subblocks. For example, in
time-dependent incompressible Navier–Stokes equations, each subblock is
related to a linear system from the steady-state fluid flow equations, for which
there are several effective known preconditioners.

Unconstrained Energy Functional Methods for Massively Parallel Iterative Eigensolvers

ABSTRACT. Many scientific applications require the solution of large scale eigenvalue problems.
For applications where a small percentage of the eigenpairs are required, rather
than the full spectrum, iterative rather than direct eigensolvers are typically used.
Often, the parallel scaling of these types of iterative eigensolvers on modern massively parallel
multi-core computers is limited by the reorthogonalization step, which typically uses
direct diagonalization of the subspace matrix, Cholesky or QR decomposition. The use of
unconstrained avoids the reorthogonalization step, so that the trial vectors are only
orthogonal at the minimum, leading to better parallel scaling.
The unconstrained formulation is implemented in the
first-principles materials and chemistry CP2K code, which performs electronic structure
calculations based on a density functional theory approximation to the solution of the
many-body Schrödinger equation. The systems we use in our studies have a number of atoms
ranging from 2,247 to 12,288. We study the convergence of the unconstrained formulation
with different preconditioners and its scaling on a Cray XC40 at NERSC (a partition with
9,688 Intel KNL nodes).

A System of Semilinear Singularly Perturbed Convection-Diffusion type

ABSTRACT. In case of solving system of semilinear singularly perturbed problems, the two significant steps have to be analyzed: i) constructing parameter-uniform difference schemes; ii) obtaining reliable and efficient numerical methods for solving nonlinear discrete problems. The monotone methods are useful tool for solving such nonlinear discrete problems. The nonlinear discrete problem is formulated as a nonlinear systems of algebraic equations and to solve these nonlinear systems of algebraic equations, we require some monotone method which leads to an iterative method for the computation of the numerical solutions.\vspace{.5cm}\\
We consider the following system of $M(\geq2)$ singularly perturbed semilinear convection-diffusion equations
\begin{subequations}\label{eq1}
\begin{align}
\textbf{\emph{Tu}}:=& -\textbf{\emph{E}}\textbf{\emph{u}}^{''}+\textbf{\emph{A}}\textbf{\emph{u}}^{'}+\textbf{\emph{f}}(x,\textbf{\emph{u}})=\textbf{\emph{0}}, ~~~~~ x\in\Om=(0,1)\nonumber\\
\textbf{\emph{u}}(0)=&\textbf{\emph{p}}_{0}, ~~~~~~ \textbf{\emph{u}}(1)=\textbf{\emph{p}}_{1},\nonumber
\end{align}
\end{subequations}
where $\textbf{\emph{E}}=\mbox{diag}(\e,\ldots,\e)$ with $0<\e\ll1$ (is a small positive perturbation parameter),
$\textbf{\emph{A}}=\mbox{diag}(a_{11},\ldots,a_{MM})$,
$\textbf{\emph{f}}(x,\textbf{\emph{u}})=(f_{1}(x,\textbf{\emph{u}}),\ldots,f_{M}(x,\textbf{\emph{u}}))^{T}$, and $\textbf{\emph{u}}=(u_{1},\ldots,u_{M})^{T}$.\vspace{.5cm}\\
\no We propose a high order numerical method for solving the system of $M(\geq2)$ singularly perturbed semilinear convection-diffusion equations. To solve this system numerically we design an overlapping domain decomposition algorithm. The uniform convergence of the algorithm is analyzed in two steps i.e. the discretization error and the iteration error. It is shown that the algorithm gives almost second order parameter-uniform
convergence. Numerical results are presented to demonstrate the efficiency of the method.

Generation of compressed stationary fields with circulant embedding

ABSTRACT. Random fields are important in many applications. They can represent varying properties of biological or physical systems and serve as inputs for uncertainty quantification (UQ) studies. Therefore, an approach for efficient generation of stochastic realisations is needed. Stationary Gaussian random fields can be obtained using the method of circulant embedding. The technique represents a covariance matrix by a block circulant with circulant blocks structure which enables eigendecomposition with Fourier transforms. However, due the resulting conjugate-even set of eigenvalues, finding a low-rank approximation of the matrix is not straightforward. We will present an optimal strategy allowing for rank control of random fields generated with circulant embedding. The proposed method is an extension of previously published algorithm for obtaining real-valued, circulant approximations. In addition, we will discuss the application of rank reduction in the context of UQ with multi-level Monte Carlo.

Condensed Nonconforming Reformulations for Preconditioning Finite Element Discretization Problems

ABSTRACT. We present an approach for building optimal preconditioners for algebraic linear systems coming from conforming finite element discretizations of partial differential equations. The preconditioners utilize the framework of auxiliary or fictitious space methods. Namely, the main idea is to reformulate the original problem in a nonconforming setting, using discontinuous elements across subdomains, obtaining the element-by-element assembly property. This is achieved by introducing additional unknowns associated with the interfaces between subdomains, which decouples the subdomains. The continuity is enforced weakly via interface penalty terms. The resulting nonconforming formulation (or a Schur complement (reduced) version of it, that "condenses" the formulation only to the interfaces) is used to precondition the original problem. The element-by-element assembly property can be useful in "matrix-free" computations for high order discretizations, since it minimizes the coupling across subdomains and interfaces. Also, this provides a natural setting to apply element-based algebraic multigrid (AMGe) techniques for solving the nonconforming formulation. The resulting decoupling and locality can be beneficial for parallel computing. The approach allows for directly obtaining coarse nonconforming formulations, or their reduced (Schur complement) versions. This largely reduces the cost of the "condensation" step.

New mixed methods for incompressible hyperelasticiy in terms of Kirchhoff stress

ABSTRACT. Using the three-field formulation for nearly incompressible hyperelasticity introduced in [Chavan, Lamichhane, Wohlmuth, Comput. Methods Appl. Mech. Engrg. (2007), 196:4075-4086] we define a similar form valid for the fully incompressible case. We define a mixed finite element scheme and verify theoretical rates of convergence through computational tests. We also propose a new augmented Lagrangian preconditioner that improves convergence properties of iterative solvers. A few benchmark solutions are computed, and we test the formulation in models of cardiac biomechanics.

An Adaptive Newton Div-Curl Least-Squares Finite Element Method for the Cahn-Hilliard Equation

ABSTRACT. In this talk we present recent results on a new least-squares finite element approach for the Cahn-Hilliard equation for binary phase separation. As a fourth-order evolution equation derived from the gradient flow of an energy functional, the Cahn-Hilliard equation remains a computationally challenging nonlinear problem which is fundamental in a wide range of applications. We discretize in time with both implicit backward Euler and Crank-Nicolson, and in space we use a least-squares reformulation and an outer Newton-like linearization at the continuous level to define a well-posed sequence of div-curl systems. Numerical results show promising results: quadratic convergence of the nonlinear iteration in a small number of steps, and optimal discretization convergence rates with respect to both time and space. Adaptivity in time and space, mass conservation, and energy dissipation are also detailed for a range of test problems.

Adaptive Space-Time CFOSLS for Evolution Equations

ABSTRACT. We present adaptive space-time finite element schemes based on constrained first order system least squares (CFOSLS) formulations using the least squares functional restricted to individual elements as indicators for adaptive refinements. The marked elements are then refined using bisection. We consider both parabolic and hyperbolic evolution problems, focusing on the latter, in particular on the transport equation. We performed an extensive set of numerical experiments in parallel, and in both 3D and 4D space-time, to illustrate the performance of the method.

This work is performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.