IXPUG ANNUAL CONFERENCE 2021
PROGRAM FOR TUESDAY, OCTOBER 19TH


10:00-12:15 Session 1: Welcome to IXPUG 2021
10:00
Opening Remarks
10:05
Welcome to TACC
10:20
Computational Modeling of Nanomaterials for Biomedical Applications
11:20
Unified Memory API for heterogeneous memory

ABSTRACT. The amount of data that modern server workloads must process approximately doubles every 24 months. To address this growing demand, the memory subsystem of modern server platforms is becoming heterogeneous: Intel Optane Persistent Memory closes the capacity gap, High Bandwidth Memory addresses throughput needs, and possible memory expansion via CXL interconnects increases heterogeneity further. These emerging memory technologies require software adaptation to fully leverage the new hardware capabilities. Today, application developers must handle every possible memory configuration (e.g., DRAM+PMEM, HBM+DRAM). This talk presents a high-level Memory API for heterogeneous memory that provides a unified application view across memory configurations. As a case study, we prototyped heterogeneous memory support in a real-world application, OmniSciDB (a database for data analytics). With a DRAM+PMEM configuration, we demonstrate the performance advantages of explicit data placement via the proposed Memory API. We also demonstrate that no additional changes on the application side are required to support HBM+DRAM or even HBM+DRAM+PMEM configurations.
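
For a flavor of what explicit, tier-aware placement looks like today, below is a minimal C++ sketch using the open-source memkind library (a real heterogeneous-memory library, not the unified Memory API this talk proposes); the choose_tier policy helper is a hypothetical illustration.

```cpp
// Minimal sketch of explicit data placement with the memkind library
// (https://github.com/memkind/memkind). Build with: g++ tier.cpp -lmemkind
// This is NOT the talk's proposed unified Memory API; it only shows the
// kind of tier-aware allocation such an API would abstract away.
#include <memkind.h>
#include <cstdio>
#include <cstring>

// Hypothetical policy: hot, bandwidth-bound data goes to HBM if present;
// cold, capacity-bound data goes to PMEM (KMEM DAX) if present; otherwise
// everything falls back to plain DRAM.
static memkind_t choose_tier(bool bandwidth_bound) {
    if (bandwidth_bound && memkind_check_available(MEMKIND_HBW) == 0)
        return MEMKIND_HBW;
    if (!bandwidth_bound && memkind_check_available(MEMKIND_DAX_KMEM) == 0)
        return MEMKIND_DAX_KMEM;
    return MEMKIND_DEFAULT;  // DRAM
}

int main() {
    memkind_t kind = choose_tier(/*bandwidth_bound=*/true);
    double *buf = static_cast<double*>(memkind_malloc(kind, 1 << 20));
    if (!buf) { std::fprintf(stderr, "allocation failed\n"); return 1; }
    std::memset(buf, 0, 1 << 20);
    memkind_free(kind, buf);  // free with the kind the buffer came from
    return 0;
}
```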

11:50
Tensor Computing with High Productivity, Performance, and Portability across Intel Spatial and Vector Architectures

ABSTRACT. The end of Moore's Law and the rapid development of AI have led to a boom in accelerators for tensor computation. Spatial architectures like FPGAs and vector architectures like GPUs are commonly used for building such accelerators. FPGAs are notoriously difficult and time-consuming to program for performance, and performance must be engineered separately for each important compute pattern on each architecture.

We propose a new programming methodology, T2S (Temporal to Spatial), to make FPGAs easier to program. A programmer specifies a temporal definition and a spatial mapping: the temporal definition defines the functionality to compute, while the spatial mapping defines how to decompose that functionality and map the decomposed pieces onto a spatial architecture. The specification precisely controls a compiler, which implements the spatial mapping, including many generic, strategic loop and data optimizations. Consequently, high performance can be expected with substantially higher productivity.
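
To make the separation concrete, here is a plain C++ sketch of the idea with hypothetical names; the real T2S specification is a compiler-directed language, not ordinary C++. The temporal definition states only what each output point is, and the spatial mapping decides how the iteration space is decomposed across (here, simulated) processing elements.

```cpp
// Hypothetical illustration of T2S's separation of concerns, NOT its syntax.
#include <cstddef>
#include <vector>

// Temporal definition: WHAT to compute. A pure point function,
// C[i][j] = sum_k A[i][k] * B[k][j], with no mapping decisions.
float point(const std::vector<std::vector<float>>& A,
            const std::vector<std::vector<float>>& B,
            std::size_t i, std::size_t j) {
    float acc = 0.f;
    for (std::size_t k = 0; k < B.size(); ++k) acc += A[i][k] * B[k][j];
    return acc;
}

// Spatial mapping: HOW to decompose. Tile the output into TI x TJ blocks,
// one block per (virtual) processing element. On an FPGA a compiler would
// materialize these PEs in hardware; here they are just loops.
constexpr std::size_t TI = 4, TJ = 4;

void mapped_gemm(const std::vector<std::vector<float>>& A,
                 const std::vector<std::vector<float>>& B,
                 std::vector<std::vector<float>>& C) {
    const std::size_t M = C.size(), N = C[0].size();
    for (std::size_t ti = 0; ti < M; ti += TI)        // PE rows
        for (std::size_t tj = 0; tj < N; tj += TJ)    // PE columns
            for (std::size_t i = ti; i < ti + TI && i < M; ++i)
                for (std::size_t j = tj; j < tj + TJ && j < N; ++j)
                    C[i][j] = point(A, B, i, j);
}

int main() {
    std::vector<std::vector<float>> A{{1, 2}, {3, 4}};
    std::vector<std::vector<float>> B{{5, 6}, {7, 8}};
    std::vector<std::vector<float>> C(2, std::vector<float>(2, 0.f));
    mapped_gemm(A, B, C);  // C == {{19, 22}, {43, 50}}
    return 0;
}
```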

We further extend the methodology to heterogeneous systems including both FPGAs and GPUs. This brings two big advances. First, building high-performance user-managed caches for tensors is usually a big hurdle for average software programmers. We make this easy by allowing programmers to specify a data flow across an abstract memory system. Second, unlike CUDA and OpenCL, our programming model explicitly combines SIMT and SIMD, and realizes SIMD (multi-dimensional vectorization) through architecture-agnostic multi-dimensional systolic arrays. The compiler generates optimized hardware/software systolic arrays on FPGAs/GPUs. Explicit SIMD gives programmers convenient control over the use of critical hardware resources (e.g., DSPs on FPGAs and SIMD units on GPUs).
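
The following self-contained program is a cycle-level software model of a small output-stationary systolic array computing C = A x B. It only illustrates the systolic-array concept the abstract refers to; it is neither the hardware T2S generates nor T2S's specification syntax.

```cpp
// Software model of an N x N output-stationary systolic array (concept demo).
#include <cstdio>

constexpr int N = 3;  // an N x N grid of processing elements (PEs)

int main() {
    const int A[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
    const int B[N][N] = {{9,8,7},{6,5,4},{3,2,1}};
    int C[N][N] = {};                        // PE(i,j) keeps C[i][j] stationary
    int a_reg[N][N] = {}, b_reg[N][N] = {};  // operand registers inside each PE

    // Operands enter skewed: row i of A is delayed i cycles at the left edge,
    // column j of B is delayed j cycles at the top edge. After 3N-2 cycles the
    // last products have drained through the array.
    for (int t = 0; t < 3 * N - 2; ++t) {
        // Sweep from the bottom-right so each PE still reads its left/upper
        // neighbour's register value from the PREVIOUS cycle.
        for (int i = N - 1; i >= 0; --i) {
            for (int j = N - 1; j >= 0; --j) {
                int k = t - i - j;  // index of the product this PE sees now
                int a_in = (j == 0) ? ((k >= 0 && k < N) ? A[i][k] : 0)
                                    : a_reg[i][j - 1];   // from left neighbour
                int b_in = (i == 0) ? ((k >= 0 && k < N) ? B[k][j] : 0)
                                    : b_reg[i - 1][j];   // from upper neighbour
                C[i][j] += a_in * b_in;  // multiply-accumulate
                a_reg[i][j] = a_in;      // forward a rightward, b downward
                b_reg[i][j] = b_in;
            }
        }
    }

    for (int i = 0; i < N; ++i) {            // print the product matrix
        for (int j = 0; j < N; ++j) std::printf("%5d", C[i][j]);
        std::printf("\n");
    }
    return 0;
}
```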

We have prototyped our methodology for Intel FPGAs and GPUs. We implemented several high-profile kernels that are expressed in tensors and have different compute patterns (SGEMM, 2-D convolution, Capsule convolution, PairHMM, and LU matrix decomposition) across two generations of FPGAs (Arria 10 and Stratix 10) and three generations of GPUs (GEN 9.5, GEN 12, and a recent research GPU), and compared them with state-of-the-art ninja implementations and machine peaks, where available. Our designs provide throughput on par with or better than the state-of-the-art comparison points, achieve close to 100% DSP and PE efficiency on both FPGAs, and reach 75%-91% of theoretical peak throughput on all GPUs, for all kernels where such measurements are applicable. Nearly full DSP/PE efficiency on an FPGA indicates that the compute engine we generate is kept busy with useful operations, without pipeline stalls (due to, e.g., slow memory operations). This performance is achieved with a very compact, high-level code specification: each kernel is only a few tens of lines of code and can be written in under half an hour. These results confirm that our methodology can achieve high performance and programmer productivity on a range of tensor operations, and that performance can be made largely portable across spatial and vector architectures.

12:45-14:15 Session 2: Performance Portability
12:45
Balancing Performance, Portability, and Productivity with oneAPI
PRESENTER: Jason Sewall
13:45
Taskflow: A General-purpose Parallel and Heterogeneous Task Programming System Using Modern C++

ABSTRACT. The Taskflow project addresses the long-standing question: "How can we make it easier for developers to write parallel and heterogeneous programs with high performance and simultaneous high productivity?" Modern scientific computing relies on a heterogeneous mix of computational patterns, domain algorithms, and specialized hardware to achieve key scientific milestones that go beyond traditional capabilities. However, programming these applications often requires complex expert-level tools and a deep understanding of software methodologies. In particular, the lack of a suitable software environment that can overcome the complexity of programming large parallel and heterogeneous systems has been a significant barrier for many organizations pursuing transformational discoveries.

Taskflow develops a simple and powerful task programming model to enable efficient implementations of heterogeneous decomposition strategies. Our programming model empowers users with both static and dynamic task graph construction to incorporate a broad range of computational patterns, including hybrid CPU-GPU computing, dynamic control flow, and irregularity. We develop an efficient heterogeneous work-stealing strategy that adapts worker threads to the available task parallelism at any time during graph execution. We have demonstrated promising performance of Taskflow on both micro-benchmarks and real-world applications. As an example, we solved a large machine learning workload up to 1.5x faster, with 1.6x less memory and 1.7x fewer lines of code than two industrial-strength systems, oneTBB and StarPU, on a machine with 40 CPUs and 4 GPUs.
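
For readers unfamiliar with Taskflow, the canonical static-graph example from the project's documented public API (header-only, C++17) looks roughly like this:

```cpp
// Minimal Taskflow example based on the public API at
// https://taskflow.github.io/ (header-only, requires C++17).
#include <taskflow/taskflow.hpp>
#include <cstdio>

int main() {
    tf::Executor executor;    // manages worker threads and work stealing
    tf::Taskflow taskflow;    // holds the static task graph

    // Build a small graph: A runs first, then B and C in parallel, then D.
    auto [A, B, C, D] = taskflow.emplace(
        [](){ std::printf("A\n"); },
        [](){ std::printf("B\n"); },
        [](){ std::printf("C\n"); },
        [](){ std::printf("D\n"); }
    );
    A.precede(B, C);  // A runs before B and C
    D.succeed(B, C);  // D runs after both B and C

    executor.run(taskflow).wait();  // submit the graph, block until done
    return 0;
}
```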

This talk will cover three aspects: (1) the heterogeneous task programming model using modern C++, (2) an efficient work-stealing strategy generalizable to arbitrary heterogeneous domains, and (3) the user experience we have gathered and a suggested roadmap for the HPC community in the face of future heterogeneity.

The Taskflow project is available at https://taskflow.github.io/

14:30-16:00 Session 3: Tutorial
14:30
Performance modelling of future GPUs with Offload Advisor

ABSTRACT. Offload Advisor is a GPU performance modelling tool with a particular focus on the HPC domain. The tool analyses the performance of a given workload on a baseline device, CPU or GPU, and estimates the workload's performance on target high-end GPUs. The tutorial will include:
- a demonstration of modelling for an application run on an integrated GPU, with estimates provided for future GPUs;
- bottlenecks for the current (measured) and future (modelled) GPUs, along with automatic Performance Improvement Recommendations;
- a demonstration of the GPU Roofline.
The tutorial presenters have solid experience presenting at HPC events (e.g., SC tutorials), at multi-day virtual or on-site customer trainings, and at various Intel-organized events, including multiple IXPUG conferences.