Program for Thursday, May 25th

PROGRAM FOR THURSDAY, MAY 25TH

Chair:

09:00

Keynote: Performance Portability for Next-Generation Heterogeneous Systems

ABSTRACT. There is a huge diversity in the processors used to power the leading supercomputers. Despite their differences in how they need to be programmed, these processors lie on a spectrum of design. GPU-accelerated systems are optimised for throughput calculations providing high memory bandwidth; CPUs provide deep and complex cache hierarchies to improve memory latency; and both use vector units to bolster compute performance. Competitive processors are available from a multitude of vendors, with each becoming more heterogeneous with every generation. This gives us as a HPC community a choice, but how do we write our applications to make the most of this opportunity?

Our high-performance applications must be written to embrace the full ecosystem of supercomputer design. They need to take advantage of the hierarchy of concurrency on offer, and utilise the whole processor. And writing these applications must be productive because HPC software outlives any one system. Our applications need to address the “Three Ps” and be Performance Portable and Productive.

This talk will highlight the opportunities this variety of heterogeneous architectures brings to applications, and how application performance and portability can be rigorously measured and compared across diverse architectures. It will share a strategy for writing performance portable applications and present the roles that ISO languages C++ and Fortran, as well as parallel programming models and abstractions such as OpenMP, SYCL and Kokkos play in the ever changing heterogeneous landscape.

10:00-11:00 Session 2: Technical Presentations

Chair:

Glenn Brook

10:00

John D. McCalpin

Bandwidth Limits in the Intel Xeon Max (Sapphire Rapids with HBM) Processors

ABSTRACT. The HBM memory of Intel Xeon Max processors provides significantly higher sustained memory bandwidth than their DDR5 memory, with corre-sponding increases in the performance of bandwidth-sensitive applications. However, the increase in sustained memory bandwidth is much smaller than the increase in peak memory bandwidth. Using custom microbench-marks (instrumented with hardware performance counters) and analytical modeling, the primary bandwidth limiter is shown to be insufficient memory concurrency. Secondary bandwidth limitations due to non-uniform loading of the two-dimensional on-chip mesh interconnect are shown to arise not far behind the primary limiters.

10:30

Dean Chester, Aruna Ramanan and Mark Atkins

Modelling Next Generation High Performance Computing Fabrics

ABSTRACT. Simulation provides insight in to physical phenomenon that could otherwise not be understood. It allows detailed analysis to take place that may not be able to captured in a significant level of detail from the physical world. This insight can be used to build better products and design better features used inside of next generation low latency fabrics. This extended abstract looks at validation and development of a simulator to support the development of next generation fabrics.

11:00-11:30Coffee Break

11:30-13:00 Session 3: Technical Presentations

Chair:

Thomas Steinke

11:30	Michael Hennecke, Johann Lombardi, Jeff Olivier, Tom Nabarro, Liang Zhen, Yawei Niu, Shilong Wang and Xuezhao Liu DAOS beyond PMem: Architecture and Initial Performance Results ABSTRACT. The Distributed Asynchronous Object Storage (DAOS) is an open source scale-out storage system that is designed from the ground up to support Storage Class Memory (SCM) and NVMe storage in user space. Until now, the DAOS storage stack has been based on Intel Optane Persistent Memory (PMem) and the Persistent Memory Development Kit (PMDK). With the discontinuation of Optane PMem, and no persistent CXL.mem devices in the market yet, DAOS continues to support PMem-based servers but now also supports server configurations where its Versioning Object Store (VOS) is held in DRAM. In this case, the VOS data structures are persisted through a synchronous Write-Ahead-Log (WAL) combined with asynchronous checkpointing to NVMe SSDs. This paper describes the new non-PMem DAOS architecture, and reports first performance results based on a DAOS 2.4 technology preview.
12:00	Yehonatan Fridman, Guy Tamir and Gal Oren Portability and Scalability of OpenMP Offloading on State-of-the-art Accelerators ABSTRACT. Over the last decade, most of the increase in computing power has been gained by advances in accelerated many-core architectures, mainly in the form of GPGPUs. While accelerators achieve phenomenal performances in various computing tasks, their utilization requires code adaptations and transformations. Thus, OpenMP, the most common standard for multi-threading in scientific computing applications, introduced offloading capabilities between host (CPUs) and accelerators since v4.0, with increasing support in the successive v4.5, v5.0, v5.1, and the latest v5.2 versions. Recently, two state-of-the-art GPUs -- the Intel Ponte Vecchio Max 1100 and the NVIDIA A100 GPUs -- were released to the market, with the oneAPI and GNU LLVM-backed compilation for offloading, correspondingly. In this work, we present early performance results of OpenMP offloading capabilities to these devices while specifically analyzing the portability of advanced directives (using SOLLVE's OMPVV test suite) and the scalability of the hardware in representative scientific mini-app (the LULESH benchmark). Our results show that the vast majority of the offloading directives in v4.5 and 5.0 are supported in the latest oneAPI and GNU compilers; however, the support in v5.1 and v5.2 is still lacking. From the performance perspective, we found that the PVC1100 and A100 are relatively comparable on the LULESH benchmark. While the A100 is slightly better due to faster memory bandwidth, the PVC1100 reaches the next problem size 400^3 scalably due to the larger memory size.
12:30	Sai Prabhakar Rao Chenna, Nalini Kumar, Leonardo Borges, Michael Steyer, Philippe Thierry and Maria Garzaran Enabling Multi-level Network Modeling in Structural Simulation Toolkit for Next-Generation HPC Network Design Space Exploration ABSTRACT. The last decade has seen high-performance computing (HPC) systems become denser and denser. Higher node and rack density has led to development of multi-level networks - at socket, node, 'pod', rack, and between nodes. As sockets become more complex with integrated or co-packaged heterogeneous architectures, this network complexity is going to increase. In this paper, we extend Structural Simulation Toolkit (SST) to model these multi-level networks designs. We demonstrate this newly introduced capability by modeling a combination of a few different network topologies at different levels of the system and simulating the performance of collectives and some popular HPC communication patterns.

13:00-14:00Lunch Break

14:00-15:00 Session 4: Keynote 2

Chair:

David Martin

14:00

Joseph Curley

Keynote: Building a Productive, Open, Accelerated Programming Model for Now and the Future

ABSTRACT. The needs for energy-efficient computing to solve problems of exascale class computing, emerging distributed artificial intelligence, and intelligent devices at the edge among others have driven what was referred to in 2019 as a “A New Golden Age for Computer Architecture.”¹ The growth of diverse CPU, GPU, and other accelerator architectures comes with complexity in programming from predicted domain-specific languages. In practice, the per device programming models of novel architectures present challenges developing, deploying, and maintaining software. This talk will cover Intel’s efforts to date to enable multi-architecture, multi-vendor accelerated programming and progress to date. Additionally, this session will discuss research directions to make accelerated computing increasingly productive.

¹ J.L. Hennessy and D.A. Patterson: “A New Golden Age for Computer Architecture” in Communications of the ACM 62.2 (2019), pp. 48-60)

15:00-16:00 Session 5: Technical Presentations

Chair:

David Martin

15:00

Yinzhi Wang, John D. McCalpin, Junjie Li, Matthew Cawood, John Cazes, Hanning Chen, Lars Koesterke, Hang Liu, Chun-Yaung Lu, Robert McLay, Kent Milfield, Amit Ruhela, Dave Semeraro and Wenyang Zhang

Application Performance Analysis: a Report on the Impact of Memory Bandwidth

PRESENTER: John D. McCalpin

ABSTRACT. As High-Performance Computing (HPC) applications involving massive data sets, including large-scale simulations, data analytics, and machine learning, continue to grow in importance, memory bandwidth has emerged as a critical performance factor in contemporary HPC systems. The rapidly escalating memory performance requirements, which traditional DRAM memories often fail to satisfy, necessitate the use of High-Bandwidth Memory (HBM), which offers high bandwidth, low power consumption, and high integration capacity, making it a promising solution for next-generation platforms. However, despite the notable increase in memory bandwidth on modern systems, no prior work has comprehensively assessed the memory bandwidth requirements of a diverse set of HPC applications and provided sufficient justification for the cost of HBM with potential performance gain. This work presents a performance analysis of a diverse range of scientific applications as well as standard benchmarks on platforms with varying memory bandwidth. The study shows that while the performance improvement of scientific applications varies quite a bit, some applications in CFD, Earth Science, and Physics show significant performance gains with HBM. Furthermore, a cost-effectiveness analysis suggests that the applications exhibiting at least a 30% speedup on the HBM platform would justify the additional cost of the HBM.

15:30

John Swinburne, Robert Bollig and James Erwin

Omni-Path Express (OPX) Libfabric Provider Performance Evaluation

ABSTRACT. The introduction of the Omni-Path Express (OPX) Libfabrics provider by Cornelis Networks delivers improved performance and capabilities for the current generation of Omni-Path 100 fabric as well as setting the software foundation for future generations of Omni-Path fabric. This session will study the performance characteristics of the OPX provider using MPI microbenchmarks and make competitive comparisons using application workloads.

16:00-16:30Coffee Break

16:30-17:00 Session 6: Technical Presentations

Chair:

Amit Ruhela

16:30

Samuel Bernardo, Amit Ruhela, John Cazes and Stephen Harrell

Apache Spark performance optimization over Frontera HPC cluster

ABSTRACT. Apache Spark is a very popular computing engine that allows you to distribute the computing task among a computing cluster. This paper briefly describes the Frontera supercomputer, summarizes the Apache Spark components and deployment methods and provides an optimized proposal to run Apache Spark jobs. To complete this experiment, the simulation results using different sizes of samples allows us to understand the scalability and performance of the proposed implementation.

17:00-18:00 Session 7: Keynote 3

Chair:

Amit Ruhela

17:00

Taisuke Boku

Keynote: Next-Gen Acceleration with Multi-Hybrid Devices – Is GPU Enough?

ABSTRACT. Accelerated computing with GPU is now the main player on HPC and AI with its absolute performance in FLOPS with multiple precision and high bandwidth of memory. It covers major application fields to fit with its performance characteristics, however, it is not perfect for some class of computation with complicated construction.

We think one of the next solutions is multi-hybrid acceleration where different kind of accelerators such as GPU and FPGA by mutual compensation with each other. We are running a conceptual PC cluster named Cygnus under the concept of CHARM (Cooperative Heterogeneous Acceleration with Reconfigurable Multidevices). In this talk, I will present from the concept to the real application showing the next generation of ideal accelerated computing.