Program for Monday, August 30th

PROGRAM FOR MONDAY, AUGUST 30TH

Days:

next day

all days

View: session overview talk overview

09:00-10:30 Session 1A: PhD Symposium 1

Chairs:

Aleksandar Ilic and Didem Unat

Location: PhD Symposium Room

09:00	Alena Shilova Memory Efficient Deep Neural Network Training ABSTRACT. Recently Artificial Intelligence(AI) has demonstrated a huge progress in solving complex problems such as image classification, text generation, translation... Its success is due to a development of hardware and algorithms making possible the emergence of Deep Neural Networks (DNNs). Such DNNs are composed of a number of operations from numerical linear algebra, whose order is defined with a Directed Acyclic Graph (DAG). These DNNs operate at the limit of the computational resources, and to go to deeper and more complex neural networks it is necessary either to design more powerful computational resources or optimize their usage. In our work, we address in particular the problem of lowering memory usage during the training of DNNs.
09:18	Pawel Bratek, Lukasz Szustak and Jaroslaw Zola Parallelization and auto-scheduling of data access queries in ML workloads PRESENTER: Pawel Bratek ABSTRACT. We propose an auto-scheduling mechanism to execute counting queries in machine learning applications. Our approach improves the runtime efficiency of query streams by selecting, in the on-line manner, the optimal execution strategy for each query. We also discuss how to scale up counting queries in multi-threaded applications.
09:36	Manasi Tiwari and Sathish Vadhiyar Communication overlapping Pipelined Conjugate Gradients for Distributed Memory Systems and Heterogeneous Architectures PRESENTER: Manasi Tiwari ABSTRACT. Preconditioned Conjugate Gradient (PCG) method has been one of the widely used methods for solving linear systems of equations for sparse problems. Pipelined PCG (PIPECG) attempts to eliminate the dependencies in the computations in the PCG algorithm and overlap non-dependent computations by reorganizing the traditional PCG code and using non-blocking allreduces. We have developed a novel pipelined PCG algorithm called PIPECG-OATI (One Allreduce per Two Iterations) which reduces the number of non-blocking allreduces to one per two iterations and provides large overlap of global communication and computations at higher number of cores in distributed memory CPU systems. PIPECG-OATI gives up to 3x speedup over PCG and 1.73x speedup over PIPECG at large number of cores. For GPU accelerated heterogeneous architectures, we have developed three methods for efficient execution of the PIPECG algorithm. These methods achieve task and data parallelism. Our methods give considerable performance improvements over PCG CPU and GPU implementations of Paralution and PETSc libraries.
09:54	Raju Ram, Daniel Grünewald and Nicolas R Gauger Scalable hybrid parallel ILU preconditioner to solve sparse linear systems PRESENTER: Raju Ram ABSTRACT. Incomplete LU(ILU) preconditioners are widely used to improve the convergence of general-purpose large sparse linear systems in computational simulations because of their robustness, accuracy, and usability as a black-box preconditioner. However, the ILU factorization and the subsequent triangular solve are sequential for sparse matrices in their original form. Multilevel nested dissection (MLND) ordering can resolve that issue and expose some parallelism. This work investigates the parallel efficiency of a hybrid parallel ILU preconditioner that combines a restricted additive Schwarz (RAS) method on the process level with a shared memory parallel MLND Crout ILU method on the core level. We employ the GASPI programming model to efficiently implement the data exchange on the process level. We show the scalability results of our approach for the convection-diffusion problem.
10:12	Muhammad Aditya Sasongko, Didem Unat and Milind Chabbi Low-Overhead Reuse Distance Profiling Tool for Multicore PRESENTER: Muhammad Aditya Sasongko ABSTRACT. With the increase in core count in multicore systems, data movement is one of the main sources of performance slowdown in parallel applications and data locality has become a critical factor in application optimization. One of the important locality metrics is reuse distance,which shows the likelihood of a memory access to be a cache hit. In this work, we propose a low-overhead reuse distance profiling tool for multi-threaded applications. Our method relies on available hardware features in commodity CPUs, namely, Performance Monitoring Units (PMUs) and debug registers, to detect data reuse in private and shared caches by considering inter-thread cache line invalidations. Unlike prior approaches,our tool is fast, accurate, does not change the program behavior and can also handle shared cache accesses. Though it has low runtime (2.9×) and memory overheads (2.8×), our tool achieves 92% accuracy.

09:00-10:30 Session 1B: ParaMo 2021 - 3rd International Workshop on Parallel Programming Models in High-Performance Cloud

Chairs:

Hyun-Wook Jin and Sangyoon Oh

Location: ParaMo Workshop Room

09:00

Chuck Yoo

Network SLO for High Performance Clouds

ABSTRACT. Network is a critical resource in high performance clouds. Yet, most resource provisioning algorithms deal with CPU and memory but not network. So network performance is managed by a separate mechanism such as tc. Also, it is known that network SLO (service level objective) is hard to achieve. As more distributed and parallel applications run on clouds, the importance of satisfying the network SLO cannot be over-emphasized. This talk presents why it is hard to achieve the network SLO and how to minimize the variance in delivering the network SLO, which demonstrates the need for the integrated scheduling of CPU and network.

09:50

Benjamin J. J. Pfister, Wolf S. Lickefett, Jan Nitschke, Sumit Paul, Morgan K. Geldenhuys, Dominik Scheinert, Kordian Gontarska and Lauritz Thamsen

Rafiki: Task-level Capacity Planning in Distributed Stream Processing Systems

PRESENTER: Benjamin J. J. Pfister

ABSTRACT. Distributed Stream Processing is a valuable paradigm for reliably processing vast amounts of data at high throughput rates with low end-to-end latencies. Most systems of this type offer a fine-grained level of control to parallelize the computation of individual tasks within a streaming job. Adjusting the parallelism of tasks has a direct impact on the overall level of throughput a job can provide as well as the amount of resources required to provide an adequate level of service. However, finding optimal parallelism configurations that fall within the expected Quality of Service requirements is no small feat to accomplish.

In this paper we present Rafiki, an approach to automatically determine optimal parallelism configurations for Distributed Stream Processing jobs. Here we conduct a number of proactive profiling runs to gather information about the processing capacities of individual tasks, thereby making the selection of specific utilization targets possible. Understanding the capacity information enables users to adequately provision resources so that streaming jobs can deliver the desired level of service at a reduced operational cost with predictable recovery times. We implemented Rafiki prototypically together with Apache Flink where we demonstrate its usefulness experimentally.

09:00-10:30 Session 1C: AMTE 2021 - Asynchronous Many-Task systems for Exascale 2021

Chairs:

Irina Demeshko and Patrick Diehl

Location: AMTE Workshop Room

09:00

Thomas Fahringer

On using modern C++ and nested recursive task parallelism for HPC applications with AllScale

ABSTRACT. Contemporary parallel programming approaches often rely on well-established parallel libraries and language extensions to address specific HW resources that can lead to mixed parallel programming paradigms. In contrast to these approaches, AllScale proposes a C++ template-based approach to ease the development of scalable and efficient general-purpose parallel applications. Applications utilize a pool of parallel primitives and data structures for building solutions to their domain-specific problems. HPC experts who provision high-level, generic operators and data structures for common use cases design these parallel primitives. The supported set of constructs may range from ordinary parallel loops, over stencil and distributed graph operations, and frequently utilized data structures including (adaptive) multidimensional grids, trees, and irregular meshes, to combinations of data structures and operations like entire linear algebra libraries. This set of parallel primitives is implemented using pure C++ and may be freely extended by third-party developers, similar to conventional libraries in C++ development projects. One of the peculiarities of AllScale is its main source of parallelism based on nested recursive task parallelism. Sophisticated compiler analysis determines the data needed for every task, which is of paramount importance to achieve performance across various parallel architectures. Experimental results for several applications implemented with AllScale will be shown.

09:50

Nitin Bhat, Sam White and Laxmikant V Kale

Enabling support for zero copy semantics in an Asynchronous Task-based Programming Model

PRESENTER: Sam White

ABSTRACT. Communication is critical to the scalable and efficient performance of scientific simulations on extreme scale computing systems. Part of the promise of task-based programming models is that they can naturally overlap communication with computation and exploit locality between tasks. Copy-based semantics using eager communication protocols easily enable such asynchrony by alleviating the responsibility of buffer management from the user, both on the sender and the receiver. However, these semantics increase memory allocations and copies and in turn affect application memory footprint and performance, especially with large message buffers. In this work we describe how the so-called "zero copy'' messaging semantics can be supported in Converse, the message-driven parallel programming framework that is used by Charm++, by implementing support for user-owned buffer transfers in its lower level runtime system, LRTS. These semantics work on user-provided buffers and do not semantically require copies by either the user or the runtime system. We motivate our work by reviewing the existing messaging model in Converse/Charm++, identify its semantic shortcomings, and define new LRTS and Converse APIs to support zero copy communication based on RDMA capabilities. We demonstrate the utility of our new communication interfaces with benchmarks written in Converse. The result is up to 91% of message latency improvement and improved memory usage. These advances will enable future work on user-facing APIs in Charm++.

10:30-10:50Coffee Break

10:50-12:30 Session 2A: PhD Symposium 2

Chairs:

Aleksandar Ilic and Didem Unat

Location: PhD Symposium Room

10:50	Philippe Swartvagher Interferences between Communications and Computations in Distributed HPC Systems ABSTRACT. Overlapping communications with computations in distributed applications should increase their performances and allow to reach better scalability. This implies, by construction, communications are executed in parallel of computations. In this work, we explore the impact of computations on communication performances and vice-versa, with a focus on the role of memory contention. One main observation is that highly memory-bound computations can have a severe impact on network bandwidth.
11:08	Chenle Yu, Sara Royuela and Eduardo Quiñones A Low Overhead Tasking Model for OpenMP PRESENTER: Chenle Yu ABSTRACT. OpenMP is a parallel programming model widely used on shared-memory systems. Over the years, the OpenMP community tries to extend the OpenMP Specification to adapt it to modern architectures and expand its usage to other domains such as Embedded Systems. Our work focuses on improving the OpenMP tasking model by reducing the task runtime overhead. To do so, we propose a new OpenMP framework, namely, taskgraph, based on the concept of Task Dependency Graph, where nodes are OpenMP tasks and edges describe the dependencies among them. The new framework is shown to be particularly suitable for fine-grain parallelism. It can be extended to other programming models with ease, improving the interoperability of OpenMP with different programming models, such as CUDA.
11:26	Daniel Alberto Torres Gonzalez, Camille Coti and Laure Petrucci Application-Based Fault Tolerance for Numerical Linear Algebra at Large Scale PRESENTER: Daniel Alberto Torres Gonzalez ABSTRACT. Large scale architectures provide us with high computing power, but as the size of the systems grows, computation units are more likely to fail. Fault-tolerant mechanisms have arisen in parallel computing to face the challenge of dealing with all possible errors that may occur at any moment during the execution of parallel programs. Algorithms used by fault-tolerant programs must scale and be resilient to software/hardware failures. Recent parallel algorithms have demonstrated properties that can be exploited to make them fault-tolerant. In my thesis, I design, implement and evaluate parallel and distributed fault-tolerant numerical computation kernels for dense linear algebra. I take advantage of intrinsic algebraic and algorithmic properties of communication-avoiding algorithms in order to make them fault-tolerant. I am focusing on dense matrix factorization kernels: I have results on LU and preliminary results on QR. Using performance evaluation and formal methods, I am showing that they can tolerate crash-type failures, either re-spawning new processes on-the-fly or ignoring the error.
11:44	David Marcelo Petrocelli, Armando De Giusti and Marcelo Naiouf Collaborative, distributed, scalable and low-cost plat-form based on microservices, containers, mobile devices and Cloud services to solve compute-intensive tasks PRESENTER: David Marcelo Petrocelli ABSTRACT. When solving compute-intensive tasks, CPU/GPU hardware resources and specialized Grid, Custer, Cloud infrastructure are commonly used to achieve high performance. However, this requires a high initial capital expense and ongoing maintenance costs. In contrast, ARM-based mobile devices regularly see improvement in their capacity, stability, and processing power daily while becoming ever more ubiquitous and requiring no massive capital or operating expenditures thanks to their reduced size and energy efficiency. Given this shifting computer paradigm, it is conceivable that a cost- and power-efficient solution for our world’s HPC processing tasks would include ARM-based mobile devices, while they are idle during recharging periods. We proposed, developed, deployed and evaluated a distributed, collaborative, elastic and low-cost platform to solve HPC tasks recycling ARM mobile resources based on Cloud, microservices and containers, efficiently orchestrated via Kubernetes. To validate the system scalability, flexibility, and performance a lot of concurrent video transcoding scenarios were run. The results showed the system allows for improvements in terms of scalability, flexibility, stability, efficiency, and cost for HPC workloads
12:02	Daniel Maier Model-based Loop Perforation ABSTRACT. Many applications require a lower level of accuracy to compute good-enough results than the level of accuracy provided by the platform. Exploiting this gap specifically is the concept of Approximate Computing, where a small reduction in accuracy is traded for better performance or a reduction in energy consumption. We propose a novel approach for memory-aware perforation of GPU threads. The technique is further optimized and we show its applicability on embedded GPUs. In order to fully utilize the opportunities of our approach, we prosose a novel framework for automatic loop nest approximation based on polyhedral compilation. This approach generalizes state-of-the-art perforation techniques and introduces new multi-dimensional perforation schemes. Moreover, the approach is augmented with a reconstruction technique that significantly improves the accuracy of the results. As the transformation space is potentially large, we propose a pruning method to remove low-quality transformations.

10:50-12:30 Session 2B: ParaMo 2021 - 3rd International Workshop on Parallel Programming Models in High-Performance Cloud

Chairs:

Hyun-Wook Jin and Sangyoon Oh

Location: ParaMo Workshop Room

10:50

Ramazan Faruk Oguz, Mert Oz, Mehmet Sıddık Aktas and Erdi Olmezogullari

Extracting Information from Large Scale Graph Data: Case Study on Automated UI Testing

PRESENTER: Ramazan Faruk Oguz

ABSTRACT. Even though a large-scale graph structure is a powerful model to solve several challenging problems in various applications' domains today, it can also preserve various raw essences regarding user behavior, especially in the e-commerce domain. Information extraction is a promising research area in deep learning algorithms using large-scale graph data. This study focuses on understanding users' implicit navigational behavior on an e-commerce site that we can represent with the large-scale graph data. We propose a GAN-based e-business workflow by leveraging the large-scale browsing graph data and the footprints of navigational users' behavior on the e-commerce site. With this method, we have discovered various frequently repeated clickstream data sequences, which do not appear in training data at all. Therefore, We developed a prototype application to demonstrate performance tests on the proposed business e-workflow. The experimental studies we conducted show that the proposed methodology produces noticeable and reasonable outcomes for our prototype application.

11:15

Ruibo Chen and Wenjun Wu

Parallelizing Automatic Model Management System for AIOps on Microservice Platforms

PRESENTER: Ruibo Chen

ABSTRACT. With the gradual increase in the scale of applications based on microservice architecture, the complexity of system operation and maintenance is also significantly increasing. The emergence of AIOps makes it possible to automatically detect the state, allocate the resources, warn and detect the anomaly of the system through some machine learning models. Given dynamic online workloads, the running state of a production microservice system is constantly in flux . Therefore, it is necessary to continuously train, encapsulate and deploy models based on the current system status, so that the AIOps model can dynamically adapt to the system environment. To address this problem, this paper proposes a model management pipeline framework for AIOps on microservice platforms, and implements a prototype system based on Kubernetes to verify the framework. The system consists of three components: model training, model packaging and model deploying. Parallelization and parameter search are introduced in the model training process to support rapid training of multiple models and automated model hyperparameter tuning. Rapid deployment of models is supported by the model packaging and deploying components. Experiments were performed to verify the prototype system, and the experimental results illustrate the feasibility of the proposed framework. This work provides a valuable reference for the construction of an integrated and streamlined AIOps model management system.

10:50-12:30 Session 2C: AMTE 2021 - Asynchronous Many-Task systems for Exascale 2021

Chairs:

Irina Demeshko and Patrick Diehl

Location: AMTE Workshop Room

10:50	Cheng-Hsiang Chiu, Dian-Lun Lin and Tsung-Wei Huang An Experimental Study of SYCL Task Graph Parallelism for Large-Scale Machine Learning Workloads PRESENTER: Cheng-Hsiang Chiu ABSTRACT. Task graph parallelism has emerged as an important tool to efficiently execute large machine learning workloads on GPUs. Users describe a GPU workload in a task dependency graph rather than aggregated GPU operations and dependencies, allowing the runtime to run whole-graph scheduling optimization to significantly improve the performance. While the new CUDA graph execution model has demonstrated significant success on this front, the counterpart for SYCL, a general-purpose heterogeneous programming model using standard C++, remains nascent. Unlike CUDA graph, SYCL runtime leverages out-of-order queues to implicitly create a task execution graph induced by data dependencies. For explicit task dependencies, users are responsible for creating SYCL events and synchronizing them at a non-negligible cost. Furthermore, there is no specialized graph execution model that allows users to offload a task graph directly onto a SYCL device in a similar way to CUDA graph. This paper conducts an experimental study of SYCL’s default task graph parallelism by comparing it with CUDA graph on large-scale machine learning workloads in the recent HPEC Graph Challenge. Our result highlights the need for a new SYCL graph execution model in the standard.
11:15	Pedro Valero-Lara, Jungwon Kim, Oscar Hernandez and Jeffrey Vetter OpenMP target task: tasking and target offloading on heterogeneous systems PRESENTER: Pedro Valero-Lara ABSTRACT. This work evaluated the use of OpenMP tasking with tar-get GPU offloading as a potential solution for programming productivity and performance on heterogeneous systems. Also, it is proposed a new OpenMP specification to make the implementation of heterogeneous codes simpler by using OpenMP target task, which integrates both OpenMP tasking and target GPU offloading in a single OpenMP pragma. As a test case, the authors used one of the most popular and widely used Basic Linear Algebra Subprogram Level-3 routines: triangular solver (TRSM). To benefit from the heterogeneity of the current high-performance computing systems, the authors propose a different parallelization of the algorithm by using a nonuniform decomposition of the problem. This work used target GPU offloading inside OpenMP tasks to address the heterogeneity found in the hardware. This new approach can outperform the state-of-the-art algorithms, which use a uniform decomposition of the data, on both the CPU-only and hybrid CPU-GPU systems, reaching speedups of up to one order of magnitude. The performance that this approach achieves is faster than the IBM ESSL math library on CPU and competitive relative to a highly optimized heterogeneous CUDA version. One node of Oak Ridge National Laboratory’s supercomputer, Summit, was used for performance analysis.
11:40	Shahrzad Shirzad, Rod Tohid, Alireza Kheirkhahan, Bibek Wagle and Hartmut Kaiser Understanding the Effect of Task Granularity on Execution Time in Asynchronous Many-Task Runtime Systems PRESENTER: Shahrzad Shirzad ABSTRACT. Task granularity is a key factor in determining the performance of asynchronous many-task (AMT) runtime systems. The over-head of scheduling an excessive number of tasks with smaller granularities causes performance degradation while creating a few larger tasks leads to starvation and therefore under-utilization of resources. In this paper, we developed an analytical model of the execution time of an application with balanced parallel for-loops in terms of grain size, and number of cores. The parameters of this model mostly depend on the runtime and the architecture. We introduce an approach to suggest a range of possible grain sizes to achieve the best performance based on the proposed model. To the best of our knowledge, our analytical model is the first to explain the relationship between the execution time in terms of grain size, runtime, and physical characteristics of the machine in an asynchronous runtime system.

12:30-13:30Lunch Break

13:30-15:00 Session 3A: COLOC - 5th Workshop on Data Locality

Chairs:

Anshu Dubey and Emmanuel Jeannot

Location: COLOC Workshop Room

13:30	Torsten Hoefler Data-Centric Python - Productivity, portability and all with high performance!
14:05	Maxime Gonthier, Loris Marchal and Samuel Thibault Locality-Aware Scheduling of Independent Tasks for Runtime Systems PRESENTER: Maxime Gonthier ABSTRACT. A now-classical way of meeting the increasing demand for computing speed by HPC applications is the use of GPUs and/or other accelerators. Such accelerators have their own memory, which is usually quite limited, and are connected to the main memory through a bus with bounded bandwidth. Thus, a particular care should be devoted to data locality in order to avoid unnecessary data movements. Task-based runtime schedulers have emerged as a convenient and efficient way to use such heterogeneous platforms. When processing an application, the scheduler has the knowledge of all tasks available for processing on a GPU, as well as their input data dependencies. Hence, it is able to order tasks and prefetch their input data in the GPU memory (after possibly evicting some previously-loaded data), while aiming at minimizing data movements, so as to reduce the total processing time. In this paper, we focus on how to schedule tasks that share some of their input data (but are otherwise independent) on a GPU. We provide a formal model of the problem, exhibit an optimal eviction strategy, and show that ordering tasks to minimize data movement is NP-complete. We review and adapt existing ordering strategies to this problem, and propose a new one based on task aggregation. These strategies have been implemented in the StarPU runtime system. We present their performance on tasks from tiled 2D and 3D matrix products. Our experiments demonstrate that using our new strategy together with the optimal eviction policy reduces the amount of data movement as well as the total processing time.
14:30	Rui Silva and João Sobral High Performance Computing with Java Streams PRESENTER: João Sobral ABSTRACT. Java streams enable an easy-to-use functional-like programming style that transparently supports parallel execution. This paper presents an approach that improves the performance of stream-based Java applications. The approach enables the effective usage of Java for HPC applications, due to data locality improvements (i.e., support for efficient data layouts), without losing the object-oriented view of data in the code. The approach extends the Java collections API to hide additional details concerning the data layout, enabling the transparent use of more memory-friendly data layouts. The enhanced Java Collection API enables an easy adaptation of existing Java codes making those Java codes suitable for HPC. Performance results show that improving the data locality can provide a two-fold performance gain in sequential stream applications, which translated into a similar gain over parallel stream implementations. Moreover, the performance is comparable to similar C implementations using OpenMP.

13:30-15:00 Session 3B: Resilience 2021 - 14th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids

Chairs:

Christian Engelmann and Stephen Scott

Location: Resilience Workshop Room

13:30

Christian Engelmann

Faults, Errors and Failures in Extreme-Scale Supercomputers

ABSTRACT. Resilience is one of the critical challenges of extreme-scale high-performance computing systems, as component counts increase, individual component reliability decreases, and software complexity increases. Building a reliable supercomputer that achieves the expected performance within a given cost budget and providing efficiency and correctness during operation in the presence of faults, errors, and failures requires a full understanding of the resilience problem. This talk provides an overview of reliability experiences with some of the largest supercomputers in the world and recent achievements in developing a taxonomy, catalog and models that capture the observed and inferred fault, error, and failure conditions in these systems.

14:20

Jörg Keller and Sebastian Litzinger

Energy-Efficient Execution of Streaming Task Graphs with Parallelizable Tasks on Multicore Platforms with Core Failures

PRESENTER: Sebastian Litzinger

ABSTRACT. Real-time applications often take the form of streaming applications, where a stream of inputs such as camera images is processed by an application represented as a task graph. The workload together with the required throughput often necessitates processing on a multicore system and also demands parallelization of large tasks. We extend a scheduling algorithm for such applications, originally devised to handle varying task workloads, to also cover varying core count, e.g. caused by failure of a core. We use frequency scaling to accelerate processing when the necessity to re-execute tasks from the crashed core arises, to maintain throughput. We evaluate the algorithm by scheduling synthetic task graphs that represent corner cases and a real streaming application.

13:30-15:00 Session 3C: AMTE 2021 - Asynchronous Many-Task systems for Exascale 2021

Chairs:

Steven Brandt and Patrick Diehl

Location: AMTE Workshop Room

13:30

Daisy Hollman

The Past, Present, and Future of Asynchrony in C++

ABSTRACT. Over the past quarter century, C++ has emerged as the language of choice for performance-sensitive applications where large-scale software engineering challenges are present and zero-overhead abstraction is a requirement. As such, with the continued increase in hardware concurrency, the language standard and the accompanying standard library have evolved to remain at the forefront of parallel and concurrent programming model design. From the introduction of standard threads, atomics, and eager futures in C++11 aimed at use cases with several cores, to the C++17 parallel algorithms and C++20 atomic references with fewer implicit constraints for use cases on hundreds to thousands of cores, to the emerging executor and Sender/Receiver models that can lazily describe intricately interdependent work spanning vast timescales and granularities, the demand for zero-overhead abstraction in C++ has led to an ever-widening—and often daunting—set of tools for solving difficult problems at scale.

In this talk, we'll discuss the evolution of past and present abstractions for parallelism and concurrency in standard C++, explore the reasons why the current state of affairs is unsustainable, and present several of the short-term and long-term directions for the future of asynchronous programming in standard C++. We will conclude with a particular focus on the challenges associated with (and upcoming solutions for) expressing asynchrony at interface boundaries in large-scale, multi-library software stacks that compose today's production-scale applications.

14:20

Ben Bergen, Irina Demeshko, Charles Ferenbaugh, Davis Herring, Li-Ta Lo, Julien Loiseau, Navamita Ray and Andrew Reisner

FleCSI 2.0: The Flexible Computational Science Infrastructure Project

PRESENTER: Ben Bergen

ABSTRACT. The FleCSI 2.0 programming system supports multiphysics application development through a runtime abstraction layer, and by providing core topology types that can be customized for speciﬁc nu-merical methods. The abstraction layer provides a single-source programming interface for distributed and shared-memory data parallelism through task and kernel execution, and has been demonstrated to introduce virtually no runtime overhead. FleCSI’s core topology types represent a rich set of basic data structures that can be specialized to create application-facing interfaces for a variety of diﬀerent physics packages. Using the FleCSI control and data models, it is straightforward to compose multiple packages to create full multiphysics applications. When used with the Legion backend, FleCSI oﬀers extended runtime analysis that can increase task concurrency, facilitate load balancing, and allow for portability across heterogeneous computing architectures.

15:00-15:15Coffee Break

15:15-16:40 Session 4A: COLOC - 5th Workshop on Data Locality

Chairs:

Anshu Dubey and Emmanuel Jeannot

Location: COLOC Workshop Room

15:15	Neil Butcher and Peter Kogge Exploring Strategies to Improve Locality Across Many-core Affinities PRESENTER: Neil Butcher ABSTRACT. Several recent rank one systems in the Top500 include many-core chips with complex memory systems, including intermediate levels of memory, multiple memory channels, and explicit affinity of specific memory channels to specific sub-blocks of cores. Creating codes to utilize these features efficiently is thus a significant challenge. This paper uses Intel's Knights Landing (KNL) processor as a testbed, as it includes both intermediate memory and multiple architectural knobs to adjust affinity. This paper also uses a 2D Fast Fourier Transform (FFT) as a test case to explore what combination of architectural and algorithmic techniques are of most benefit. Several codes are used, including state-of-the-art FFT codes FFTW and MKL, along with two additional simple parallel 2D FFT codes exploring explicit options. The conclusions are that intermediate memory does provide a significant boost, that there are architectural modes in the memory subsystem that are better suited to FFT than others, and that a cache-oblivious FFT performs consistently across affinity modes.
15:40	Muhammet Abdullah Soyturk, Palwisha Akthar, Erhan Tezcan and Didem Unat Monitoring Collective Communication Among GPUs PRESENTER: Muhammet Abdullah Soyturk ABSTRACT. Communication among devices in multi-GPU systems plays an important role in terms of performance and scalability. In order to optimize an application, programmers need to know the type and amount of the communication happening among GPUs. Although there are prior works to gather this information in MPI applications on distributed systems and multi-threaded applications on shared memory systems, there is no tool that identifies communication among GPUs. Our prior work, ComScribe, presents a peer-to-peer communication detection tool for GPUs sharing a common host. In this work, we extend ComScribe to identify communication among GPUs for collective and peer-to-peer communication primitives in NVIDIA's NCCL library. In addition to peer-to-peer communications, collective communications are commonly used in HPC and AI workloads thus it is important to monitor the induced data movement due to collectives. Our tool extracts the size and the frequency of data transfers in an application and visualizes them as a communication matrix. To demonstrate the tool in action, we present communication matrices and some statistics for two applications coming from machine translation and image classification domains.
16:05	William Hart ECP: Data Analytics and Optimization Applications On Accelerator-Based Systems ABSTRACT. The goal of the Exascale Computing Project (ECP) is the accelerated delivery of a capable exascale computing ecosystem to provide breakthrough solutions that address our most critical challenges in scientific discovery, energy assurance, economic competitiveness, and national security. A key element of ECP is developing applications that can effectively leverage new accelerator-based exascale systems. The Data Analytics and Optimization (DAO) applications in EPC represent a strategic investment in an emerging area whose predictive capability involves modern data analysis and machine learning, complex combinatorial models of data interactions, and constrained nonlinear algebraic representations of complex systems. The ECP DAO applications reflect new mission drivers within the United States government that require data-driven models of complex systems. We will review the ECP DAO applications and discuss lessons learned that reflect similarities and differences with traditional physics-based simulation applications.

15:15-16:30 Session 4B: Resilience 2021 - 14th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids

Chairs:

Thomas Naughton and Stephen Scott

Location: Resilience Workshop Room

15:15	Ioannis Vardas, Manolis Ploumidis and Manolis Marazakis Exploring the impact of node failures on the resource allocation for parallel jobs PRESENTER: Ioannis Vardas ABSTRACT. Increasing the size and complexity of modern HPC systems also increases the probability of various types of failures. Failures may disrupt application execution and waste valuable system resources due to failed executions. In this work, we explore the effect of node failures on the completion times of MPI parallel jobs. We introduce a simulation environment that generates synthetic traces of node failures, assuming that the times between failures for each node are independently distributed, each node following the same distribution but with different parameters. We also present a resource allocation approach that considers node failure probabilities for various system partitions before assigning resources to a job. We compare the proposed approach with Slurm's resource allocation and a failure-oblivious heuristic that randomly selects the partition for a job. We present results for a case study that assumes a 4D-torus topology and a Weibull distribution for each node's time between failures. This case study considers several different traces of node failures, capturing different failure patterns. Our results show little benefit for jobs of relatively short duration. For longer jobs though, the decrease in the time needed to complete a batch of identical jobs is quite significant when compared with Slurm or the failure-oblivious heuristic, up to 82% depending on parameters of the simulated trace.
15:40	Mohit Kumar and Christian Engelmann RDPM: An Extensible Tool for Resilience Design Patterns Modeling PRESENTER: Mohit Kumar ABSTRACT. Resilience to faults, errors, and failures in extreme-scale HPC systems is a critical challenge. Resilience design patterns offer a new, structured hardware and software design approach for improving resilience. While prior work focused on developing performance, reliability, and availability models for resilience design patterns, this paper extends it by providing a Resilience Design Patterns Modeling (RDPM) tool which allows (1) exploring performance, reliability, and availability of each resilience design pattern, (2) offering customization of parameters to optimize performance, reliability, and availability, and (3) allowing investigation of trade-off models for combining multiple patterns for practical resilience solutions.
16:05	Kurt Ferreira and Scott Levy Characterizing Memory Failures Using Benford’s Law PRESENTER: Kurt Ferreira ABSTRACT. In this paper, we examine the lifetime of failures on the Cielo supercom- puter that was located at Los ALamos National Labs, looking specifically at the time between faults on this system. Through this analysis, we show that the time between uncorrectable faults for this system obeys Ben- ford’s law, This law applies to a number of naturally occurring collections of numbers and states that the leading digit is more likely to be small, for example a leading digit of 1 is more likely than 9. We also show which common distributions used to model failures also follow this law. This work provides critical analysis on the distribution of times between failures for extreme-scale systems. Specifically, the analysis in this work could be used as a simple form of failure prediction or used for modeling realistic failures.

15:15-16:30 Session 4C: AMTE 2021 - Asynchronous Many-Task systems for Exascale 2021

Panel session

Chairs:

Irina Demeshko and Patrick Diehl

Location: AMTE Workshop Room

16:40-17:30 Session 5: COLOC - 5th Workshop on Data Locality

Panel session

Chairs:

Anshu Dubey and Emmanuel Jeannot

Location: COLOC Workshop Room