Program for Thursday, October 21st

10:00

Amit Ruhela

Welcome - Day3

10:15

Kiran Bhaganagar

Extreme computing and Extreme events in the environment

11:15

Yehonatan Fridman, Yaniv Snir, Matan Rusanovsky, Kfir Zvi, Harel Levin, Danny Hendler, Hagit Attiya and Gal Oren

Assessing the Use Cases of Persistent Memory in High-Performance Scientific Computing

ABSTRACT. As the High Performance Computing (HPC) world moves towards the exascale era, huge amounts of data must be analyzed, manipulated and stored. In the traditional storage/memory hierarchy, each compute node retains its data objects in its local volatile DRAM. Whenever the DRAM's capacity becomes insufficient for storing this data, either the computation must be distributed between several compute nodes, or some portion of these data objects must be stored in a non-volatile block device such as a hard disk drive (HDD) or an SSD storage device. These standard block devices offer large and relatively cheap non-volatile storage, but their access times are orders of magnitude slower than those of DRAM. Optane DataCenter Persistent Memory Module (DCPMM), a new technology introduced by Intel, provides non-volatile memory that can be plugged into standard memory bus slots (DDR DIMMs) and can therefore be accessed much faster than standard storage devices. In this work, we present and analyze the results of a comprehensive performance assessment of several ways in which DCPMM can 1) replace standard storage devices, and 2) replace or augment DRAM for improving the performance of HPC scientific computations. To achieve this goal, we have configured an HPC system such that DCPMM can service I/O operations of scientific applications, replace standard storage devices and file systems (specifically for diagnostics and checkpoint-restarting), and serve to expand applications' main memory. We focus on changing the scientific codes as little as possible, while allowing them to access the NVM transparently, as if they were accessing persistent storage.

Our results show that DCPMM allows scientific applications to fully utilize nodes' locality by providing them with sufficiently large main memory. Moreover, it can also serve as a high-performance replacement for persistent storage. Thus, DCPMM has the potential to replace standard HDD and SSD storage devices in HPC architectures and to enable a more efficient platform for modern supercomputing applications.

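To make the checkpoint-restart use case above more concrete, here is a minimal sketch of writing and reading a checkpoint through an ordinary memory-mapped file. It is not taken from the talk: the function names `write_checkpoint` and `read_checkpoint` and the flat array-of-doubles layout are assumptions made for illustration. The idea it illustrates is that code written against a regular file path needs no changes if that path is later placed on a file system backed by persistent memory (for example, a DAX-mounted one, which is also an assumption here).

```python
import mmap
import os
import struct


def write_checkpoint(path, values):
    """Store a non-empty list of floats in a memory-mapped file (hypothetical layout)."""
    if not values:
        raise ValueError("checkpoint must contain at least one value")
    payload = struct.pack(f"<{len(values)}d", *values)
    with open(path, "w+b") as f:
        f.truncate(len(payload))          # size the file before mapping it
        with mmap.mmap(f.fileno(), len(payload)) as m:
            m[:] = payload                # plain memory stores into the mapping
            m.flush()                     # ask the OS to make the stores durable


def read_checkpoint(path):
    """Load the floats written by write_checkpoint."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ) as m:
            return list(struct.unpack(f"<{size // 8}d", m[:]))
```

The application only sees a path, so the choice between SSD, HDD, or persistent memory becomes a deployment decision rather than a code change.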
11:45

Richard Barella

Clustering Real-time Test Failure Output for Triage and Analysis

ABSTRACT. With hundreds of test failures each day, it becomes difficult for a dozen High-Performance Computing validation engineers to keep track of test failures and to remember which failures have been seen before and which are unique. It is also difficult to look at the failures for a single regression run and easily tell which tests failed for the same reason. To solve both of these problems, the TextClust algorithm is used to cluster similar results together in real time. First, test failure text is preprocessed by removing lines that appeared in passed test results; then the processed text is passed into the TextClust algorithm for real-time clustering; and finally, the result is displayed in a web app showing a summary for each group of similar test failures.
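The pipeline described in this abstract can be sketched roughly as follows. This is not the TextClust algorithm and not the speaker's code: the clustering step is replaced by a much simpler greedy grouping on Jaccard token similarity, and all names (`strip_passing_lines`, `OnlineClusterer`, the 0.5 threshold) are assumptions chosen for illustration. The sketch only shows the two stages: removing lines already seen in passing runs, then assigning each failure to a cluster as it arrives.

```python
from typing import Dict, List, Set


def strip_passing_lines(failure_text: str, passed_lines: Set[str]) -> List[str]:
    """Keep only non-empty lines that never appeared in passed test output."""
    kept = []
    for raw in failure_text.splitlines():
        line = raw.strip()
        if line and line not in passed_lines:
            kept.append(line)
    return kept


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class OnlineClusterer:
    """Greedy one-pass grouping; a simple stand-in for a stream clusterer."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self.clusters: List[Dict] = []  # each: {"tokens": set, "members": list}

    def add(self, test_name: str, lines: List[str]) -> int:
        """Assign one failure to the most similar cluster and return its index."""
        tokens = set(" ".join(lines).lower().split())
        best_idx, best_sim = -1, 0.0
        for i, cluster in enumerate(self.clusters):
            sim = jaccard(tokens, cluster["tokens"])
            if sim > best_sim:
                best_idx, best_sim = i, sim
        if best_idx >= 0 and best_sim >= self.threshold:
            self.clusters[best_idx]["members"].append(test_name)
            return best_idx
        # No cluster is similar enough: the first member becomes the representative.
        self.clusters.append({"tokens": tokens, "members": [test_name]})
        return len(self.clusters) - 1
```

A summary view such as the web app mentioned in the abstract could then list each cluster's `members`, so that failures sharing a cause appear together.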

12:45-14:30 Session 8: Tutorial

Chair:

Thomas Steinke

12:45

Mohamad Chaarawi

DAOS Update and Tutorial

ABSTRACT. DAOS (Distributed Asynchronous Object Storage) is a next-generation, open-source, scale-out storage stack designed from the ground up for persistent memory and NVMe storage, offering extremely high bandwidth and IOPS. DAOS presents a rich, scalable storage interface unconstrained by traditional POSIX limitations and directly integrated with several data formats (e.g. HDF5) and frameworks (e.g. Spark). DAOS supports POSIX namespace encapsulation with relaxed compliance for smooth migration of applications. This talk will present a development update on DAOS, the upcoming 2.0 release and its features, and a progress update on middleware support. We will also present the applications currently being tested with DAOS, recent IO-500 performance numbers, and some of the systems and resources available for users to test their apps with DAOS.

A tutorial/demo will show how users interact with DAOS and how it integrates into existing application workflows. Attendees will be given a "field guide" on how to choose a middleware, and how to use that middleware with DAOS, depending on their apps and workflows.

Finally, we outline some of the future directions and research ideas being explored within DAOS (accelerators, pipelining API etc.).

14:15

Amit Ruhela

Conference Closing Remarks - Day3