HPC ASIA 2019: INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING IN ASIA PACIFIC REGION, 2019
PROGRAM FOR TUESDAY, JANUARY 15TH, 2019

10:30-12:10 Session 1: Programming and I/O
10:30
MPI over HDFS: High-Performance Computing over a Commodity Filesystem

ABSTRACT. With the recent trend towards integrating high-performance computing (HPC) with big data (BIGDATA) processing, running MPI over HDFS offers a promising approach for delivering better scalability and fault tolerance to traditional HPC applications, but it comes with many challenges that have discouraged such an approach: (1) slow two-sided communication in MPI to support intermediate data processing, (2) a focus on enabling N-1 write that is subject to the default (and naive) HDFS block-placement policy, and (3) a pipeline writing mode in HDFS that cannot fully utilize the underlying HPC hardware. Hence, without a holistic and systematic solution, the integration of HPC and BIGDATA falls short of delivering optimal performance.

As such, we propose middleware that resides between MPI applications and HDFS in order to Aggregate and Reorder intermediate data and Coordinate (ARC) computation and I/O. ARC provides highly optimized merge and sort for intermediate data to enrich MPI functionality, and it creates opportunities to overlap computation and I/O via one-sided communication. It also provides a coordinator that improves performance by leveraging data locality and communication patterns. For disk I/O, ARC realizes a parallel write mechanism with a delay write mode for fast data flushing in HDFS. Collectively, ARC overlaps computation, intermediate data processing, and disk I/O. To demonstrate the efficacy of ARC, we port two HPC bioinformatics applications, i.e., pBWA and DIAMOND, to ARC. The experimental results show that on a 17-node cluster, ARC can deliver up to 1.92x and 1.78x speedup over MPI I/O and HDFS pipeline write implementations, respectively.
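
The "delay write mode" described above, in which data is buffered and flushed to storage while computation continues, can be sketched as a producer thread handing blocks to a background flusher. This is an illustrative toy (class and method names are hypothetical, and a plain list stands in for HDFS), not ARC's actual interface:

```python
import threading
import queue

class DelayWriter:
    """Toy delayed, overlapped write path: the compute thread enqueues
    intermediate blocks and keeps computing while a background thread
    flushes them (here, to a list standing in for HDFS)."""

    def __init__(self):
        self.q = queue.Queue()
        self.flushed = []
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def write(self, block):
        # Returns immediately, so computation and I/O overlap.
        self.q.put(block)

    def _drain(self):
        while True:
            block = self.q.get()
            if block is None:       # sentinel: no more blocks
                break
            self.flushed.append(block)

    def close(self):
        self.q.put(None)
        self.worker.join()

w = DelayWriter()
for b in range(5):                  # "computation" producing blocks
    w.write(b)
w.close()
print(w.flushed)                    # -> [0, 1, 2, 3, 4]
```

The single FIFO queue preserves write order while decoupling the producer's progress from disk latency, which is the core of the overlap argument.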

10:55
On the Portability of GPU-Accelerated Applications via Automated Source-to-Source Translation

ABSTRACT. Over the past decade, accelerator-based supercomputers have grown from 0% to roughly 50% of the performance share on the TOP500. Ideally, GPU-accelerated code on such systems should be "write-once, run-anywhere," regardless of the GPU device (or for that matter, any parallel device, e.g., CPU or FPGA). In practice, however, portability can be significantly more limited due to the sheer volume of code implemented in non-portable languages. For example, the tremendous success of CUDA, as evidenced by the vast cornucopia of CUDA-accelerated applications, makes it infeasible to rewrite all these applications to achieve portability. Consequently, we achieve portability by using an automated CUDA-to-OpenCL source-to-source translator called CU2CL. To demonstrate the state of the practice, we use CU2CL to automatically translate three significant CUDA-optimized codes to OpenCL, thus enabling the codes to run on other GPU-accelerated systems (as well as CPU and FPGA-based systems). These automatically translated codes deliver performance portability, including as much as three-fold performance improvement, on a GPU device not supported by CUDA.
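
The core of CUDA-to-OpenCL translation is a systematic mapping between the two programming models' identifiers. The sketch below shows a deliberately naive textual substitution over a few well-known correspondences; CU2CL itself is Clang-based and handles far more than string replacement:

```python
# Simplified CUDA -> OpenCL identifier map (standard correspondences;
# the mapping table itself is illustrative, not CU2CL's internal one).
API_MAP = {
    "__syncthreads()": "barrier(CLK_LOCAL_MEM_FENCE)",
    "threadIdx.x": "get_local_id(0)",
    "blockIdx.x": "get_group_id(0)",
    "blockDim.x": "get_local_size(0)",
    "__global__": "__kernel",
    "__shared__": "__local",
}

def translate(line: str) -> str:
    """Rewrite recognized CUDA identifiers into their OpenCL counterparts."""
    for cuda, ocl in API_MAP.items():
        line = line.replace(cuda, ocl)
    return line

print(translate("int i = blockIdx.x * blockDim.x + threadIdx.x;"))
# -> int i = get_group_id(0) * get_local_size(0) + get_local_id(0);
```

Real translators must also rewrite host-side runtime calls, kernel launch syntax, and memory management, which is why an automated, compiler-based tool is needed rather than ad hoc substitution.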

11:20
Distributed and Parallel Programming Paradigms on the K computer and a Cluster

ABSTRACT. In this paper, we focus on a distributed and parallel programming paradigm for massively multi-core supercomputers. We introduce YML, a development and execution environment for parallel and distributed applications, based on a graph of task components scheduled at runtime and optimized for several middlewares. We then show why YML may be well adapted to applications running on a large number of cores. The tasks are developed with XMP, a directive-based PGAS language. We use YML/XMP to implement block-wise Gaussian elimination to solve linear systems, and we also implemented it with XMP and MPI without blocks. ScaLAPACK was also used to create a non-blocked implementation that solves a dense linear system through LU factorization. Furthermore, we run the block-wise version with different numbers of blocks and of processes per task, and we find that a good compromise between the two gives interesting results. YML/XMP obtains results faster than XMP on the K computer, and close to XMP, MPI, and ScaLAPACK on CPU clusters. We conclude that parallel and distributed multi-level programming paradigms like YML/XMP may be interesting solutions for extreme-scale computing.

11:45
Multi-accelerator extension in OpenMP based on PGAS model

ABSTRACT. Many systems used in the HPC field have multiple accelerators on a single compute node. However, programming for multiple accelerators is more difficult than programming for a single accelerator. Therefore, in this paper, we propose an OpenMP extension that allows easy programming for multiple accelerators. We extend existing OpenMP syntax to create a Partitioned Global Address Space (PGAS) over the separate memories of several accelerators. This feature enables users to program multiple accelerators with ease. For the performance evaluation, we implement the STREAM Triad and HIMENO benchmarks using the proposed OpenMP extension. Evaluating performance on a compute node equipped with up to four GPUs, we confirm that the proposed OpenMP extension delivers sufficient performance.
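
The essence of a PGAS layer over several accelerator memories is a mapping from a global index space to (device, local offset) pairs, so the programmer can use global indices while the runtime routes accesses. A minimal sketch of such a block distribution (the function names are hypothetical, not the paper's proposed syntax):

```python
def pgas_layout(n, ndev):
    """Block-distribute a global index space of size n over ndev device
    memories, as a PGAS layer over multiple accelerators would.
    Returns the per-device chunk size and a global->local resolver."""
    chunk = (n + ndev - 1) // ndev      # ceiling division

    def owner(gidx):
        dev = gidx // chunk
        return dev, gidx - dev * chunk  # (device id, local offset)

    return chunk, owner

chunk, owner = pgas_layout(1000, 4)
print(chunk)        # -> 250
print(owner(0))     # -> (0, 0)
print(owner(999))   # -> (3, 249)
```

With such a mapping in the runtime, a directive-based extension can let a loop over global indices execute on whichever device owns each block, hiding per-device allocation and transfers from the user.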

13:00-14:40 Session 2: Communication and Performance Modeling
13:00
An Extended Roofline Model with Communication-Awareness for Distributed-Memory HPC Systems

ABSTRACT. Performance modeling of parallel applications on distributed-memory systems is a challenging task due to the combined effects of CPU speed, memory access time, and communication cost. In this paper, we propose a simple and intuitive graphical model that extends the commonly used Roofline performance model to include communication cost in addition to memory access time and peak CPU performance. This new performance model inherits the simplicity of the original Roofline model and enables performance evaluation along a third dimension of communication performance. Such a model can greatly facilitate and expedite the analysis, development, and optimization of parallel programs on high-end computer systems. We empirically validate the extended Roofline model using floating-point-computation-bound, memory-bound, and communication-bound applications. Three distinct high-end computing platforms have been tested: 1) high performance computing (HPC) systems, 2) high throughput computing systems, and 3) cloud computing systems. Our experimental results with four parallel applications show that the new model can approximately evaluate the performance of different programs on various distributed-memory systems. Furthermore, the extended model is able to provide insight into how the problem size can affect the performance of parallel applications, a property revealed by the new dimension of communication cost analysis.
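
An extended-Roofline-style bound of the kind described above can be written as a three-way minimum: attainable performance is capped by peak compute, by memory bandwidth times arithmetic intensity, and by network bandwidth times communication intensity. This sketch is a plausible formulation consistent with the abstract, not necessarily the paper's exact equation:

```python
def attainable_gflops(ai, ci, peak, mem_bw, net_bw):
    """Extended-Roofline-style performance bound (GFLOP/s).

    ai     : arithmetic intensity, flops per byte moved from memory
    ci     : communication intensity, flops per byte communicated
    peak   : peak compute rate, GFLOP/s
    mem_bw : memory bandwidth, GB/s
    net_bw : network bandwidth, GB/s
    """
    return min(peak, mem_bw * ai, net_bw * ci)

# Example: 2 TFLOP/s peak, 100 GB/s memory, 10 GB/s network.
# Here the memory ceiling (100 * 4 = 400 GFLOP/s) binds.
print(attainable_gflops(ai=4.0, ci=500.0, peak=2000.0, mem_bw=100.0, net_bw=10.0))
# -> 400.0
```

Which of the three terms binds tells the developer whether to optimize kernels, data layout, or the communication pattern, which is exactly the diagnostic value the abstract claims.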

13:25
A Memory Saving Communication Method Using Remote Atomic Operations

ABSTRACT. The MPI library for the K computer introduced a memory-saving protocol. However, the protocol still requires memory in proportion to the number of MPI processes, and a memory shortage can occur when the number of processes reaches millions or tens of millions. To solve this problem, we propose the shared receive buffer method, a new communication protocol that uses remote atomic operations. The method is easy to implement if the interconnect supports remote memory access and remote atomic memory operations. We implemented the shared receive buffer method on a PRIMEHPC FX100 system and evaluated it. The per-process memory usage of the proposed method is about one tenth that of the existing method.
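
The shared-receive-buffer idea can be illustrated as follows: instead of one receive buffer per peer (memory growing with the number of processes P), all senders reserve slots in a single fixed-size buffer via an atomic fetch-and-add on a tail counter, as a NIC-side remote atomic would. A threaded Python sketch (class and method names are illustrative, not the paper's API; a lock stands in for the hardware atomic):

```python
import threading

class SharedReceiveBuffer:
    """One shared buffer for all senders; slots are claimed atomically."""

    def __init__(self, nslots):
        self.slots = [None] * nslots
        self.tail = 0
        self.lock = threading.Lock()   # stands in for the remote atomic

    def remote_put(self, sender, payload):
        with self.lock:                # atomic fetch-and-add on 'tail'
            slot = self.tail
            self.tail += 1
        self.slots[slot % len(self.slots)] = (sender, payload)
        return slot

buf = SharedReceiveBuffer(nslots=8)
threads = [threading.Thread(target=buf.remote_put, args=(r, f"msg{r}"))
           for r in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(buf.tail)   # -> 4: every sender claimed a distinct slot
```

Because the fetch-and-add guarantees each sender a unique slot without any per-peer state, the receiver's memory footprint is fixed by the buffer size rather than by P.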

13:50
Comparative benchmarking of HPC systems for GSS applications

ABSTRACT. The work presented in this paper was done in the Centre of Excellence for Global Systems Science (CoeGSS), an interdisciplinary project funded by the European Commission that provides decision support in the face of global challenges by bringing together HPC and global systems science (GSS). This paper proposes a GSS benchmark, with the aim of finding the most suitable HPC architecture and the best HPC system for running GSS applications effectively. GSS provides evidence about global systems challenges, e.g. the network structure of the world economy; energy, water, and food supply systems; the global financial system; the global city system; and the scientific community. The outcome of the analysis is the definition of a benchmark that best represents the GSS environment. Three exemplary challenges were defined as pilot applications: Health Habits, Green Growth, and Global Urbanisation, extended with additional applications from the GSS ecosystem: iterative proportional fitting (IPF); data rastering, a preprocessing step that converts vectorial representations of georeferenced data into raster files later used as simulation input; the Weather Research and Forecasting (WRF) model; CMAQ/CCTM (Community Multiscale Air Quality Modelling System / CMAQ Chemistry-Transport Model); CM1 (cloud modelling); ABMS (agent-based modelling and simulation); and OpenSWPC (an open-source seismic wave propagation code). This list is quite rich and reflects the real GSS world as closely as possible, given, for example, the availability of real-world applications. Additionally, the authors tested new HPC platforms based on Intel Xeon Gold 6140, AMD Epyc, ARM Hi1616, and IBM Power8+. Due to hardware availability, the testbed consisted of a limited number of nodes, which limited the ability to run full scalability tests for the given applications.
However, even this small number of available computational units (cores) can provide valuable outcomes, including an architecture comparison for different applications based on execution times, TDP (thermal design power), and TCO (total cost of ownership). These are the basic metrics used to rank the HPC architectures. Finally, this document is intended to be valuable information for the GSS community, both for future analysis of their specific demands and, in general, to help develop a mature final benchmark set reflecting the requirements and specialty of the GSS environment. As none of the existing benchmarks is dedicated to the GSS community, the authors decided to create one, the GSS benchmark, to serve and help GSS users in their future work.

14:15
Scalable communication performance prediction using auto-generated pseudo MPI event trace

ABSTRACT. For the co-design of HPC systems and applications, it is important to study how application performance is affected by the characteristics of future systems, not just on a single compute node but also for parallel processing including inter-node communication. Trace-driven network simulators have been widely used because of their simplicity. However, they require trace files corresponding to the simulated system size; if a future system is larger than the current system, the trace files cannot be used directly, which makes it difficult to simulate a system larger than the current one. To address this scaling problem in trace-driven network simulation, we have proposed a method called SCAlable Mpi Profiler (SCAMP). The SCAMP method runs an application on a current system, obtains MPI event trace files, copies and edits these real trace files to create a large set of pseudo MPI event trace files for a future system, and finally drives a network simulator with the pseudo trace files. We also implemented a pseudo MPI event trace file generator based on the analysis of LLVM's intermediate representation. We aim to easily obtain a first-order approximation of the communication performance for various network configurations and proxy applications. In this paper, we describe the SCAMP system design and implementation, as well as several performance evaluation results.
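
The copy-and-edit step at the heart of this approach can be illustrated by a deliberately naive trace scaler: each pseudo rank clones the event list of a measured rank and remaps peer ranks so that relative communication offsets are preserved at the larger size. SCAMP's real trace editing is far more careful about communication patterns; the function below is only a sketch with illustrative names:

```python
def scale_trace(events, cur_ranks, new_ranks):
    """Replicate an MPI event trace measured on cur_ranks processes to a
    pseudo trace for new_ranks processes.

    events : dict rank -> list of (op, peer_rank, msg_size) tuples
    Each pseudo rank r clones measured rank r % cur_ranks and shifts
    peer ranks so the relative offset (peer - src) is preserved.
    """
    pseudo = {}
    for r in range(new_ranks):
        src = r % cur_ranks            # which measured rank to clone
        pseudo[r] = [(op, (r + (peer - src)) % new_ranks, size)
                     for (op, peer, size) in events[src]]
    return pseudo

# Measured 2-rank trace: rank 0 sends 1 KiB to rank 1, which receives it.
trace = {0: [("send", 1, 1024)], 1: [("recv", 0, 1024)]}
big = scale_trace(trace, cur_ranks=2, new_ranks=4)
print(big[2])   # -> [('send', 3, 1024)]: the pairwise pattern repeats
```

The scaled trace keeps the pairwise pattern (0-1, 2-3, ...), giving the network simulator a plausible communication workload at the target size.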

15:10-16:50 Session 3: Applications
15:10
Acceleration of Symmetric Sparse Matrix-Vector Product using Improved Hierarchical Diagonal Blocking Format

ABSTRACT. In a previous study, Guy et al. proposed accelerating the sparse matrix-vector product (SpMV) using the Hierarchical Diagonal Blocking (HDB) format, which recursively applies partitioning, reordering, and blocking to a symmetric sparse matrix and stores the matrix hierarchically in blocks. In the present study, we examine two problems with the HDB format and provide a solution for each. First, SpMV using the HDB format has partial dependencies among hierarchy levels. This limits parallelism, which decreases as the hierarchy level of the computed node becomes shallower. We propose cutting these dependencies using work vectors. Second, accelerating the individual nodes of the conventional HDB format, which are stored in Compressed Row Storage (CRS) format, with Single Instruction Multiple Data (SIMD) instructions has not been studied. We therefore developed a BCRS-HDB format that uses the Blocked CRS (BCRS) format for node storage, restoring SIMD speedup for the HDB format. In addition, we compare performance on 19 sparse matrices across the general formats (CRS and BCRS) using the Intel Math Kernel Library (MKL), the conventional HDB format, and the expanded HDB format. The results show that the expanded HDB format was fastest for 16 of the sparse matrices, being up to 1.86 times faster than the CRS format.

15:35
Cache-efficient implementation and batching of tridiagonalization on manycore CPUs

ABSTRACT. We herein propose an efficient implementation of tridiagonalization (TRD) for small matrices on manycore CPUs. Tridiagonalization is a matrix decomposition that is used as a preprocessor for eigenvalue computations. Further, TRD for such small matrices appears even in the HPC environment as a subproblem of large computations.

To utilize the large cache memory of recent manycore CPUs, we reconstructed all parts of the implementation by introducing a systematic code generator to achieve performance portability and future extensibility. The flexibility of the system allows us to incorporate the ``BLAS+X'' approach, thereby improving the data reusability of the TRD algorithm and batching.

The performance results indicate that our system outperforms the library implementations of TRD nearly twofold (or more for small matrices) on three different manycore CPUs: Fujitsu SPARC64, Intel Xeon, and Intel Xeon Phi. As an extension, we also implemented batched execution of TRD with a cache-aware scheduler on top of our system. It not only doubles the peak performance for small matrices of $n=O(100)$, but also improves performance significantly up to $n = O(1,000)$, which is our target.

16:00
An investigation into the impact of the structured QR kernel on the overall performance of the TSQR algorithm

ABSTRACT. The TSQR algorithm is a communication-avoiding algorithm for computing the QR factorization of a tall and skinny matrix. In TSQR, a kernel that computes the QR factorization of a structured matrix, called structured QR, is executed repeatedly. Although a single execution of structured QR has a small computational cost, the number of repetitions grows with the number of processes in a parallel computation. Furthermore, its complicated computational pattern and small matrix size make high performance difficult to achieve. The resulting cost of structured QR becomes a serious bottleneck in massively parallel computation. In this paper, we focus on the structured QR kernel and discuss its implementation. Several kernels, including those provided in LAPACK, are compared on modern processors, and the impact of differences in structured QR kernels on the overall performance of the TSQR algorithm is investigated.
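
Why the structured QR kernel dominates at scale can be seen from the shape of TSQR's binary reduction tree: under the standard formulation, each process participates in roughly log2(P) structured-QR factorizations, so the kernel's cost is multiplied by the tree depth. A small sketch of that count:

```python
import math

def structured_qr_steps(nprocs):
    """Depth of TSQR's binary reduction tree for nprocs processes,
    i.e. roughly how many structured-QR kernel invocations lie on the
    critical path (assuming the standard pairwise reduction)."""
    return math.ceil(math.log2(nprocs))

for p in (2, 64, 1024, 100000):
    print(p, structured_qr_steps(p))
# e.g. 1024 processes -> 10 structured-QR levels on the critical path
```

A kernel that is, say, 20% slower therefore costs 20% on every one of those levels, which is why the paper's comparison of kernel implementations matters for overall TSQR performance.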

16:25
Numerical Simulation of Two-phase Flow in Naturally Fractured Reservoirs Using Dual Porosity Method on Parallel Computers

ABSTRACT. This paper introduces the two-phase oil-water flow in naturally fractured reservoirs and its numerical methods, where the fractured reservoirs are modeled by the dual porosity method. An efficient numerical scheme is presented, including the finite difference (volume) method, CPR-FPF preconditioners for linear systems, and effective decoupling methods. Parallel computing techniques employed in the simulation of the two-phase flow are also presented. Using this numerical scheme and these parallel techniques, a parallel reservoir simulator is developed that is capable of simulating large-scale reservoir models. The numerical results show that the simulator is accurate compared to commercial software, that it is scalable, and that the numerical scheme is effective.