EURO-PAR 2021: 27TH INTERNATIONAL EUROPEAN CONFERENCE ON PARALLEL AND DISTRIBUTED COMPUTING
PROGRAM FOR WEDNESDAY, SEPTEMBER 1ST

11:00-12:30 Session 12: Regular Papers 1: Parallel Methods & Applications
11:00
A GPU Architecture Aware Fine-Grain Pruning Technique for Deep Neural Networks
PRESENTER: Hoeseok Yang

ABSTRACT. The model size and computation requirements of Deep Convolutional Neural Networks (DNNs) keep increasing as their applications to various real-life use-cases, e.g., autonomous driving, become more pervasive and popular. While DNN workloads are executed on Graphics Processing Units (GPUs) in many cases, it is not trivial to improve inference speed through conventional DNN weight pruning, due to the parallel architecture of GPUs. On the other hand, coarse-grain pruning, also known as structured sparsity or structured pruning, can speed up inference but causes a significant loss of accuracy. In this paper, we propose two fine-grain DNN pruning techniques that are aware of the underlying GPU architecture. The hierarchical architecture of the GPU's parallel processing elements and memory is analyzed to enable the finest possible pruning in which the removed weights can be safely skipped during inference. The effectiveness of the proposed techniques has been evaluated with VGG16. Compared to existing pruning techniques, the proposed methods result in significantly improved inference speed with a smaller accuracy drop.
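
As a rough illustration of the general idea (not the paper's specific technique, and with an arbitrary group size and keep count), the sketch below applies magnitude pruning within fixed-size weight groups, so every group retains the same number of nonzeros and a GPU kernel could skip the removed weights without load imbalance.

    # Minimal sketch, assuming only numpy: balanced per-group magnitude pruning.
    import numpy as np

    def group_prune(weights, group_size=32, keep=8):
        """Zero all but the `keep` largest-magnitude weights in each group."""
        flat = weights.reshape(-1, group_size).copy()
        # indices of the (group_size - keep) smallest-magnitude entries per group
        drop = np.argsort(np.abs(flat), axis=1)[:, :group_size - keep]
        np.put_along_axis(flat, drop, 0.0, axis=1)
        return flat.reshape(weights.shape)

    w = np.random.randn(64, 128).astype(np.float32)   # stand-in for one layer's weights
    w_pruned = group_prune(w)                         # 75% sparsity, balanced per group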

11:20
Mixed Precision Incomplete and Factorized Sparse Approximate Inverse Preconditioning on GPUs
PRESENTER: Fritz Goebel

ABSTRACT. In this work we present highly efficient mixed precision GPU implementations of an Incomplete Sparse Approximate Inverse (ISAI) preconditioner for general non-symmetric matrices and a Factorized Sparse Approximate Inverse (FSAI) preconditioner for symmetric positive definite matrices. While keeping all arithmetic operations in full double precision, we showcase the benefit of reducing the precision of the floating-point format in which the preconditioner is stored, in order to reduce memory volume and therefore preconditioner application time.
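
A minimal sketch of the storage-precision idea (assumed, not the authors' GPU code; the Jacobi-style factor is only a stand-in for an ISAI/FSAI preconditioner): the preconditioner is stored in single precision to reduce memory volume, but promoted to double precision when applied, so all arithmetic remains in double.

    # Sketch assuming numpy/scipy; M is a toy diagonal approximate inverse.
    import numpy as np
    import scipy.sparse as sp

    def store_low_precision(M):
        """Down-cast the stored preconditioner values to float32."""
        M32 = M.copy()
        M32.data = M32.data.astype(np.float32)
        return M32

    def apply_preconditioner(M32, r):
        """Apply the stored preconditioner with all arithmetic in float64."""
        return M32.astype(np.float64) @ r

    A = sp.random(1000, 1000, density=0.01, format="csr") + sp.eye(1000)
    M32 = store_low_precision(sp.diags(1.0 / A.diagonal()).tocsr())
    y = apply_preconditioner(M32, np.random.rand(1000))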

11:40
Outsmarting the Atmospheric Turbulence for Ground-Based Telescopes Using the Stochastic Levenberg-Marquardt Method
PRESENTER: Yuxi Hong

ABSTRACT. One of the main challenges for ground-based optical astronomy is to compensate for atmospheric turbulence in near real-time. The goal is to obtain images as close as possible to the diffraction limit of the telescope. This challenge is addressed on the latest generation of giant optical telescopes by deploying multi-conjugate adaptive optics (MCAO) systems performing predictive tomography of the turbulence and multi-layer compensation. Such complex systems require a high-fidelity estimate of the turbulence profile above the telescope, to be updated regularly during operations as turbulence conditions evolve. In this paper, we modify the traditional Levenberg-Marquardt (LM) algorithm by considering stochastically chosen subsystems of the full problem to identify the required parameters efficiently, while coping with the real-time challenge. While LM operates on the full set of data samples, the resulting Stochastic LM (SLM) method randomly selects subsamples to compute corresponding approximate gradients and Hessians. Hence, SLM reduces the algorithmic complexity per iteration and shortens the overall time to solution, while maintaining LM's numerical robustness. We present a new convergence analysis for SLM, implement the algorithm with optimized GPU kernels, and deploy it on shared-memory systems with multiple GPU accelerators. We assess SLM on adaptive optics system configurations in the context of the MCAO-Assisted Visible Imager & Spectrograph (MAVIS) instrument for the Very Large Telescope (VLT). We demonstrate the performance superiority of SLM over the traditional LM algorithm and classical stochastic first-order methods. At the scale of the VLT AO, SLM finishes the optimization process and accurately retrieves the parameters (e.g., turbulence strength and wind speed profiles) in less than a second using up to eight NVIDIA A100 GPUs, which permits high-acuity real-time throughput over a night of observations.
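
A toy sketch of the stochastic Levenberg-Marquardt idea on a linear least-squares fit (illustrative only; the MAVIS tomography problem and its GPU kernels are far more involved): each iteration builds the gradient and damped Gauss-Newton Hessian from a random subsample of the data.

    # Sketch assuming numpy; the fitted model y = a*x + b is a placeholder problem.
    import numpy as np

    def slm_step(params, residual_fn, jacobian_fn, indices, lam):
        """One LM update computed only on the sampled subsystem `indices`."""
        r = residual_fn(params)[indices]
        J = jacobian_fn(params)[indices, :]
        H = J.T @ J + lam * np.eye(len(params))   # damped Gauss-Newton Hessian
        g = J.T @ r
        return params - np.linalg.solve(H, g)

    x = np.linspace(0.0, 1.0, 10000)
    y = 2.0 * x + 1.0 + 0.01 * np.random.randn(x.size)
    res = lambda p: p[0] * x + p[1] - y
    jac = lambda p: np.stack([x, np.ones_like(x)], axis=1)

    p = np.zeros(2)
    for _ in range(50):
        batch = np.random.choice(x.size, size=256, replace=False)
        p = slm_step(p, res, jac, batch, lam=1e-3)   # p converges towards [2, 1]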

12:00
GPU Accelerated Mahalanobis-average Hierarchical Clustering Analysis
PRESENTER: Adam Šmelko

ABSTRACT. Hierarchical clustering algorithms are common tools for simplifying, exploring and analyzing datasets in many areas of research. For flow cytometry, a specific variant of agglomerative clustering has been proposed that uses cluster linkage based on the Mahalanobis distance to produce results better suited to the domain. However, the wide applicability of this clustering algorithm is currently limited by its relatively high computational complexity, which does not allow it to scale to common cytometry datasets. This paper proposes an optimized GPU-accelerated version of the Mahalanobis-average linked hierarchical clustering, which improves the algorithm's performance by over two orders of magnitude, thus allowing it to scale to much larger datasets. It also provides a detailed analysis of the optimizations used and experimental results that may be useful for other hierarchical-clustering problems. We have performed benchmarks on publicly available high-dimensional flow cytometry data, which demonstrate the applicability of our implementation in the target domain.
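
A naive, single-threaded sketch of Mahalanobis-average agglomerative clustering (for illustration only; the paper's contribution is the optimized GPU version): the cluster-to-cluster dissimilarity averages Mahalanobis distances computed with each cluster's own covariance, falling back to Euclidean distance while a cluster is too small for an invertible covariance.

    # Sketch assuming numpy; data sizes are toy placeholders.
    import numpy as np

    def mahalanobis_linkage(a_points, b_points):
        """Average Mahalanobis distance of b's points with respect to cluster a."""
        mu = a_points.mean(axis=0)
        if len(a_points) <= a_points.shape[1]:        # covariance not yet invertible
            return np.linalg.norm(b_points - mu, axis=1).mean()
        cov_inv = np.linalg.inv(np.cov(a_points, rowvar=False))
        diff = b_points - mu
        return np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff)).mean()

    def cluster(points, n_clusters):
        clusters = [points[i:i + 1] for i in range(len(points))]
        while len(clusters) > n_clusters:
            best = (np.inf, None, None)
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    d = 0.5 * (mahalanobis_linkage(clusters[i], clusters[j])
                               + mahalanobis_linkage(clusters[j], clusters[i]))
                    if d < best[0]:
                        best = (d, i, j)
            _, i, j = best
            clusters[i] = np.vstack([clusters[i], clusters.pop(j)])  # merge closest pair
        return clusters

    data = np.random.rand(60, 4)            # tiny stand-in for a cytometry dataset
    result = cluster(data, n_clusters=3)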

12:30-14:00 Lunch Break
14:00-15:30 Session 13: Regular Papers 2: Architectures & Accelerators
14:00
PrioRAT: Criticality-Driven Prioritization Inside the On-Chip Memory Hierarchy
PRESENTER: Vladimir Dimić

ABSTRACT. The ever-increasing gap between processor and main memory speeds requires careful utilization of the limited memory resources. This is especially pronounced for memory-bound applications. Prioritization of memory requests in the memory controller is one approach to improving the performance of such codes. However, current designs do not consider high-level information about parallel applications.

In this paper, we propose a holistic approach to this problem, in which the runtime system's knowledge is made available to the hardware so that memory-request prioritization decisions can be optimized at a negligible hardware cost. Our design is based on the notion of the critical path in the execution of a parallel code. The critical tasks are accelerated by prioritizing their memory requests within the on-chip memory hierarchy. As a result, we reduce the critical path and improve overall performance by up to 1.19x compared to commercially available systems.
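
A purely conceptual sketch of the prioritization policy (not the proposed hardware design): memory requests tagged by the runtime as belonging to critical-path tasks are dequeued before non-critical ones, with FIFO order preserved within each priority class.

    # Sketch assuming only the Python standard library.
    import heapq
    import itertools

    class PrioritizedRequestQueue:
        def __init__(self):
            self._heap = []
            self._seq = itertools.count()      # keeps FIFO order within a priority

        def push(self, address, critical):
            priority = 0 if critical else 1    # critical-path requests served first
            heapq.heappush(self._heap, (priority, next(self._seq), address))

        def pop(self):
            return heapq.heappop(self._heap)[2]

    q = PrioritizedRequestQueue()
    q.push(0x1000, critical=False)
    q.push(0x2000, critical=True)              # marked critical by the runtime system
    assert q.pop() == 0x2000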

14:20
Optimized Implementation of the HPCG Benchmark on Reconfigurable Hardware
PRESENTER: Alberto Zeni

ABSTRACT. The HPCG benchmark represents a modern complement to the HPL benchmark in the performance evaluation of HPC systems, as it has been recognized as more representative of real-world applications. While typical workloads become more and more challenging, the semiconductor industry is battling with performance scaling and power efficiency on next-generation technology nodes. As a result, the industry is turning towards more customized compute architectures to help meet the latest performance requirements. In this paper, we present the details of the first FPGA-based implementation of HPCG that takes advantage of such customized compute architectures. Our results show that our high-performance multi-FPGA implementation, using 1 and 4 Xilinx Alveo U280 cards, achieves up to 108.3 GFlops and 346.5 GFlops respectively, representing speed-ups of 104.1x and 333.2x over software running on a server with an Intel Xeon processor, with no loss of accuracy. We also demonstrate that the FPGA-based solution achieves performance comparable to modern GPUs and an up to 2.7x improvement in power efficiency compared to an NVIDIA Tesla V100. Finally, a theoretical evaluation based on Berkeley's Roofline model demonstrates that our implementation is near-optimally tuned on the Xilinx Alveo U280.
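
For reference, the Roofline bound mentioned above reduces to a one-line formula: attainable performance is the minimum of the peak compute rate and memory bandwidth times arithmetic intensity. The numbers in the sketch below are placeholders, not the Alveo U280's actual specifications.

    # Minimal Roofline-model sketch; peak and bandwidth values are placeholders.
    def roofline_gflops(arithmetic_intensity, peak_gflops, bandwidth_gbs):
        """Attainable GFlop/s for a kernel with the given flops-per-byte ratio."""
        return min(peak_gflops, bandwidth_gbs * arithmetic_intensity)

    # Sparse matrix-vector products sit well below 1 flop/byte, so they are
    # bandwidth-bound on almost any device.
    print(roofline_gflops(arithmetic_intensity=0.25, peak_gflops=500, bandwidth_gbs=400))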

14:40
Exploiting co-execution with oneAPI: heterogeneity from a modern perspective
PRESENTER: Raúl Nozal

ABSTRACT. Programming heterogeneous systems efficiently is a major challenge, due to the complexity of their architectures. Intel oneAPI, a new and powerful standards-based unified programming model built on top of SYCL, addresses these issues. In this paper, oneAPI is extended with co-execution strategies to run the same kernel across different devices, enabling the exploitation of static and dynamic policies. On top of that, static and dynamic load-balancing algorithms are integrated and analyzed. This work evaluates the performance and energy efficiency of a well-known set of regular and irregular HPC benchmarks, using an integrated GPU and a CPU. Experimental results show that co-execution is worthwhile when using dynamic algorithms, and improves efficiency even more when using unified shared memory.
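
A conceptual sketch of the dynamic co-execution policy (not the authors' SYCL/oneAPI implementation): two workers standing in for the CPU and the integrated GPU repeatedly pull fixed-size chunks of the iteration space from a shared queue until it is drained, so the faster device naturally processes more of the work.

    # Sketch assuming only the Python standard library; the kernel is a placeholder.
    import threading
    import queue

    def co_execute(n_items, kernel, chunk=1024):
        chunks = queue.Queue()
        for start in range(0, n_items, chunk):
            chunks.put((start, min(start + chunk, n_items)))

        def worker(device_name):
            while True:
                try:
                    start, end = chunks.get_nowait()
                except queue.Empty:
                    return
                kernel(device_name, start, end)    # run this range on the device

        threads = [threading.Thread(target=worker, args=(d,)) for d in ("cpu", "gpu")]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    co_execute(10_000, lambda dev, s, e: None)     # placeholder kernel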

15:30-16:30 Session 14: Keynote I
15:30
Big Data and Extreme-Scales: Computational Science in the 21st Century

ABSTRACT. Extreme scales and big data are essential to computational and data-enabled science and engineering in the 21st century, promising dramatic new insights into natural and engineered systems. However, data-related challenges are limiting the potential impact of application workflows enabled by current and emerging extreme-scale, high-performance, distributed computing environments. These data-intensive application workflows involve dynamic coordination, interactions and data coupling between multiple application processes that run at scale on different resources, along with services for monitoring, analysis, visualization and archiving, and they present challenges due to increasing data volumes and complex data-coupling patterns, system energy constraints, increasing failure rates, etc. In this talk I will explore some of these challenges and investigate how solutions based on data-sharing abstractions, managed data pipelines, data-staging services, and in-situ / in-transit data placement and processing can be used to help address them. This research is part of the DataSpaces project at the Scientific Computing and Imaging (SCI) Institute, University of Utah.

16:30-18:00 Session 15: Regular Papers 3: Machine Learning & Applications
16:30
Designing a 3D Parallel Memory-Aware Lattice Boltzmann Algorithm on Manycore Systems
PRESENTER: Yuankun Fu

ABSTRACT. The lattice Boltzmann method (LBM) is a promising approach to solving Computational Fluid Dynamics (CFD) problems; however, it is often considered to be memory-bound on modern computer architectures. This paper introduces novel sequential and parallel 3D memory-aware LBM algorithms to optimize its memory access performance. The new algorithms combine features of single-copy distribution, single sweep, the swap algorithm, prism traversal, and merging of two temporal time steps. We also design a methodology to guarantee thread safety and reduce synchronizations in the parallel LBM algorithm. Finally, we evaluate their performance on three manycore systems and show that the new 3D memory-aware LBM algorithms outperform the state-of-the-art Palabos implementation (which implements the Fuse Swap Prism LBM solver) by up to 89%.
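
One of the listed ingredients, blocked (prism-style) traversal, can be sketched in isolation as below (illustrative only; the actual algorithm merges LBM sweeps and two time steps inside the blocks): the 3D lattice is visited tile by tile so each tile's data stays resident in cache.

    # Sketch assuming numpy; the per-tile update is a placeholder for an LBM sweep.
    import numpy as np

    def blocked_update(grid, tile_fn, bx=8, by=8, bz=64):
        nx, ny, nz = grid.shape
        out = np.empty_like(grid)
        for x0 in range(0, nx, bx):
            for y0 in range(0, ny, by):
                for z0 in range(0, nz, bz):
                    block = (slice(x0, min(x0 + bx, nx)),
                             slice(y0, min(y0 + by, ny)),
                             slice(z0, min(z0 + bz, nz)))
                    out[block] = tile_fn(grid, block)   # update one cache-sized tile
        return out

    grid = np.random.rand(32, 32, 128)
    result = blocked_update(grid, lambda g, b: g[b] * 0.5)   # placeholder update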

16:50
Fault-tolerant LU factorisation is low cost

ABSTRACT. At large scale, failures are statistically frequent and need to be taken into account. Tolerating failures has arisen as a major challenge in parallel computing because, as the size of systems grows, failures become more common and some computation units are expected to fail during the execution of a program. Algorithms used in these programs must be scalable, while being resilient to hardware failures that will happen during the execution. In this paper, we present an algorithm that takes advantage of intrinsic properties of scalable communication-avoiding LU algorithms in order to make them fault-tolerant and to proceed with the computation in spite of failures. We evaluate the overhead of the fault-tolerance mechanisms with respect to failure-free execution on both tall-and-skinny matrices (TSLU) and square matrices (CALU), as well as the cost of a failure during execution.
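
As a generic illustration of checksum-based recovery (a standard algorithm-based fault-tolerance idea, not necessarily the mechanism used in this paper): the matrix is extended with a checksum row of column sums, so a single lost row can be rebuilt from the surviving rows.

    # Sketch assuming numpy; matrix size and lost-row index are placeholders.
    import numpy as np

    def add_checksum(A):
        return np.vstack([A, A.sum(axis=0)])        # append a checksum row

    def recover_row(Ac, lost):
        rows = [i for i in range(Ac.shape[0] - 1) if i != lost]
        return Ac[-1] - Ac[rows].sum(axis=0)        # checksum minus surviving rows

    A = np.random.rand(6, 6)
    Ac = add_checksum(A)
    rebuilt = recover_row(Ac, lost=2)
    assert np.allclose(rebuilt, A[2])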

17:10
Efficient and Systematic Partitioning of Large and Deep Neural Networks for Parallelization
PRESENTER: Haoran Wang

ABSTRACT. Deep neural networks (DNNs) are playing an increasingly important role in our daily life. Since the size of DNNs is continuously growing, it is highly important to train them efficiently by distributing the computation over multiple connected devices. The efficiency of training depends on the quality of the chosen parallelization strategy, and finding a good parallelization strategy for a DNN in a reasonable amount of time is not trivial. Previous research demonstrated the possibility of systematically generating good parallelization strategies. However, systematic partitioning still suffers from either heavy preprocessing or poor parallelization quality. In this paper, we take a purely symbolic analysis approach by leveraging features of DNNs such as dense tensors and balanced computation. We propose the Flex-Edge Recursive Graph and the Double Recursive Algorithm, successfully limiting parallelization strategy generation to linear complexity while preserving the quality of the generated strategies. The experiments show that our solution significantly reduces the parallelization strategy generation time from hours to seconds while maintaining the parallelization quality.
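
A toy illustration of symbolic, cost-model-driven strategy selection (not the paper's Flex-Edge Recursive Graph or Double Recursive Algorithm): for a single dense layer, one can compare the communication volume implied by splitting the batch dimension versus splitting the output features, and pick the cheaper option.

    # Sketch with an assumed, highly simplified cost model.
    def comm_volume(batch, d_in, d_out, split):
        if split == "batch":        # data parallel: weight gradients are all-reduced
            return d_in * d_out
        else:                       # output-feature split: activations are gathered
            return batch * d_out

    def choose_split(batch, d_in, d_out):
        costs = {s: comm_volume(batch, d_in, d_out, s) for s in ("batch", "feature")}
        return min(costs, key=costs.get)

    print(choose_split(batch=64, d_in=4096, d_out=4096))     # -> "feature"
    print(choose_split(batch=8192, d_in=256, d_out=256))     # -> "batch"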

17:30
Towards Flexible and Compiler-Friendly Layer Fusion for CNNs on Multicore CPUs
PRESENTER: Zhongyi Lin

ABSTRACT. In deep learning pipelines, we demonstrate the performance benefits and tradeoffs of combining two convolution layers into a single layer on multicore CPUs. We analyze when and why fusion may result in runtime speedups, and study three types of layer fusion: (a) 3-by-3 depthwise convolution with 1-by-1 convolution, (b) 3-by-3 convolution with 1-by-1 convolution, and (c) two 3-by-3 convolutions. We show that whether fusion is beneficial depends on numerous factors, including arithmetic intensity, machine balance, memory footprints, memory access pattern, and the way the output tensor is tiled. We devise a schedule for all these fusion types to automatically generate fused kernels for multicore CPUs through auto-tuning. With more than 30 layers extracted from five CNNs, we achieve a 1.04x geomean with 1.44x max speedup against separate kernels from MKLDNN, and a 1.24x geomean with 2.73x max speedup against AutoTVM-tuned separate kernels in standalone kernel benchmarks. We also show a 1.09x geomean with 1.29x max speedup against TVM, and a 2.09x geomean with 3.35x max speedup against MKLDNN-backed PyTorch in end-to-end inference tests.
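
A simplified sketch of fusion type (a) above (illustrative only, with arbitrary tensor shapes and a naive loop nest): the 3-by-3 depthwise convolution result for each output row is consumed immediately by the 1-by-1 convolution, so the intermediate tensor never needs to be written back to memory.

    # Sketch assuming numpy; valid padding, stride 1, no auto-tuning or vectorization.
    import numpy as np

    def fused_dw3x3_pw1x1(x, dw, pw):
        """x: (C, H, W), dw: (C, 3, 3), pw: (K, C)."""
        C, H, W = x.shape
        K = pw.shape[0]
        out = np.zeros((K, H - 2, W - 2))
        for i in range(H - 2):
            # depthwise result for one output row only (the fused intermediate)
            row = np.zeros((C, W - 2))
            for c in range(C):
                for di in range(3):
                    for dj in range(3):
                        row[c] += dw[c, di, dj] * x[c, i + di, dj:dj + W - 2]
            out[:, i, :] = pw @ row        # 1-by-1 convolution consumes it right away
        return out

    x = np.random.rand(16, 34, 34)
    y = fused_dw3x3_pw1x1(x, np.random.rand(16, 3, 3), np.random.rand(32, 16))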