11:00 | A GPU Architecture Aware Fine-Grain Pruning Technique for Deep Neural Networks PRESENTER: Hoeseok Yang ABSTRACT. The model size and computation requirements of Deep Convolutional Neural Networks (DNNs) keep growing as their applications to various real-life use cases, e.g., autonomous driving, become more pervasive and popular. While DNN workloads are executed on Graphics Processing Units (GPUs) in many cases, it is not trivial to improve inference speed through conventional DNN weight pruning, due to the parallel architecture of GPUs. On the other hand, coarse-grain pruning, also known as structured sparsity or structured pruning, can speed up inference but causes significant accuracy loss. In this paper, we propose two fine-grain DNN pruning techniques that are aware of the underlying GPU architecture. The hierarchical organization of the GPU's parallel processing elements and memory is analyzed to enable the finest possible pruning in which the removed weights can be safely skipped during inference. The effectiveness of the proposed techniques has been evaluated with VGG16. Compared to existing pruning techniques, the proposed methods result in significantly improved inference speed with a smaller accuracy drop.
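To illustrate the general idea of aligning fine-grain sparsity to a hardware granularity (not the paper's specific technique), the following sketch prunes weights per fixed-size group so the surviving non-zeros stay aligned to a hypothetical GPU transaction width; GROUP_SIZE and KEEP_RATIO are illustrative assumptions.

```python
# Hypothetical sketch: magnitude-based pruning applied per fixed-size group,
# so that surviving non-zeros stay aligned to a hardware-friendly granularity.
# GROUP_SIZE and KEEP_RATIO are illustrative assumptions, not values from the paper.
import numpy as np

GROUP_SIZE = 32    # e.g., one group per GPU warp / memory transaction (assumption)
KEEP_RATIO = 0.25  # fraction of weights kept inside each group (assumption)

def group_prune(weights: np.ndarray) -> np.ndarray:
    """Zero out the smallest-magnitude weights within each contiguous group."""
    flat = weights.reshape(-1, GROUP_SIZE)          # assumes size divisible by GROUP_SIZE
    keep = max(1, int(GROUP_SIZE * KEEP_RATIO))
    # Indices of the `keep` largest-magnitude entries in every group.
    top = np.argpartition(np.abs(flat), -keep, axis=1)[:, -keep:]
    mask = np.zeros_like(flat, dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)
    return (flat * mask).reshape(weights.shape)

if __name__ == "__main__":
    w = np.random.randn(4, 64).astype(np.float32)
    pruned = group_prune(w)
    print("sparsity:", 1.0 - np.count_nonzero(pruned) / pruned.size)
```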
11:20 | Mixed Precision Incomplete and Factorized Sparse Approximate Inverse Preconditioning on GPUs PRESENTER: Fritz Goebel ABSTRACT. In this work we present highly efficient mixed precision GPU implementations of an Incomplete Sparse Approximate Inverse (ISAI) preconditioner for general non-symmetric matrices and a Factorized Sparse Approximate Inverse (FSAI) preconditioner for symmetric positive definite matrices. While keeping all arithmetic operations in full double precision, we showcase the benefit of reducing the precision of the floating-point format in which the preconditioner is stored, in order to reduce memory volume and therefore preconditioner application time.
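A minimal sketch of the storage idea described above, assuming nothing about the actual ISAI/FSAI kernels: the preconditioner is kept in single precision (halving its memory volume) while all arithmetic is carried out in double precision; the matrices below are random stand-ins.

```python
# All arithmetic stays in double precision, but the (approximate inverse)
# preconditioner is stored in a lower-precision format to cut memory traffic.
# In a real GPU kernel the conversion happens per element in registers; here it
# is done in one go for clarity.
import numpy as np

rng = np.random.default_rng(0)
n = 256
A = np.eye(n) + 0.01 * rng.standard_normal((n, n))   # toy system matrix
M_double = np.linalg.inv(A)                          # stand-in for an ISAI/FSAI preconditioner

# Store the preconditioner in single precision (half the memory volume).
M_stored = M_double.astype(np.float32)

def apply_preconditioner(r: np.ndarray) -> np.ndarray:
    """Apply M to a residual: convert on the fly, accumulate in double."""
    return M_stored.astype(np.float64) @ r           # arithmetic in float64

r = rng.standard_normal(n)
z_full = M_double @ r
z_mixed = apply_preconditioner(r)
print("relative deviation:", np.linalg.norm(z_full - z_mixed) / np.linalg.norm(z_full))
```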
11:40 | Outsmarting the Atmospheric Turbulence for Ground-Based Telescopes Using the Stochastic Levenberg-Marquardt Method PRESENTER: Yuxi Hong ABSTRACT. One of the main challenges for ground-based optical astronomy is to compensate for atmospheric turbulence in near real-time. The goal is to obtain images as close as possible to the diffraction limit of the telescope. This challenge is addressed on the latest generation of giant optical telescopes by deploying multi-conjugate adaptive optics (MCAO) systems performing predictive tomography of the turbulence and multi-layer compensation. Such complex systems require a high fidelity estimate of the turbulence profile above the telescope, to be updated regularly during operations as turbulence conditions evolve. In this paper, we modify the traditional Levenberg-Marquardt (LM) algorithm by considering stochastically chosen subsystems of the full problem to identify the required parameters efficiently, while coping with the real-time challenge. While LM operates on the full set data samples, the resulting Stochastic LM (SLM) method randomly selects subsamples to compute corresponding approximate gradients and Hessians. Hence, SLM reduces the algorithmic complexity per iteration and shortens the overall time to solution, while maintaining LM's numerical robustness. We present a new convergence analysis for SLM, implement the algorithm with optimized GPU kernels, and deploy it on shared-memory systems with multiple GPU accelerators. We assess SLM in the adaptive optics system configurations in the context of the MCAO-Assisted Visible Imager \& Spectrograph (MAVIS) instrument for the Very Large Telescope (VLT). We demonstrate performance superiority of SLM over the traditional LM algorithm and the classical stochastic first-order methods. At the scale of VLT AO, SLM finishes the optimization process and accurately retrieves the parameters (e.g., turbulence strength and wind speed profiles) in less than a second using up to eight NVIDIA A100 GPUs, which permits high acuity real-time throughput over a night of observations. |
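A conceptual sketch of one stochastic Levenberg-Marquardt iteration on a toy least-squares problem (an exponential fit), illustrating only the subsampling of residuals described in the abstract; the problem, batch size, and damping value are assumptions, not the paper's adaptive optics formulation.

```python
# One SLM-style iteration: pick a random subset of residuals, build the
# approximate gradient and damped Gauss-Newton Hessian from that subset only,
# and take the usual LM step.
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 2.0, 2000)
y = 3.0 * np.exp(-1.5 * t) + 0.01 * rng.standard_normal(t.size)  # synthetic data

def residual_and_jacobian(p, idx):
    a, b = p
    e = np.exp(-b * t[idx])
    r = a * e - y[idx]
    J = np.stack([e, -a * t[idx] * e], axis=1)   # d r / d (a, b)
    return r, J

p = np.array([1.0, 1.0])     # initial guess
lam = 1e-2                   # damping parameter (assumption)
batch = 200                  # subsample size per iteration (assumption)

for _ in range(50):
    idx = rng.choice(t.size, size=batch, replace=False)   # stochastic subset
    r, J = residual_and_jacobian(p, idx)
    H = J.T @ J + lam * np.eye(2)     # approximate (damped) Hessian on the subset
    g = J.T @ r                       # approximate gradient on the subset
    p = p - np.linalg.solve(H, g)     # LM step

print("estimated (a, b):", p)         # should approach (3.0, 1.5)
```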
12:00 | GPU Accelerated Mahalanobis-average Hierarchical Clustering Analysis PRESENTER: Adam Šmelko ABSTRACT. Hierarchical clustering algorithms are common tools for simplifying, exploring and analyzing datasets in many areas of research. For flow cytometry, a specific variant of agglomerative clustering has been proposed that uses cluster linkage based on the Mahalanobis distance to produce results better suited to the domain. However, the wide applicability of this clustering algorithm is currently limited by its relatively high computational complexity, which does not allow it to scale to common cytometry datasets. This paper proposes an optimized GPU-accelerated version of the Mahalanobis-average linked hierarchical clustering, which improves the algorithm's performance by over two orders of magnitude, thus allowing it to scale to much larger datasets. It also provides a detailed analysis of the optimizations used and experimental results that may be useful for other hierarchical-clustering problems. We have performed benchmarks on publicly available high-dimensional data from flow cytometry which demonstrate the applicability of our implementation in the target domain.
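The sketch below conveys the flavour of a Mahalanobis-based cluster linkage: each point of one cluster is measured against the other cluster under that cluster's covariance, and the per-point distances are averaged. The exact linkage used in the paper may differ in detail; the regularization term and symmetrization are illustrative choices.

```python
import numpy as np

def mahalanobis_to_cluster(points: np.ndarray, cluster: np.ndarray) -> np.ndarray:
    """Mahalanobis distance of each row of `points` to `cluster` (mean + covariance)."""
    mu = cluster.mean(axis=0)
    cov = np.cov(cluster, rowvar=False) + 1e-6 * np.eye(cluster.shape[1])  # regularized
    inv = np.linalg.inv(cov)
    d = points - mu
    return np.sqrt(np.einsum("ij,jk,ik->i", d, inv, d))

def linkage_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetrized Mahalanobis-average linkage between two clusters of points."""
    return 0.5 * (mahalanobis_to_cluster(a, b).mean() +
                  mahalanobis_to_cluster(b, a).mean())

rng = np.random.default_rng(2)
c1 = rng.normal(loc=0.0, scale=1.0, size=(100, 5))
c2 = rng.normal(loc=3.0, scale=1.0, size=(100, 5))
print("linkage(c1, c2) =", linkage_distance(c1, c2))
```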
14:00 | PrioRAT: Criticality-Driven Prioritization Inside the On-Chip Memory Hierarchy PRESENTER: Vladimir Dimić ABSTRACT. The ever-increasing gap between processor and main memory speeds requires careful utilization of the limited memory resources. This is especially pronounced for memory-bound applications. Prioritization of memory requests in the memory controller is one approach to improving the performance of such codes. However, current designs do not consider high-level information about parallel applications. In this paper, we propose a holistic approach to this problem, in which the runtime system's knowledge is made available to the hardware so it can make optimized decisions regarding the prioritization of memory requests at negligible hardware cost. Our design is based on the notion of the critical path in the execution of a parallel code. The critical tasks are accelerated by prioritizing their memory requests within the on-chip memory hierarchy. As a result, we reduce the critical path and improve overall performance by up to 1.19x compared to commercially available systems.
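A sketch of the runtime-side notion only: computing the critical path of a task dependency graph (the longest path by task cost) and flagging the tasks on it, whose memory requests could then be tagged as high priority. The hardware prioritization itself is not modeled here, and the task names and costs are made up.

```python
from functools import lru_cache

# Toy task DAG: task name -> (cost, successor tasks). Names and costs are made up.
TASKS = {
    "load":   (2, ["fft", "scale"]),
    "fft":    (8, ["reduce"]),
    "scale":  (3, ["reduce"]),
    "reduce": (4, []),
}

@lru_cache(maxsize=None)
def longest_from(task: str) -> int:
    """Length of the most expensive path starting at `task`."""
    cost, succs = TASKS[task]
    return cost + max((longest_from(s) for s in succs), default=0)

def critical_path() -> list:
    """Start at the task with the longest downstream work, then greedily
    follow the successor that preserves the longest remaining path."""
    path = [max(TASKS, key=longest_from)]
    while TASKS[path[-1]][1]:
        path.append(max(TASKS[path[-1]][1], key=longest_from))
    return path

critical = set(critical_path())
print("tasks whose memory requests would be tagged high-priority:", critical)
```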
14:20 | Optimized Implementation of the HPCG Benchmark on Reconfigurable Hardware PRESENTER: Alberto Zeni ABSTRACT. The HPCG benchmark represents a modern complement to the HPL benchmark in the performance evaluation of HPC systems, as it has been recognized as more representative of real-world applications. While typical workloads become more and more challenging, the semiconductor industry is battling with performance scaling and power efficiency on next-generation technology nodes. As a result, the industry is turning towards more customized compute architectures to help meet the latest performance requirements. In this paper, we present the details of the first FPGA-based implementation of HPCG that takes advantage of such customized compute architectures. Our results show that our high-performance multi-FPGA implementation, using 1 and 4 Xilinx Alveo U280 boards, achieves up to 108.3 GFlops and 346.5 GFlops respectively, representing speed-ups of 104.1x and 333.2x over software running on a server with an Intel Xeon processor, with no loss of accuracy. We also demonstrate that the FPGA-based solution achieves performance comparable to modern GPUs and up to a 2.7x improvement in power efficiency compared to an NVIDIA Tesla V100. Finally, a theoretical evaluation based on Berkeley's Roofline model demonstrates that our implementation is near-optimally tuned on the Xilinx Alveo U280.
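For reference, the Roofline bound mentioned above caps attainable throughput by the lower of the compute peak and the product of memory bandwidth and arithmetic intensity. The peak, bandwidth, and intensity values in this small worked example are placeholders, not the measured Alveo U280 or HPCG figures from the paper.

```python
# attainable GFLOP/s = min(peak compute, memory bandwidth x arithmetic intensity)
def roofline_gflops(peak_gflops: float, bandwidth_gbs: float, intensity_flop_per_byte: float) -> float:
    return min(peak_gflops, bandwidth_gbs * intensity_flop_per_byte)

# A memory-bound kernel (low arithmetic intensity) is capped by bandwidth.
print(roofline_gflops(peak_gflops=1000.0, bandwidth_gbs=400.0, intensity_flop_per_byte=0.25))  # -> 100.0
```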
14:40 | Exploiting co-execution with oneAPI: heterogeneity from a modern perspective PRESENTER: Raúl Nozal ABSTRACT. Programming heterogeneous systems efficiently is a major challenge due to the complexity of their architectures. Intel oneAPI, a new and powerful standards-based unified programming model built on top of SYCL, addresses these issues. In this paper, oneAPI is extended with co-execution strategies that run the same kernel across different devices, enabling the exploitation of static and dynamic policies. On top of that, static and dynamic load-balancing algorithms are integrated and analyzed. This work evaluates performance and energy efficiency for a well-known set of regular and irregular HPC benchmarks, using an integrated GPU and a CPU. Experimental results show that co-execution is worthwhile when using dynamic algorithms, and efficiency improves further when unified shared memory is used.
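A language-agnostic sketch of the dynamic co-execution idea: the iteration space of one kernel is split into chunks that two devices pull from a shared queue, so the faster device naturally takes a larger share. Real oneAPI co-execution is written in SYCL/C++ with device queues; the "devices" here are just threads with made-up per-item costs.

```python
import queue
import threading
import time

CHUNK = 1000
work_queue: "queue.Queue[range]" = queue.Queue()
for start in range(0, 100_000, CHUNK):
    work_queue.put(range(start, start + CHUNK))

completed = {"cpu": 0, "gpu": 0}

def device(name: str, seconds_per_item: float) -> None:
    """Pull chunks until the shared queue is empty (dynamic load balancing)."""
    while True:
        try:
            chunk = work_queue.get_nowait()
        except queue.Empty:
            return
        time.sleep(seconds_per_item * len(chunk))   # stand-in for kernel execution
        completed[name] += len(chunk)

threads = [threading.Thread(target=device, args=("cpu", 2e-6)),
           threading.Thread(target=device, args=("gpu", 5e-7))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("work split:", completed)   # the faster "gpu" ends up with the larger share
```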
16:30 | PRESENTER: Yuankun Fu ABSTRACT. The Lattice Boltzmann method (LBM) is a promising approach to solving Computational Fluid Dynamics (CFD) problems; however, it is often considered memory-bound on modern computer architectures. This paper introduces novel sequential and parallel 3D memory-aware LBM algorithms that optimize its memory access performance. The new algorithms combine features of single-copy distribution, single sweep, the swap algorithm, prism traversal, and the merging of two temporal time steps. We also design a methodology to guarantee thread safety and reduce synchronizations in the parallel LBM algorithm. Finally, we evaluate their performance on three manycore systems and show that the new 3D memory-aware LBM algorithms outperform the state-of-the-art Palabos (which implements the Fuse Swap Prism LBM solver) by up to 89%.
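The sketch below shows only one of the ingredients named in the abstract, blocked traversal of a 3D grid (the locality idea behind prism traversal), applied to a trivial stencil-style update. It does not implement the LBM collision/streaming steps or the merging of two time steps, and the block extents are illustrative.

```python
import numpy as np

NX = NY = NZ = 64
BX = BY = BZ = 16            # block (prism) extents -- illustrative values
grid = np.random.rand(NX, NY, NZ)
out = np.empty_like(grid)

for bx in range(0, NX, BX):          # visit one cache-friendly block at a time
    for by in range(0, NY, BY):
        for bz in range(0, NZ, BZ):
            block = grid[bx:bx+BX, by:by+BY, bz:bz+BZ]
            out[bx:bx+BX, by:by+BY, bz:bz+BZ] = 0.5 * block   # stand-in for the per-cell update
```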
16:50 | Fault-tolerant LU factorisation is low cost PRESENTER: Daniel Alberto Torres Gonzalez ABSTRACT. At large scale, failures are statistically frequent and need to be taken into account. Tolerating failures has become a major challenge in parallel computing because, as the size of systems grows, failures become more common and some computation units are expected to fail during the execution of a program. Algorithms used in these programs must be scalable while being resilient to the hardware failures that will happen during execution. In this paper, we present an algorithm that takes advantage of intrinsic properties of the scalable communication-avoiding LU algorithms in order to make them fault-tolerant and proceed with the computation in spite of failures. We evaluate the overhead of the fault tolerance mechanisms with respect to failure-free execution on both tall-and-skinny matrices (TSLU) and square matrices (CALU), as well as the cost of a failure during execution.
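As a generic illustration of algorithm-based fault tolerance, and explicitly not the paper's mechanism built on the intrinsic redundancy of communication-avoiding LU, the sketch below carries a column-sum checksum row alongside a matrix block so that a single lost row can be reconstructed after a failure.

```python
import numpy as np

rng = np.random.default_rng(3)
block = rng.standard_normal((4, 4))
checksum = block.sum(axis=0)                 # extra checksum row carried with the block

lost = 2                                     # pretend this row is lost in a failure
damaged = block.copy()
damaged[lost, :] = np.nan

# Reconstruct the lost row from the checksum and the surviving rows.
recovered = checksum - np.delete(damaged, lost, axis=0).sum(axis=0)
print("max reconstruction error:", np.abs(recovered - block[lost]).max())
```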
17:10 | Efficient and Systematic Partitioning of Large and Deep Neural Networks for Parallelization PRESENTER: Haoran Wang ABSTRACT. Deep neural networks (DNNs) play an increasingly important role in our daily life. Since the size of DNNs continues to grow, it is highly important to train them efficiently by distributing computation across multiple connected devices. The efficiency of training depends on the quality of the chosen parallelization strategy, and finding a good parallelization strategy for a DNN in a reasonable amount of time is not trivial. Previous research demonstrated the possibility of systematically generating good parallelization strategies. However, systematic partitioning still suffers from either heavy preprocessing or poor parallelization quality. In this paper, we take a purely symbolic analysis approach by leveraging features of DNNs such as dense tensors and balanced computation. We propose the Flex-Edge Recursive Graph and the Double Recursive Algorithm, successfully limiting our parallelization strategy generation to linear complexity while maintaining good parallelization quality. The experiments show that our solution significantly reduces the parallelization strategy generation time from hours to seconds while maintaining parallelization quality.
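A back-of-the-envelope illustration of why the per-layer partitioning choice matters: for one fully connected layer, compare the communication volume of a data-parallel split (the weight gradient is all-reduced) with an output-feature split (activations are gathered). This is a generic cost model with made-up layer sizes, not the paper's Flex-Edge Recursive Graph or Double Recursive Algorithm.

```python
def data_parallel_comm(batch: int, d_in: int, d_out: int) -> int:
    # Every device holds the full weight matrix; its gradient is all-reduced.
    return d_in * d_out

def feature_parallel_comm(batch: int, d_in: int, d_out: int) -> int:
    # Every device holds a slice of the output features; activations are gathered.
    return batch * d_out

batch, d_in, d_out = 32, 4096, 4096
print(f"data-parallel    : ~{data_parallel_comm(batch, d_in, d_out):,} elements moved per step")
print(f"feature-parallel : ~{feature_parallel_comm(batch, d_in, d_out):,} elements moved per step")
# For this small batch, the feature split moves far less data; the best choice
# changes with the layer shapes, which is why a per-layer strategy search pays off.
```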
17:30 | PRESENTER: Zhongyi Lin ABSTRACT. In deep learning pipelines, we demonstrate the performance benefits and tradeoffs of combining two convolution layers into a single layer on multicore CPUs. We analyze when and why fusion may result in runtime speedups, and study three types of layer fusion: (a) 3-by-3 depthwise convolution with 1-by-1 convolution, (b) 3-by-3 convolution with 1-by-1 convolution, and (c) two 3-by-3 convolutions. We show that whether fusion is beneficial depends on numerous factors, including arithmetic intensity, machine balance, memory footprints, memory access patterns, and the way the output tensor is tiled. We devise a schedule for all these fusion types to automatically generate fused kernels for multicore CPUs through auto-tuning. With more than 30 layers extracted from five CNNs, we achieve a 1.04x geomean with 1.44x max speedup against separate kernels from MKLDNN, and a 1.24x geomean with 2.73x max speedup against AutoTVM-tuned separate kernels in standalone kernel benchmarks. We also show a 1.09x geomean with 1.29x max speedup against TVM, and a 2.09x geomean with 3.35x max speedup against MKLDNN-backed PyTorch, in end-to-end inference tests.
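A toy illustration of fusion type (a): the 3-by-3 depthwise result for one output row is produced and immediately consumed by the 1-by-1 pointwise convolution, so the full intermediate tensor is never written to memory. The shapes and the row-wise tiling are illustrative; the paper's auto-tuned kernels choose tile shapes per layer and per machine.

```python
import numpy as np

C, F, H, W = 8, 16, 34, 34          # channels, filters, spatial size (made up)
X = np.random.rand(C, H, W).astype(np.float32)
K = np.random.rand(C, 3, 3).astype(np.float32)    # depthwise 3x3 weights
P = np.random.rand(F, C).astype(np.float32)       # pointwise 1x1 weights

OH, OW = H - 2, W - 2
Z = np.zeros((F, OH, OW), dtype=np.float32)

for i in range(OH):                               # one output row at a time
    # Depthwise 3x3 for this row only: shape (C, OW).
    row = np.zeros((C, OW), dtype=np.float32)
    for di in range(3):
        for dj in range(3):
            row += K[:, di, dj, None] * X[:, i + di, dj:dj + OW]
    # Immediately consume it with the 1x1 convolution: (F, C) @ (C, OW).
    Z[:, i, :] = P @ row

print("fused output shape:", Z.shape)
```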