HPCASIA 2019: INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING IN ASIA PACIFIC REGION, 2019
PROGRAM FOR WEDNESDAY, JANUARY 16TH, 2019

10:00-11:40 Session 4: Accelerators and Applications
10:00
Optimizing parallel GEMM routines using auto-tuning with Intel AVX-512

ABSTRACT. This paper presents optimized implementations of single- and double-precision general matrix-matrix multiplication (GEMM) routines for the Intel Xeon Phi processor code-named Knights Landing (KNL) and the Intel Xeon Scalable Processors, based on an auto-tuning approach using Intel AVX-512 intrinsic functions. Our auto-tuning approach precisely determines the parameters that reflect the target architectural features. It significantly reduces the search space and derives optimal parameter sets, including submatrix sizes, prefetch distances, loop unrolling depth, and the parallelization scheme. Without a single line of assembly code, our GEMM kernels achieve performance comparable to the Intel MKL and outperform other open-source BLAS libraries.
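
As an illustration of the kind of kernel such an auto-tuner instantiates, the following is a minimal double-precision AVX-512 micro-kernel sketch in C; the blocking factors MR and NR, the prefetch distance, and the packed panel layout are hypothetical tuning parameters chosen for illustration, not the values selected in the paper.

    #include <immintrin.h>

    #define MR 8        /* rows of C updated per call (one zmm of doubles) */
    #define NR 4        /* columns of C updated per call */
    #define PF_DIST 64  /* software-prefetch distance in bytes (tunable) */

    /* C[MR x NR] += A[MR x k] * B[k x NR]; A is packed column by column,
     * B is packed row by row, C is column-major with leading dimension ldc. */
    static void dgemm_microkernel(int k, const double *A, const double *B,
                                  double *C, int ldc)
    {
        __m512d c[NR];
        for (int j = 0; j < NR; ++j)
            c[j] = _mm512_loadu_pd(&C[j * ldc]);          /* load C micro-tile */

        for (int p = 0; p < k; ++p) {
            __m512d a = _mm512_loadu_pd(&A[p * MR]);      /* one packed column of A */
            _mm_prefetch((const char *)&A[p * MR] + PF_DIST, _MM_HINT_T0);
            for (int j = 0; j < NR; ++j) {
                __m512d b = _mm512_set1_pd(B[p * NR + j]); /* broadcast B element */
                c[j] = _mm512_fmadd_pd(a, b, c[j]);        /* c[j] += a * B[p][j] */
            }
        }

        for (int j = 0; j < NR; ++j)
            _mm512_storeu_pd(&C[j * ldc], c[j]);           /* write back C micro-tile */
    }

An auto-tuner of this kind would instantiate the kernel for a range of candidate blocking, prefetch, and unrolling parameters and time each variant on the target CPU to pick the best combination.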

10:25
Towards Real Time Multi-robot Routing using Quantum Computing Technologies

ABSTRACT. In this paper, we investigate potential approaches to solving the NP-hard problem of routing multiple robots on a grid in real time using quantum computing technologies. A hybrid quantum-classical approach is presented in detail, using classical compute for candidate path generation followed by quantum annealing for path selection, which is generally the most time-consuming part of routing multiple robots with classical compute. The performance has been benchmarked on a D-Wave 2000Q with up to 200 robots, showing that producing valid solutions for the NP-hard problem of multi-robot routing is achievable with current quantum annealing technology. The current limitations of using quantum annealing are also discussed.
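
For context, path selection over precomputed candidate paths is typically cast as a QUBO before it is handed to the annealer. One standard encoding (not necessarily the exact formulation used in the paper) introduces a binary variable x_{r,p} that is 1 if robot r is assigned candidate path p:

    \min_{x \in \{0,1\}^{R \times P}} \;
    \sum_{r=1}^{R} \sum_{p=1}^{P} c_{r,p}\, x_{r,p}
    \;+\; \lambda_1 \sum_{r=1}^{R} \Bigl( \sum_{p=1}^{P} x_{r,p} - 1 \Bigr)^{2}
    \;+\; \lambda_2 \sum_{(r,p) < (r',p')} V_{(r,p),(r',p')}\, x_{r,p}\, x_{r',p'}

Here c_{r,p} is the cost of path p for robot r, V counts grid-cell-and-time conflicts between two candidate paths, and the penalty weights \lambda_1, \lambda_2 enforce that each robot receives exactly one, conflict-free path. The quadratic terms map directly onto the couplers of the annealer.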

10:50
A Lightweight Method for Handling Control Divergence in GPGPUs

ABSTRACT. Recent graphics processing units (GPUs) have been widely used for scientific and high-performance acceleration in general-purpose computing, mainly owing to the SIMT (Single-Instruction, Multiple-Thread) execution model. With SIMT, GPUs can fully exploit the advantages of SIMD parallel computing. However, when threads in a warp do not follow the same execution path, control divergence arises and decreases resource utilization. Prior works suggest that warp regrouping can mitigate this impact to some degree, but we observe that not all warps can be regrouped effectively, because regrouping may introduce many unnecessary overheads that limit further performance improvement. In this paper, we propose a lightweight warp regrouping method, Partial Warp Regrouping (PWR), which avoids most unnecessary warp regrouping by setting thresholds and also reduces the complexity of the hardware design. Our experimental evaluations show that this mechanism improves performance by 12% on average.
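
To make the thresholding idea concrete, the following C sketch shows one plausible form of the regrouping test; the threshold value, the helper names, and the use of a 32-bit active-lane mask are assumptions for illustration and do not reproduce the paper's hardware mechanism.

    #include <stdint.h>

    #define WARP_SIZE 32
    #define REGROUP_THRESHOLD 16  /* hypothetical: regroup only if more than 16 lanes idle */

    /* Number of threads that actually execute on the current branch path,
     * taken from the warp's active-lane mask. */
    static int active_lanes(uint32_t active_mask)
    {
        return __builtin_popcount(active_mask);
    }

    /* Regrouping has bookkeeping cost, so only warps that lose more than the
     * threshold number of lanes to divergence are candidates for regrouping;
     * mildly divergent warps keep their original composition. */
    static int should_regroup(uint32_t active_mask)
    {
        int idle = WARP_SIZE - active_lanes(active_mask);
        return idle > REGROUP_THRESHOLD;
    }

Filtering out mildly divergent warps in this way reflects the intuition that, for them, the regrouping overhead outweighs the utilization gained.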

11:15
A Method for Order/Degree Problem Based on Graph Symmetry and Simulated Annealing with MPI/OpenMP Parallelization

ABSTRACT. The network topology of various systems, such as large-scale data centers, high-performance computing systems, and Networks on Chip, is strongly related to network latency. Designing a network topology with low latency can be defined as an order/degree problem (ODP) in graph theory by modeling the network topology as an undirected graph. This study proposes a method for efficiently solving ODPs based on graph symmetry and simulated annealing (SA). The method makes the network topology symmetrical, thereby improving the solution search performance of SA and drastically reducing the calculation time. The proposed method is applied to several problems from an international competition for ODPs called Graph Golf to find network topologies with sufficiently low latency. The symmetry-based calculation achieves a speedup of 31.76 times for one of the problems. Furthermore, to reduce calculation time, the proposed method is extended to use hybrid parallelization with MPI and OpenMP. As a result, a maximum speedup of 209.80 times was achieved on 20 compute nodes comprising 400 CPU cores. Even faster performance was achieved by combining the symmetry-based calculation and hybrid parallelization.
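
As an illustration of the hybrid parallelization (a sketch under assumed data structures, not the paper's code), the dominant cost in such a search is evaluating the average shortest path length of each candidate graph. The C fragment below distributes the BFS source vertices across MPI ranks and across OpenMP threads within a rank, assuming a degree-regular graph stored as a flat adjacency array.

    #include <mpi.h>
    #include <omp.h>
    #include <stdlib.h>

    /* Sum of BFS distances from one source in a degree-regular graph whose
     * neighbours are stored as adjacency[v * degree + j]. */
    static long bfs_distance_sum(const int *adjacency, int degree, int n, int source)
    {
        int *dist  = malloc(n * sizeof(int));
        int *queue = malloc(n * sizeof(int));
        for (int i = 0; i < n; ++i) dist[i] = -1;

        int head = 0, tail = 0;
        long sum = 0;
        dist[source] = 0;
        queue[tail++] = source;

        while (head < tail) {
            int v = queue[head++];
            sum += dist[v];
            for (int j = 0; j < degree; ++j) {
                int w = adjacency[v * degree + j];
                if (dist[w] < 0) { dist[w] = dist[v] + 1; queue[tail++] = w; }
            }
        }
        free(dist);
        free(queue);
        return sum;
    }

    /* Average shortest path length: MPI ranks take a strided subset of source
     * vertices, and OpenMP threads share each rank's subset. */
    double average_shortest_path_length(const int *adjacency, int degree, int n)
    {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        long local_sum = 0;
        #pragma omp parallel for reduction(+:local_sum) schedule(dynamic)
        for (int v = rank; v < n; v += size)
            local_sum += bfs_distance_sum(adjacency, degree, n, v);

        long global_sum = 0;
        MPI_Allreduce(&local_sum, &global_sum, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

        return (double)global_sum / ((double)n * (n - 1));  /* ordered vertex pairs */
    }

An SA driver would invoke an evaluation of this kind for every candidate edge swap, so reducing its cost, whether through the symmetry-based reduction or through parallelization, directly speeds up the overall search.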