Welcome Message
Keynote Talk 1
15:00 | Energy Efficient Frequency Scaling on GPUs in Heterogeneous HPC Systems ABSTRACT. With most major corporations and research institutions having pledged to support sustainability goals for High Performance Computing (HPC), energy efficiency is a critical factor when evaluating heterogeneous HPC systems. However, many popular hardware performance and energy measurement frameworks, such as LIKWID, and benchmarks, such as STREAM or hipBone, do not support, or only partially support, execution on heterogeneous systems containing AMD or NVIDIA Graphics Processing Units (GPUs), leaving a gap in the understanding of the relationship between frequency, performance and energy. We aim to close this gap by extending the performance measurement framework LIKWID to support both AMD and NVIDIA GPUs. We run the STREAM and hipBone benchmarks on AMD and NVIDIA GPUs at different GPU core frequencies. We show that the minimum period between two measurements for our GPU is at least 100 ms and that GPUs have a sweet spot with regard to energy consumption at approximately 75% of their maximum frequency, with energy savings of up to 30% at a performance overhead between 0.72% and 3.12%. |
15:30 | Dual-IS: Instruction Set Modality for Efficient Instruction Level Parallelism ABSTRACT. Exploiting instruction level parallelism (ILP) is a widely used method for increasing the performance of processors. While traditional very long instruction word (VLIW) processors can exploit ILP energy-efficiently thanks to static instruction scheduling, they suffer from poor code density in serial parts that cannot utilize the multi-issue capabilities. Transport triggered architecture (TTA) is a variation of the VLIW paradigm with an exposed datapath that improves scaling of VLIW processors at the cost of even wider instructions by exposing the datapath interconnection network to the programmer. To this end, we propose Dual-IS, an architecture that implements TTA multi-issue and RISC-V compatible single-issue instruction-set modes by means of a microcoded control path. By utilizing the instruction set modality, the TTA mode can be used in code that benefits from ILP, while a single-issue RISC-V ISA mode reduces instruction stream energy and code size for sequential programs. Thanks to the TTA programming model, the static multi-issue mode can be implemented without additional register file ports. The processor was synthesized on a 28 nm ASIC technology. For this design point, when the instruction-set mode was selected based on energy-delay product at the program granularity, with CHStone benchmarks, Dual-IS had on average 14% lower energy-delay product compared to a single-mode TTA processor with a similar datapath, while only adding a 3% overhead in the core area. Dual-IS achieved on average 15% and in the best case 33% shorter run times than a single-issue RISC-V implementation by running programs in the TTA mode only when it was beneficial in terms of performance. |
16:00 | Pasithea-1: An Energy-Efficient Self-Contained CGRA With RISC-Like ISA PRESENTER: Tobias Kaiser ABSTRACT. This paper presents Pasithea-1, an energy-efficient coarse-grained reconfigurable array (CGRA) with a RISC-like programming interface. In contrast to traditional RISC instruction sets, which are designed for centralized von Neumann architectures, Pasithea-1 applies RISC principles to design an instruction set for energy-efficient CGRAs. Similar to dataflow and in-place processing architectures such as TRIPS and DiAG, Pasithea-1 can execute complex application code without external control. To demonstrate its programming, mechanisms and examples for dataflow, control flow, subroutine calls, coroutines and multi-threading are shown. A microarchitecture implementing the instruction set is presented. To reduce switching activity, its instructions, once fetched, remain in the fabric, where they can be executed repeatedly without re-fetching. Pasithea-1 is compared against a minimal RISC system based on placed-and-routed designs, with energy and performance compared using netlist simulation. With a 10.1× energy reduction in the memory hierarchy and a 3.1× overall energy reduction, the described architecture surpasses the RISC system considerably in energy efficiency. |