Program for Wednesday, May 15th

PROGRAM FOR WEDNESDAY, MAY 15TH

Days:

09:00-10:00 Session 6: Keynote by Frank Hannig: "Co-Designing Processor Arrays and their Compiler – 'The whole is greater than the sum of the parts'"

Abstract

Parallel computing is ubiquitous and can be found in a wide variety of applications, from high-performance computing to embedded systems. A key factor across all these areas is energy efficiency, which denotes the number of computations that can be performed per unit of energy. Hence, customization and a tight co-design of architecture and compiler are crucial for scaling future systems further.

This talk presents tightly coupled processor arrays (TCPAs), a class of massively parallel arrays of locally interconnected processing elements (PEs), as well as corresponding compilation concepts. TCPAs differ from coarse-grained reconfigurable arrays (CGRAs) in that the PEs are programmable, utilizing small instruction memories. They allow for the parallel execution of multiple rather than just the innermost loop dimension of many computationally intensive applications. Besides introducing the main architectural building blocks of these arrays, the presentation covers the corresponding application mapping, which starts from a functional programming language and involves symbolic loop compilation. In this approach, the loop bounds and number of available PEs can be unknown at compile time. Finally, the talk reports on the research endeavor of prototyping an 8x8 TCPA instance as a chip manufactured in 22 nm technology.

Bio

Frank Hannig received a Diploma degree in an interdisciplinary course of study in electrical engineering and computer science from the University of Paderborn, Germany, in 2000; a Ph.D. degree (Dr.-Ing.) and a Habilitation degree in computer science from Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Germany, in 2009 and 2018, respectively. He has led the Architecture and Compiler Design Group in the Computer Science Department at FAU since 2004. His primary research interests are the design of massively parallel architectures, ranging from dedicated hardware to multicore architectures, mapping methodologies for domain-specific computing, and architecture/compiler co-design. He has authored or co-authored more than 200 peer-reviewed publications. Dr. Hannig has served on the program committees of several international conferences (ARC, ASAP, CODES+ISSS, DAC, DATE, DASIP, SAC, SAMOS) and is an associate editor of the Journal of Real-Time Image Processing and IEEE Embedded Systems Letters. He is a Senior Member of the IEEE and an affiliate member of the European Network of Excellence on High Performance and Embedded Architecture and Compilation (HiPEAC).

Chair:

Benno Stabernack

Location: Room 3.06.H01

10:00-10:30Coffee Break

10:30-11:45 Session 7: Computer Architecture Co-Design I

Chair:

Mathias Pacher

Location: Room 3.06.H01

10:30	Chao Qian, Christopher Cichiwskyj, Tianheng Ling and Gregor Schiele Idle is the New Sleep: Configuration-Aware Alternative to Powering Off FPGA-Based DL Accelerators During Inactivity ABSTRACT. In the rapidly evolving Internet of Things (IoT) domain, we concentrate on enhancing energy efficiency in Deep Learning accelerators on FPGA-based heterogeneous platforms. Instead of focusing on the inference phase, we introduce innovative optimizations to minimize the overhead of the FPGA configuration phase. By fine-tuning configuration parameters correctly, we achieved a 40.13-fold reduction in configuration energy. Moreover, augmented with power-saving methods, our Idle-Waiting strategy outperformed the traditional On-Off strategy in duty-cycle mode for request periods up to 499.06 ms. Specifically, at a 40 ms request period within a 4147 J energy budget, this strategy extends the system lifetime to approximately 12.39 times that of the On-Off strategy. Empirically validated through hardware measurements and simulations, these optimizations provide valuable insights and practical methods for achieving energy-efficient deployments in IoT.
10:55	Daniele Passaretti and Thilo Pionteck On-the-fly CT Image Pre-Processing on MPSoC-FPGAs PRESENTER: Daniele Passaretti ABSTRACT. Due to the increasing number of tumors, new interventional Computed Tomography (CT) procedures have been proposed that aim to optimize workflow, time-effective diagnosis and treatment. To support tumor ablation, CT scanners must pre-process 2D projections and reconstruct 3D slices of the human body in real time, while data are acquired. This paper proposes a lightweight processing architecture for MPSoC-FPGAs that performs the "CT pre-processing phase" on the fly; this phase consists of the pixel processing of 2D. It is also suitable for exploring different data formats that can be selected at design time to improve performance while keeping image quality. This article focuses on the cosine and redundancy weighting steps, which can not be implemented following the standard method on embedded MPsoC-FPGAs, due to the high resource utilization costs of their arithmetic operations. Therefore, this work proposes different optimizations that permit the reduction of the number of operations to compute and the amount of on-chip memory required compared with the standard algorithm. Finally, the proposed architecture has been implemented and instantiated within a Control Data Acquisition System (CDAS) architecture running on the XC7Z045 AMD-Xilinx MPSoC-FPGA and integrated into an open-interface CT scanner assembled in our laboratory. Here, the optimized weighting steps use up to 33.8 times fewer DSPs than the implementation based on the standard solution. Furthermore, it adds only 80 ns of latency, making it 7.9 times faster than the implementation based on the standard solution.
11:20	Emmanouil Skordalakis, Andrew Attwood, John Goodacre and Mikel Luján AccProF: Increasing the Accuracy of Embedded Application Profiling using FPGAs ABSTRACT. Accurate software profiling is an essential step in the development of embedded systems. The accuracy of profiling data collected is critically important for embedded systems that operate under fixed timing constraints, which if not met, could lead to system failure. Existing profiling solutions targeting embedded systems introduce an overhead to the running application that distorts the collected profiling data. This paper proposes AccProf for System on Chips with integrated FPGAs. AccProF is a FPGA-assisted profiling framework which combines compiler and bespoke hardware. AccProF is composed of (1) an LLVM pass which inserts into the application binary running on general purpose processors lightweight instrumentation, and (2) a FPGA-based hardware capable of performing offloaded profiling. Offloading part of the profiling tasks, as well as data-structures, on to the FPGA reduces pollution of the collected profiling data leading to higher accuracy. This paper addresses on control graph profiling and evaluates AccProf on a range of benchmarks ported to SeL4 microkernel running on the AMD Zynq MpSoC. We measure performance metrics of these benchmarks across a range of processor statistics including cycles, instruction and data cache misses. We show that the impact for certain metrics is reduced for up to $5\times$ when compared against an equivalent software-based framework.

11:45-13:00Lunch Break

13:00-14:15 Session 8: Progress in HPC

Chair:

Stefan Lankes

Location: Room 3.06.H01

13:00	Lilia Zaourar, Mohamed Benazouz, Ayoub Mouhagir, Carlos Falquez, Antoni Portero, Nam Ho, Estela Suarez, Polydoros Petrakis, Manolis Marazakis, Francesco Sgherzi, Ivan Fernandez, Romain Dolbeau and Dirk Pleiter Case Studies on the Impact and Challenges of Heterogeneous NUMA Architectures for HPC ABSTRACT. The memory systems of High-Performance Computing (HPC) systems commonly feature non-uniform data paths to memory, i.e. are non-uniform memory access (NUMA) architectures. Memory is divided into multiple regions, with each processing unit having its own local memory. Therefore, for each processing unit access to local memory regions is faster compared to accessing memory at non-local regions. Architectures with hybrid memory technologies result in further non-uniformity. This paper presents case studies of the performance potential and data placement implications of non-uniform and heterogeneous memory in HPC systems. Using the gem5 and VPSim simulation platforms, we model NUMA systems with processors based on the ARMv8 Neoverse V1 Reference Design. The gem5 simulator provides a cycle-accurate view, while VPSim offers greater simulation speed, with a high-level view of the simulated system. We highlight the performance impact of design trade-offs regarding NUMA node organization and System Level Cache (SLC) group assignment, as well as Network-on-Chip (NoC) configuration. Our case studies provide essential input to a co-design process involving HPC processor architects and system integrators. A comparison of system configurations for different NoC bandwidths shows reduced NoC latency and high memory bandwidth improvement when NUMA control is enabled. Furthermore, a configuration with HBM2 memory organized as four NUMA nodes highlights the memory bandwidth performance gap and NoC queuing latency impact when comparing local vs. remote memory accesses. On the other hand, NUMA can result in an unbalanced distribution of memory accesses and reduced SLC hit ratios, as shown with DDR4 memory organized as four NUMA nodes.
13:25	Eric Borba, Reza Salkhordeh, Salim Mimouni, Eduardo Tavares, Paulo Maciel, Hossein Asadi and André Brinkmann A Hierarchical Modeling Approach for Assessing the Reliability and Performability of Burst Buffers ABSTRACT. High availability is a crucial aspect of High-Performance Computing (HPC). Solid-state drives (SSDs) offer peak bandwidth as node-local burst buffers (BB). The limited write endurance of SSDs requires thorough investigation to jeopardize the computation. We propose a hierarchical model to evaluate the reliability and performability of BBs. We developed a machine-learning model to dynamically predict storage failures according to the wear caused by different applications. We also conducted an exploratory study to analyze the workload effects on SSD failures, and a representative dataset was adopted.
13:50	Julien Rauch, Damien Rontani and Stéphane Vialle Generative-based algorithm for data clustering on hybrid classical-quantum NISQ architecture ABSTRACT. Clustering is a well-established unsupervised machine-learning approach to classify data automatically. In large datasets, the classical version of such algorithms performs well only if significant computing resources are available (e.g., GPU). An alternative approach relies on integrating a quantum processing unit (QPU) to alleviate the computing cost. This is achieved through the QPU's ability to exploit quantum effects, such as superposition and entanglement, to natively parallelize computation or approximate multidimensional distributions for probabilistic computing (Born rule). In this paper, we propose first a clustering algorithm adapted to a hybrid CPU-QPU architecture while considering the current limitations of noisy intermediate-scale quantum (NISQ) technology. Secondly, we propose a quantum algorithm that exploits the probabilistic nature of quantum physics to make the most of our QPU's potential. Our approach leverage on ideas from generative machine-learning algorithm and variational quantum algorithms (VQA) to design an hybrid QPU-CPU algorithm based on a mixture of so-called quantum circuits Born machines (QCBM). We hope to achieve accurate data clustering and acceleration on the NISQ architectures scheduled to be available in the next few years. Finally, we analyse our results and summarize the lessons learned from exploiting a CPU-QPU architecture for data clustering.

14:15-14:45Coffee Break

14:45-15:35 Session 9: Computer Architectures

Chair:

Thilo Pionteck

Location: Room 3.06.H01

14:45

Luke Panayi, Rohan Gandhi, Jim Whittaker, Vassilios Chouliaras, Martin Berger and Paul Kelly

Improving Memory Dependence Prediction with Static Analysis

ABSTRACT. This paper explores the potential of communicating information gained by static analysis from compilers to Out-of-Order (OoO) machines, focusing on the memory dependence predictor (MDP). The MDP enables loads to issue without all in-flight store addresses being known, with minimal memory order violations. We use LLVM to find loads with no dependencies and label them via their opcode. These labelled loads skip making lookups into the MDP, improving prediction accuracy by reducing false dependencies. We communicate this information in a minimally intrusive way, i.e. without introducing additional hardware costs or instruction bandwidth, providing these improvements without any additional overhead in the CPU. We find that in select cases in Spec2017, a significant number of load instructions can skip interacting with the MDP and lead to a performance gain. These results point to greater possibilities for static analysis as a source of near zero cost performance gains in future CPU designs.

15:10

Antti Nurmi, Per Lindgren, Abdesattar Kalache, Henri Lunnikivi and Timo Hämäläinen

Atalanta: Open-Source RISC-V Microcontroller for Rust-Based Hard Real-Time Systems

ABSTRACT. Real-time systems are a segment of embedded systems that have remained dominated by proprietary hardware architectures, despite the continuing growth of the open source RISC-V instruction set architecture (ISA). The introduction of core-local interrupt controller (CLIC) extensions to the RISC-V architecture presents a promising opportunity to bridge the technological gap with ARM in low-latency interrupt handling. Regarding software, the real-time interrupt-driven concurrency (RTIC) framework enables ever lighter hard real-time systems with formal compile-time guarantees for memory safety, response time and over all schedulability.

In this publication we adapt Ibex, a small, open-source RISC-V processor for CLIC support and present Atalanta, a lightweight microcontroller designed around the RTIC framework. Atalanta implements a localized memory architecture that enables low-latency context switching together with the large number of supported interrupt inputs and priorities provided by CLIC.

We evaluate Atalanta for real-time performance and implementation feasibility through simulation-based measurements and FPGA prototyping, respectively. We are able to demonstrate a worst-case interrupt latency of 5 cycles with minimal jitter and a context switch latency of 21 cycles, placing it competitively against current state-of-the-art solutions. Furthermore, we implement an FPGA prototype for the Xilinx PYNQ-Z1 and VCU118 boards, targeting a frequency of 45 MHz. We publish the sources and implementation scripts of Atalanta under a permissible open-source license.

16:00-17:40 Session 10: Organic Computing II

Location: Room 3.06.H01

16:00	Ghassan Al-Falouji, Shang Gao, Lukas Haschke, Dirk Nowotka and Sven Tomforde Enhancing Maritime Behaviour Analysis through Novel Feature Engineering and Digital Shadow Modelling: A Case Study in Kiel Fjord ABSTRACT. With the continuous evolution of maritime technology, there is a growing need for analysing and modelling vessel behaviour in complex waterway systems. This paper presents an extension and utilisation of the Surface Vessel Nautical Behaviour Analysis (SV-NBA) framework for in-depth spatio-temporal analysis of maritime surface vessels' behaviour in Kiel Fjord. Leveraging one year of collected Automatic Identification System (AIS) data, we extracted features from the recorded data. Three feature sets are generated and compared using expert-knowledge features, methods for feature selection, and Denoising Autoencoder latent space representation. Behaviour modelling and analysis utilises clustered data from the three feature sets, employing Gaussian Mixture Models (GMM). The trained GMM models serve as digital shadows, enabling storage-efficient representation of vessel behaviour and facilitates applications such as online situational awareness and marine-traffic management. These digital shadows serve as observer-layer in a dynamic autonomous system of system, offering insights into maritime activities and enhancing navigation safety in busy waterways. This research contributes to the advancement of autonomous navigation systems and supports efficient maritime traffic management strategies.
16:25	Sourav Modak and Anthony Stein Synthesizing Training Data for Intelligent Weed Control Systems Using Generative AI PRESENTER: Sourav Modak ABSTRACT. Deep Learning already plays a pivotal role in technical systems performing various crop protection tasks, including weed detection, disease diagnosis, and pest monitoring. However, the efficacy of such data-driven models heavily relies on large and high-quality datasets, which are often scarce and costly to acquire in agricultural contexts. To address the overarching challenge of data scarcity, augmentation techniques have emerged as a popular strategy to expand training data amount and variation. Traditional data augmentation methods, however, often fall short in reliably replicating real-world conditions and also lack diversity in the augmented images, hindering robust model training. In this paper, we introduce a novel methodology for synthetic image generation designed specifically for object detection tasks in the agricultural context of weed control. We propose a pipeline architecture for synthetic image generation that incorporates a foundation model called Segment Anything Model (SAM), which allows for zero-shot transfer to new domains, along with the recent generative AI-based Stable Diffusion Model. Our methodology aims to produce synthetic training images that accurately capture characteristic weed and background features while replicating the authentic style and variability inherent in real-world images with high fidelity. In view of the integration of our approach into intelligent technical systems, such a pipeline paves the way for continual selfimprovement of the perception modules when put into a self-reflection loop. First experiments on real weed image data from a current research project reveal our method’s capability to reconstruct the innate features of real-world weed infested scenes from an outdoor experimental setting.
16:50	Glen te Hofsté, Andreas Lund, Marco Ottavi and Daniel Lüdtke Towards the Online Reconfiguration of a Dependable Distributed On-board Computer PRESENTER: Glen te Hofsté ABSTRACT. On-board Computers (OBC) are at the centre of space-faring systems. They provide computational performance to the system with high availability and dependability. However, these systems typically consist of expensive, slow, fault-tolerant hardware to cope with errors or failures during a mission. Commercial-off-the-shelf (COTS) components offer higher performance but do not provide the fault-tolerance mechanisms. The ScOSA (Scalable On-board Computing for Space Avionics) architecture uses COTS and rad-hard components as a distributed system, with the advantage of providing more computing performance than current OBCs while maintaining the dependability properties. ScOSA uses a middleware to manage the COTS components as a distributed system of nodes, which, in the event of a node failure, mitigates the effects by reconfiguring the system to a configuration that excludes the failed node using a pre-determined configuration. These configurations are computed offline and have an exponentially growing memory usage depending on the number of nodes in the system, which limits the system's scalability. This paper presents an online reconfiguration algorithm as a solution to this scalability problem. Upon the occurrence of a node failure event, the online algorithm makes scheduling decisions at run-time, eliminating the need for pre-determined configurations. A novel online scheduling mechanism, consisting of six phases, which includes a combination of fault-tolerance, parallelism, and the use of the real-time state of the system, is a step towards higher dependability in distributed on-board computing. The online reconfiguration is evaluated by comparing it to the offline reconfiguration in terms of time and network traffic, showing that it is not only capable of generating configurations dynamically but also provides a solution to the scalability problem.
17:15	Markus Timotheus Kisselbach, Philipp Wörner, Uwe Brinkschulte and Mathias Pacher An Organic Computing Approach for CARLA Simulator ABSTRACT. Autonomous vehicles are increasingly being equipped with a large number of Electronic Control Units. This development leads to an increasing complexity of the systems, which in turn increases the probability of failures and unforeseen errors. To address these challenges, this article presents the integration of the Artificial DNA (ADNA)-based Organic Computing approach into the CAR Learning to Act (CARLA) simulator. CARLA is a powerful tool for the automotive industry to explore autonomous driving in a cost-efficient way. It therefore offers an ideal environment for testing innovative solutions from the field of Organic Computing. The research objective is to implement and evaluate Organic Computing methods in a vehicle environment in order to increase the reliability of vehicle functions. Thanks to the ADNA-based Organic Computing approach, the self-* properties of vehicles are available, and their driving behaviour can be researched. The first experiments will be presented in which the vehicle is controlled both manually and autonomously entirely by ADNA-based Organic Computing.

18:00-20:00 Social Event: Sightseeing in Potsdam

Social Event