ARCS2023: 36TH GI/ITG INTERNATIONAL CONFERENCE ON ARCHITECTURE OF COMPUTING SYSTEMS
PROGRAM FOR WEDNESDAY, JUNE 14TH


09:00-10:00 Session 4: Keynote #2

Speaker: Vasileios Karakostas

Title: Optimizing the memory access path across the computing stack

Abstract: The performance gap between accessing data and processing them has been a long-standing problem. Accessing data consists of two steps: (i) performing address translation to identify where the data lives, and (ii) retrieving the data itself. However, two critical trends in modern computing systems prevent closing this performance gap and may even widen it further. First, memory resources are growing larger to satisfy the immense demand of modern applications for processing increasingly large datasets, stressing address translation. Second, the introduction of persistent memory makes data retrieval much faster compared to traditional devices, revealing new sources of overhead. In this talk, we will discuss the challenges and opportunities that these trends introduce, and present some concepts and approaches for improving memory accesses across the computing stack.

Bio: Vasileios Karakostas is an Assistant Professor at the Department of Informatics and Telecommunications of the National and Kapodistrian University of Athens. He is a member of the Computer Architecture Lab. Before joining the University of Athens, he was a postdoctoral researcher at the Computing Systems Lab of the School of Electrical and Computer Engineering of the National Technical University of Athens. He received his PhD degree in Computer Architecture from Universitat Politècnica de Catalunya and Barcelona Supercomputing Center in 2016, and his Diploma in Electrical and Computer Engineering from the National Technical University of Athens. His research interests lie in the areas of computer architecture, memory systems, hardware/OS interaction, operating systems, resource management, and parallel systems. His research results have been published and awarded at international computer architecture and systems conferences. He has participated in several European research projects, and has served as reviewer in numerous conferences and journals. He is a member of IEEE, ACM, and HiPEAC.

10:30-11:20 Session 5: Dependability and Fault Tolerance (VERFE)
10:30
Error Codes in and for Network Steganography

ABSTRACT. We illustrate the interrelationship between network steganography and error coding through examples where error codes (correction or erasure codes) are used in steganographic channels, and examples where steganographic channels are established in data to which error codes are applied. In particular, we experimentally investigate an existing approach of a steganographic channel in a transmission with an error correction code with respect to bandwidth, robustness, and detectability, and expand this construction to provide another example of multi-level steganography, i.e., a steganographic channel within a steganographic channel.

10:55
Modified Cross Parity Codes For Adjacent Double Error Correction

ABSTRACT. The cross parity code is a well-known single-error-correcting and double-error-detecting (SEC-DED) code with fast decoding. Data bits are abstractly arranged in a rectangular array, and check bits are determined as parities along rows and columns. In this paper we propose to divide the data-bit array into four quadrants Q1, Q2, Q3 and Q4. For every quadrant a parity bit is determined. Compared to a non-modified cross parity code, the number of check bits is increased by 3. All single-bit errors, as well as all 2-bit errors with a first error in Q1 and a second error in Q3, or with a first error in Q2 and a second error in Q4, can be corrected. Uncorrectable 2-bit errors are detected. By placing bits that are, for instance, stored in adjacent memory elements into appropriate quadrants Q1 and Q3 or Q2 and Q4, adjacent 2-bit errors can be corrected. Check-bit errors of up to 2 bits are detected, as are all burst errors shorter than the side length of the data array. Correct check bits can be recomputed from the corrected data bits. The correction is as fast as for an unmodified cross parity code.

11:20-12:10 Session 6: Computer Architecture Co-Design
11:20
COMPESCE: A Co-design Approach for memory subsystem Performance Analysis in HPC many-cores

ABSTRACT. This paper explores the memory subsystem design through gem5 simulations of a NUMA architecture with Arm cores equipped with vector engines and connected to a Network-on-Chip (NoC) following the CHI protocol. The study quantifies the benefits of vectorization, prefetching, and multichannel NoC configurations using a benchmark for generating memory patterns and indexed accesses. The outcomes provide insights into improving bus utilization and bandwidth and reducing stalls in the system. Furthermore, the simulated environment and experiments conducted show comparable results to existing state-of-the-art high-performance computing systems, highlighting the effectiveness of the proposed design.

11:45
Post-Silicon Customization Using Deep Neural Networks

ABSTRACT. Dynamically customizing processor architecture after fabrication, also known as Post-Silicon Customization (PSC), is effective in balancing the conflicting demands of power and performance for various applications. Existing approaches use application-specific profiles, ad-hoc heuristics, or simple machine learning models. These techniques often do not unleash the full potential of PSC, as they fail to explore and exploit PSC opportunities to a larger extent. Towards that end, we propose the first deep neural network (DNN) based PSC technique, called Forecaster. Forecaster exploits several intuitive observations to cope with the long inference latency of a DNN model and to boost customization impact. Forecaster works in two phases. In Phase 1, Forecaster builds a dataset and then selects and trains a suitable DNN model offline. In Phase 2, Forecaster periodically collects hardware telemetry and uses the trained model to customize hardware resources. We provide a detailed design and implementation of Forecaster and compare its performance against a prior state-of-the-art approach. Our experimental results indicate that, on average, Forecaster provides a 2.5X higher power efficiency gain over the best static configuration while sacrificing less than 1.0% of overall performance and less than 3.5% extra system power. Compared to the prior scheme, Forecaster increases the power efficiency gain by up to 1.5X while reducing the performance degradation by 44%.

13:45-15:00 Session 7: Computer Architectures and Operating Systems
13:45
TOSTING: Investigating Total Store Ordering on ARM

ABSTRACT. The Apple M1 ARM processors incorporate two memory consistency models: the conventional ARM weak memory ordering and the total store ordering (TSO) model from the x86 architecture employed by Apple’s x86 emulator, Rosetta 2. The presence of both memory ordering models on the same hardware enables us to thoroughly benchmark and compare their performance characteristics and worst-case workloads. In this paper, we assess the performance implications of TSO on the Apple M1 processor architecture. Based on various workloads, our findings indicate that TSO is, on average, 8.94 percent slower than ARM’s weaker memory ordering. Through synthetic benchmarks, we further explore the workloads that experience the most significant performance degradation due to TSO.

14:10
Back to the Core-Memory Age: Running Operating Systems in NVRAM only

ABSTRACT. Classical core memory was entirely non-volatile and could keep at least part of the operating system (OS) in main memory even across power cycles. These days we can have terabytes of NVRAM to repeat this approach, albeit on an entirely different scale and with large parts of the OS state still kept in the volatile CPU caches. In this paper, we discuss our experiences of running large modern operating systems including their applications entirely in NVRAM. We adapted stock Linux and FreeBSD kernels to work exclusively with NVRAM by hiding all DRAM from the kernels at boot time, establishing a realistic performance baseline without changing anything else. Following this entirely NVRAM-agnostic approach, we observed an effective performance penalty of a factor of about four, but only negligible increases in whole-system power draw. For our system with two CPU sockets and 56 cores total, we also observed a reduction in power draw in several scenarios. Due to prolonged execution times, however, energy consumption increased for the measured workloads. While this might be discouraging at first sight, this result was achieved without any performance tuning for the specific characteristics of today's NVRAM technology. Therefore, we also discuss means to mitigate the observed shortcomings by integrating NVRAM appropriately into the memory hierarchy of future robust persistent systems.

14:35
Retrofitting AMD x86 processors with active virtual machine introspection capabilities

ABSTRACT. Active virtual machine introspection mechanisms intercept the control flow of a virtual machine running on top of a hypervisor. They enable external tools to monitor and inspect the state at predetermined locations of interest, synchronously with the execution of the system. Such mechanisms, in particular, require support from the processor vendor to facilitate interpositioning. This support is missing on AMD x86 processors, leading to inferior introspection solutions. We outline implicit assumptions about active introspection mechanisms in previous work, offer constructions for solution strategies on AMD systems, and discuss their stealthiness and correctness. Finally, we show empirically that such retrofitted software solutions exhibit performance metrics in the same order of magnitude as native hardware solutions.

15:30-16:45 Session 8: Organic Computing Applications 1 (OC2)
15:30
Abstract Artificial DNA’s Improved Time Bounds

ABSTRACT. The Artificial DNA (ADNA) has proven to be a valuable tool for designing distributed embedded systems that are self-organizing, self-healing, and self-configuring. However, the practical application of ADNA has been limited by the need for extensive knowledge about the targeted hardware and available sensors, which hinders the reusability and adaptability of existing ADNAs. To address this challenge, the abstract ADNA (A2DNA) has been proposed as a solution. In an A2DNA, sensor elements are replaced with abstract sensors that describe the required properties of the sensory input. Only when the A2DNA is initialized on the target hardware are these abstract sensors bound to a combination of actually available sensors. Furthermore, a semantic knowledge base provides information about the hardware's sensors and their relationships. To convert an A2DNA into a hardware-specific ADNA, knowledge is required about how to calculate, from the available sensors, a required sensor value that cannot be measured directly. This paper presents and analyzes two algorithms that determine this knowledge.

15:55
Evaluating the Comprehensive Adaptive Chameleon Middleware for Mixed-Critical Cyber-Physical Networks

ABSTRACT. Cyber-Physical Systems (CPS) are growing more and more complex due to the availability of cheap hardware, sensors, actuators, and communication links. A network of cooperating CPSs (CPN) increases this complexity further. Moreover, CPNs are often deployed in dynamic, unpredictable environments and in safety-critical domains such as transportation, energy, and healthcare. In such domains, applications of different criticality levels usually coexist. As a result of this mixed criticality, applications requiring hard real-time guarantees compete with those requiring soft real-time guarantees and with best-effort applications for the given resources within the overall system.

This poses challenges but also offers opportunities: the increasing complexity makes it harder to design, operate, optimize, and maintain such CPNs. On the other hand, appropriate use of the increasing resources in computational nodes, sensors, and actuators can significantly improve system performance, reliability, and flexibility. Hence, Organic Computing concepts such as self-X features (self-organization, self-adaptation, self-healing, etc.) are key principles for such systems.

Therefore, the comprehensive adaptive middleware Chameleon has been developed, which applies these principles to CPNs. In this paper, the self-adaptation mechanism of Chameleon, based on a MAPE-K loop and learning classifier systems, is examined and evaluated. The results show its effectiveness in autonomously managing system resources to keep the required constraints of the applications with respect to their criticality.

16:20
CoLeCTs: Cooperative Learning Classifier Tables for Resource Management in MPSoCs
PRESENTER: Klajd Zyla

ABSTRACT. The increasing complexity and unpredictability of emerging applications make it challenging for multi-processor systems-on-chip to satisfy their performance requirements while keeping power consumption within bounds. In order to tackle this problem, the research community has focused on developing dynamic resource managers that aim to optimize runtime parameters such as clock frequency, voltage, and task mapping. There is a large diversity in the approaches proposed in this context, but a class of resource managers that has gained traction recently is that of reinforcement learning-based controllers. In this paper we propose CoLeCTs, a resource manager that enhances the state-of-the-art resource manager SOSA by employing a joint reward assignment policy and enabling collaborative information exchange among multiple learning agents. In this manner we tackle the suboptimal determination of local performance targets for heterogeneous applications and enable cooperative decision making for the learning agents. We evaluate and quantify the benefits of our approach via trace-based simulations.