
09:00-10:00 Session 2: [Keynote] Dr. Arne Hamann

Title: Designing Reliable Distributed Systems

Abstract: Software is disrupting one industry after another. Currently, the automotive industry is under pressure to innovate in the area of software. New, innovative approaches to vehicles and their HW/SW architectures are required and are currently subsumed under the term “SW-defined vehicle”. However, this trend does not stop at the vehicle boundaries, but also includes communication with off-board edge and cloud services. Thinking it through further, this leads to a breakthrough technology we call “Reliable Distributed Systems”, which enables the operation of vehicles where time- and safety-critical sensing and computing tasks are no longer tied to the vehicle, but can be shifted into an edge-cloud continuum. This allows a variety of novel applications and functional improvements but also has a tremendous impact on automotive HW/SW architectures and the value chain. Reliable distributed systems are not limited to automotive use cases. The ubiquitous and reliable availability of distributed computing and sensing in real time enables novel applications and system architectures in a variety of domains: from industrial automation and building automation to consumer robotics. However, designing reliable distributed systems raises several issues and poses new challenges for edge and cloud computing stacks as well as electronic design automation.

10:00-10:30 Coffee Break

Expomeloneras's Hall

10:30-12:00 Session 3A: HW-SW Reconfigurability
Towards Hardware Support for FPGA Resource Elasticity

ABSTRACT. FPGAs are increasingly being deployed in the cloud to accelerate diverse applications. They are to be shared among multiple tenants to improve the total cost of ownership. Partial reconfiguration technology enables multi-tenancy on an FPGA by partitioning it into regions, each hosting a specific application's accelerator. However, the regions' sizes cannot be changed once they are defined, resulting in underutilization of FPGA resources. This paper argues for dividing the acceleration requirements of an application into multiple small computation modules. The devised FPGA shell can reconfigure the available PR regions with those modules and enable them to communicate with each other over a crossbar interconnect with a Wishbone bus interface. The implemented crossbar's reconfiguration ability allows modifying the number of allocated modules and the bandwidth allocated to them while preventing any invalid communication request. The envisioned resource manager can increase or decrease the number of PR regions allocated to an application based on its acceleration requirements and the availability of PR regions.

Analysis of Graph Processing in Reconfigurable Devices for Edge Computing Applications

ABSTRACT. Graph processing is an area that has received significant attention in recent years due to the substantial expansion in industries relying on data analytics. Alongside its vital role in finding relations in social networks, graph processing is also widely used in transportation to find optimal routes and in biological networks to analyse sequences. The main bottleneck in graph processing is irregular memory accesses rather than computational intensity. Since computational intensity is not a driving factor, we propose a method to perform graph processing more efficiently at the edge. We believe current cloud computing solutions are still very costly and suffer from latency issues. FPGAs have traditionally been used primarily for prototyping, with ASICs serving as the final deployment platform; nowadays, reconfigurability plays a considerable role in reducing development time and resource usage. We demonstrate how a dedicated sparse graph processing algorithm fits the job when analysing data with low edge density. As graph datasets grow exponentially, traversal algorithms such as breadth-first search (BFS), fundamental to many graph processing applications and metrics, become more costly to compute. Our work reviews other implementations of breadth-first search algorithms designed for low-power systems and proposes a solution that utilises advanced enhancements to achieve a significant performance boost of up to 9.2x in terms of MTEPS compared to other state-of-the-art solutions, with a power usage of 2.32 W.
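As a point of reference for the traversal being accelerated, a minimal level-synchronous BFS over a graph in CSR form can be sketched as follows (an illustrative Python model, not the paper's implementation; the MTEPS figure is derived by dividing the traversed-edge count by the runtime):

```python
from collections import deque

def bfs_csr(row_ptr, col_idx, source):
    """Level-synchronous BFS over a graph in CSR form.

    Returns the BFS level of every vertex (-1 if unreachable) and the
    number of edges traversed, the numerator of the MTEPS metric."""
    n = len(row_ptr) - 1
    level = [-1] * n
    level[source] = 0
    frontier = deque([source])
    edges_traversed = 0
    while frontier:
        u = frontier.popleft()
        for j in range(row_ptr[u], row_ptr[u + 1]):
            edges_traversed += 1
            v = col_idx[j]
            if level[v] == -1:          # irregular, data-dependent access
                level[v] = level[u] + 1
                frontier.append(v)
    return level, edges_traversed

# Small example: a path graph 0-1-2 plus an isolated vertex 3.
row_ptr = [0, 1, 3, 4, 4]
col_idx = [1, 0, 2, 1]
levels, edges = bfs_csr(row_ptr, col_idx, 0)
```

The data-dependent check on `level[v]` is exactly the irregular memory access pattern the abstract identifies as the bottleneck.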

A Supervisory Control Approach for Scheduling Real-time Periodic Tasks on Dynamically Reconfigurable Platforms

ABSTRACT. The dynamic partial reconfiguration (DPR) feature offered by modern FPGAs provides the flexibility of adapting the underlying hardware at runtime according to the needs of a particular situation, in response to application requirements. In recent times, DPR along with drastically reduced reconfiguration overheads has made it possible to schedule multiple real-time applications on FPGA platforms. However, in order to effectively harness the computation capacity of an FPGA floor, efficient techniques that can schedule real-time applications over both space and time are required. It may be noted that safety-critical systems often require resource-optimal solutions to reduce the size, weight, cost and power consumption of the system. However, the scheduling of real-time tasks on FPGAs in the presence of non-negligible reconfiguration/context-switching overheads requires careful exploration of the state space, which often makes it prohibitively expensive to apply online. Hence, off-line formal approaches are often preferred in the design of reconfiguration controllers (i.e., schedulers) that are correct-by-construction as well as optimal in terms of resource usage. In this paper, we propose a formal scheduler synthesis framework that generates an optimal scheduler for a set of non-preemptive periodic real-time tasks executing on an FPGA platform. We show the practical viability of our proposed framework by synthesizing schedulers for real-world benchmark applications and implementing them on FPGAs.

moreMCU: A Runtime-reconfigurable RISC-V Platform for Sustainable Embedded Systems
PRESENTER: Tobias Scheipel

ABSTRACT. As the number of embedded systems continues to grow, so does the number of discarded electronic devices. This is mainly due to partially or fully outdated hardware, caused by new legal regulations or by cutting-edge features within a new generation of devices or hardware components. As most devices are designed without long-term maintainability in mind and can be replaced without much monetary effort, it is often easier to dispose of them. This throw-away mentality, however, increases the carbon footprint enormously.

Within this work, we propose a platform that can be used to design future embedded systems in a more sustainable way by preparing them for long-term hardware adaptations. To do so, we aim to make logic updatable and re-usable while the device stays operational. This is achieved by carefully co-designing an operating system and a microcontroller platform with reconfigurable logic. In this paper, we use a RISC-V-based microcontroller running on a field-programmable gate array. The microcontroller is designed to feature a modular pipeline and replaceable on-chip peripherals, alongside a partial reconfiguration controller that can hot-swap parts of the microcontroller while it is running. It is supported by an operating system that handles the reconfiguration as well as functionality emulation, in case the functionality is not (yet) available in hardware. Both the hardware and the software are aware of each other and can manipulate shared data structures to manage the reconfiguration concept. The experimental evaluation, carried out on an Artix-7 device, shows the proper operation of the on-the-fly reconfiguration of a proof-of-concept system, alongside performance measurements and resource utilization, without affecting the execution of the remainder of the system.

10:30-12:00 Session 3B: AAMTM-1
PipeEdge: Pipeline Parallelism for Large-Scale Model Inference on Heterogeneous Edge Devices
PRESENTER: Connor Imes

ABSTRACT. Deep neural networks with large model sizes achieve state-of-the-art results for tasks in computer vision and natural language processing. However, such models are too compute- or memory-intensive for resource-constrained edge devices. Prior works on parallel and distributed execution primarily focus on training---rather than inference---using homogeneous accelerators in data centers. We propose PipeEdge, a distributed framework for edge systems that uses pipeline parallelism to both speed up inference and enable running larger, more accurate models that otherwise cannot fit on single edge devices. PipeEdge uses an optimal partition strategy that considers heterogeneity in compute, memory, and network bandwidth. Our empirical evaluation demonstrates that PipeEdge achieves $11.88\times$ and $12.78\times$ speedup using 16 edge devices for the ViT-Huge and BERT-Large models, respectively, with no accuracy loss. Similarly, PipeEdge improves throughput for ViT-Huge (which cannot fit in a single device) by $3.93\times$ over a 4-device baseline using 16 edge devices. Finally, we show up to $4.16\times$ throughput improvement over the state-of-the-art PipeDream when using a heterogeneous set of devices.
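The partitioning problem PipeEdge solves can be illustrated in miniature. The sketch below splits an ordered list of layer costs into contiguous stages, one per device in a fixed order, minimising the bottleneck stage time that limits pipeline throughput. It is a deliberate simplification that ignores the memory and network-bandwidth constraints the paper's partition strategy also considers, and all names are illustrative:

```python
import functools

def partition_layers(layer_costs, device_speeds):
    """Split an ordered list of layer costs into contiguous stages,
    one per device (used in the given order), minimising the slowest
    stage -- the bottleneck that limits pipeline throughput.
    Returns (bottleneck_time, stage_end_indices)."""
    n, k = len(layer_costs), len(device_speeds)
    prefix = [0.0]
    for c in layer_costs:
        prefix.append(prefix[-1] + c)

    @functools.lru_cache(maxsize=None)
    def best(i, d):
        # Minimal bottleneck for layers i.. placed on devices d..
        if d == k - 1:                      # last device takes the rest
            return (prefix[n] - prefix[i]) / device_speeds[d], (n,)
        result = (float("inf"), ())
        for j in range(i + 1, n - k + d + 2):
            stage = (prefix[j] - prefix[i]) / device_speeds[d]
            rest, cuts = best(j, d + 1)
            if max(stage, rest) < result[0]:
                result = (max(stage, rest), (j,) + cuts)
        return result

    return best(0, 0)
```

With four equal-cost layers on two equal devices this returns an even 2/2 split; with heterogeneous speeds the cut shifts toward the faster device.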

An FPGA based Tiled Systolic Array Generator to Accelerate CNNs

ABSTRACT. The main computation in any CNN is the convolution operation. This computation shows significant potential for massively parallel implementation on an FPGA. Systolic arrays, with their intrinsic pipelining, have been explored for CNN inference. In this paper, we present a systolic array architecture designed for a novel method of convolution. We implement an image-kernel convolution and test it with representative image inputs to several models such as LeNet-5, AlexNet, VGG-16, and ResNet-34. We compare the proposed design with conventional convolution and HLS-based designs. We limit our implementation to a resource-constrained FPGA: the AMD-Xilinx Zynq 7020 platform. We observe that the proposed architecture outperforms the direct convolution method and HLS pipelined designs by 2x and 2.1x on average, respectively. Since DSP blocks are scarce resources, we constrain our implementation to avoid DSP blocks and use LUTs instead. Thus, our implementation uses nearly 9x more LUTs than baseline convolution but 8x fewer LUTs than the HLS pipelined implementation. We further accelerate the convolution throughput by 11x by implementing a tiled systolic architecture that fully utilises the parallel computing resources of the FPGA.
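The computation a convolution systolic array pipelines can be written as one matrix multiply. The sketch below (illustrative Python/NumPy, not the paper's architecture) uses the standard im2col formulation, in which every multiply-accumulate maps naturally onto a processing element:

```python
import numpy as np

def conv2d_im2col(image, kernel):
    """2-D convolution (valid padding, no kernel flip, as in CNNs)
    expressed as a single matrix multiply -- the formulation a
    systolic array pipelines, one multiply-accumulate per PE per
    cycle."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    # "im2col": unroll every kh x kw patch of the image into one row.
    patches = np.array([image[i:i + kh, j:j + kw].ravel()
                        for i in range(oh) for j in range(ow)])
    return (patches @ kernel.ravel()).reshape(oh, ow)
```

Tiling, as in the paper's tiled systolic design, amounts to blocking this matrix multiply so each tile fits the array's fixed PE grid.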

CPU-GPU Layer-Switched Low Latency CNN Inference

ABSTRACT. Convolutional Neural Network (CNN) inference on Heterogeneous Multi-Processor System-on-Chips (HMPSoCs) in edge devices represents cutting-edge embedded machine learning. The embedded CPU and GPU within an HMPSoC can both perform inference using CNNs. However, common practice is to run a CNN on the HMPSoC component (CPU or GPU) that provides the best performance (lowest latency) for that CNN. CNNs, however, are not monolithic but are composed of several layers of different types. Some of these layers have lower latency on the CPU, while others execute faster on the GPU. In this work, we investigate the reason behind this observation. Furthermore, we propose a CNN execution that switches between CPU and GPU at layer granularity, wherein each CNN layer executes on the component that provides it with the lowest latency. Switching between CPU and GPU back and forth mid-inference inevitably introduces additional overhead (delay). Regardless of the overhead, we show in this work that CPU-GPU layer-switched execution results in, on average, 4.72% lower CNN inference latency on the Khadas VIM3 board with an Amlogic A311D HMPSoC.
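The layer-switching idea can be captured with a small dynamic program: pick, per layer, the device that minimises cumulative latency, while charging a cost for each CPU-GPU switch. This sketch is illustrative only; the measured latencies and overheads in the paper come from real hardware:

```python
def schedule_layers(cpu_lat, gpu_lat, switch_cost):
    """Per-layer CPU/GPU assignment minimising total latency, charging
    `switch_cost` whenever consecutive layers run on different
    devices.  Dynamic program over (layer, device); 0 = CPU, 1 = GPU."""
    best = [cpu_lat[0], gpu_lat[0]]       # best cost ending on CPU / GPU
    choice = [[0], [1]]                   # assignments realising `best`
    for c, g in zip(cpu_lat[1:], gpu_lat[1:]):
        stay_cpu, move_cpu = best[0] + c, best[1] + switch_cost + c
        stay_gpu, move_gpu = best[1] + g, best[0] + switch_cost + g
        choice = [(choice[0] if stay_cpu <= move_cpu else choice[1]) + [0],
                  (choice[1] if stay_gpu <= move_gpu else choice[0]) + [1]]
        best = [min(stay_cpu, move_cpu), min(stay_gpu, move_gpu)]
    d = 0 if best[0] <= best[1] else 1
    return best[d], choice[d]
```

When the switch cost is low the schedule alternates devices per layer; as it grows, the schedule collapses to the single best device, which mirrors the trade-off the abstract describes.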

Co-Optimizing Sensing and Deep Machine Learning in Automotive Cyber-Physical Systems

ABSTRACT. Accurate perception of the environment is critical to achieving safety and performance goals in emerging semi-autonomous vehicles. Building a perception architecture to support autonomy goals in vehicles requires solving many complex problems related to sensor selection and placement, sensor fusion, and machine-learning-driven object detection. In this paper, we present a framework for co-optimizing sensing and machine learning to meet autonomy goals in emerging automotive cyber-physical systems. Experimental results that target level-2 autonomy goals for the Audi TT and BMW Mini Cooper vehicles demonstrate how our framework can intelligently traverse the massive design space to find robust, vehicle-specific perception architecture solutions.

Partial Evaluation in Junction Trees

ABSTRACT. One prominent method to perform inference on probabilistic graphical models is the probability propagation in trees of clusters (PPTC) algorithm. In this paper, we demonstrate the use of partial evaluation, an established technique from the compiler domain, to improve the performance of online Bayesian inference using the PPTC algorithm in the context of observed evidence. We present a metaprogramming-based method to transform a base program into an optimized version by precomputing the static input at compile time while guaranteeing behavioral equivalence. We achieve an inference time reduction of 21% on average for the UAI2014 Promedus benchmark.
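Partial evaluation itself is easy to illustrate outside the junction-tree setting: fix the static input once, precompute everything that depends only on it, and return a residual program over the dynamic input. The following is a toy Python analogue, not the PPTC implementation:

```python
def make_poly_evaluator(coeffs):
    """Partial evaluation in miniature: `coeffs` is the static input,
    fixed once.  Work that depends only on it (here, dropping zero
    terms) happens at "compile time"; the returned residual program
    depends only on the dynamic input x."""
    nonzero = [(i, c) for i, c in enumerate(coeffs) if c != 0]

    def evaluate(x):
        # Residual program: the zero terms have been specialised away.
        return sum(c * x ** i for i, c in nonzero)

    return evaluate
```

In the paper's setting, the observed evidence plays the role of `coeffs`: it is known before inference starts, so the computations it fixes can be folded away ahead of the online queries.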

10:30-12:00 Session 3C: AHSA-1
Variable-Length Instruction Set: Feature or Bug?
PRESENTER: Ihab Alshaer

ABSTRACT. With the increasing complexity of digital applications, the use of variable-length instruction sets has become essential in order to achieve higher code density and thus better performance. However, security aspects must always be considered, in particular given the significant improvement of attack techniques and equipment. Fault injection, in particular, is among the most interesting and promising attack techniques thanks to recent advancements. In this article, we provide a proper characterization, at the instruction set architecture (ISA) level, of several faulty behaviors that can be obtained when targeting a variable-length instruction set. We take into account the binary encoding of instructions, and show how the obtained behaviors depend on the alignment of the instructions in memory. Moreover, we are also able to give better insight into previous results from the literature that were still partially unexplained. We also show how the observed behaviors can be exploited in various security contexts.

A CFI Verification System based on the RISC-V Instruction Trace Encoder
PRESENTER: Anthony Zgheib

ABSTRACT. Control-Flow Integrity (CFI) is used to check a program's execution flow and detect whether it is correctly executed and not altered by software or physical attacks. This paper presents a CFI verification system for programs executed on RISC-V cores. Our solution is based on the RISC-V instruction Trace Encoder (TE), which provides information about the execution path of the user program. Two approaches are proposed. One is consistent with the RISC-V TE standard and permits the detection of instruction skip attacks on function calls, on their returns, and on branch instructions. The second implies an evolution of the RISC-V TE specifications to detect more complex fault models, such as the corruption of any discontinuity instruction. We implemented both approaches on a RISC-V core and simulated their efficiency against Fault Injection Attacks (FIA). Compared to existing CFI solutions, our methodology modifies neither the user application code nor the RISC-V compiler.
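The flavour of such trace-based checking can be sketched as follows: every control-flow transfer reported by the trace must be either a sequential fall-through or an edge of the statically extracted control-flow graph. This is a simplified Python model (fixed 4-byte instructions, illustrative addresses), not the paper's hardware design:

```python
def verify_trace(trace, valid_edges):
    """Coarse CFI check: each transfer reported by the trace must be a
    sequential fall-through (assuming fixed 4-byte RISC-V instructions,
    i.e. no compressed extension) or an edge of the statically
    extracted CFG.  Returns (ok, offending_edge)."""
    for src, dst in zip(trace, trace[1:]):
        if dst != src + 4 and (src, dst) not in valid_edges:
            return False, (src, dst)   # flow not in CFG: possible FIA
    return True, None
```

A skipped call or corrupted branch shows up as a transfer that is neither sequential nor a known CFG edge, which is the class of faults the first approach detects.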

Combination of ROP Defense Mechanisms for Better Safety and Security in Embedded Systems
PRESENTER: Mario Schölzel

ABSTRACT. This paper proposes the combination of two different techniques in order to increase the safety and security of embedded systems against ROP attacks. Control-flow integrity (CFI) checks are used in desktop systems in order to protect them from various forms of attacks, but they are rarely investigated for embedded systems due to their introduced overhead. We provide an efficient software implementation of a CFI check for ARM and Xtensa processors. In particular, we show that the combination of this CFI check with another defense technique against ROP attacks significantly improves the security property. Moreover, it also increases the safety of the system, since the combination can detect a failed ROP attack and bring the system into a safe state, which is not possible when using each technique separately. We also report on the introduced overhead in code size and run time as well as the security property.

Towards Fine-grained Side-Channel Instruction Disassembly on a System-on-Chip

ABSTRACT. Side-channel based instruction disassembly (SCBD) is a family of side-channel attacks that aims at recovering the code executed by a device from physical measurements. Over the past decades, research has proved that instruction-level disassembly is feasible on simple controllers. But the computing power and architectural complexity of processors are increasing, even in constrained devices. Performing side-channel attacks on mid- or high-end devices is inherently harder because of complex concurrent activities and a significant amount of noise. While broad pattern identification, such as of cryptographic primitives, has been proved possible, the feasibility of precise SCBD remains an open question on a complex System-on-Chip (SoC). In this work, we address some of the technical challenges involved in performing SCBD on SoCs. We propose an experimental setup and measurement methodology that enables reliable characterization of instruction-level electromagnetic (EM) leakages. We study the feasibility of three code reconstruction granularities: functional unit recognition, opcode recognition and full instruction recovery. Under a controlled experimental environment, our results show that functional unit recognition is achievable (100% classification accuracy), as is opcode recognition (with evidence of leakage). In our setup, full instruction recovery (i.e., bit-level encoding) turned out to be more challenging. We show that the classification accuracy on instruction bits is better than random guessing and can be improved by combining multiple EM probe positions, but it is not high enough to foresee an attack in a real environment.

Side-Channel Analysis of Saber KEM Using Amplitude-Modulated EM Emanations

ABSTRACT. In the ongoing last round of NIST’s post-quantum cryptography standardization competition, side-channel analysis of the finalists is a main focus of attention. While their resistance to timing, power and near-field electromagnetic (EM) side-channels has been thoroughly investigated, amplitude-modulated EM emanations have not been considered so far. Attacks based on amplitude-modulated EM emanations are more stealthy because they exploit side-channels intertwined into the signal transmitted by an on-chip antenna; thus, they can be mounted at a distance from the device under attack. In this paper, we present the first results of an amplitude-modulated EM side-channel analysis of one of the NIST PQ finalists, the Saber key encapsulation mechanism (KEM), implemented on the nRF52832 (ARM Cortex-M4) system-on-chip supporting Bluetooth 5. By capturing amplitude-modulated EM emanations during decapsulation, we can recover each bit of the session key with 0.91 probability on average.

10:30-12:00 Session 3D: ASHWPA
In vitro Testbed Platform for Evaluating Small Volume Contrast Agents via Magnetic Resonance Imaging

ABSTRACT. Quantitative magnetic resonance imaging (MRI) is a non-invasive imaging method with high resolution and unlimited penetration depth. Contrast agents (CAs) can assist in disease diagnosis and tissue screening via MRI. In vitro characterization of CAs in development is often carried out using sample sizes in the milliliter range or higher. Particularly when reagent costs are high, MRI would benefit from a standard platform for precise quantification of small volume CAs (microliter scale), ultimately enabling translation from in vitro to in vivo applications. In this initial study, we developed and evaluated a microliter-scale concentric “MiSCo” testbed as a platform to optimize MRI quantification of small volume samples in vitro. The platform facilitated accurate, repeatable, and reproducible MRI quantitative measurements with a 5-fold and 30-fold increase in precision and signal-to-noise-ratio, respectively, when compared to more traditional configurations. We believe this approach could serve as a path for future improvements in the field of quantitative MRI, ensuring high sensitivity measurements of small volume CAs.

A Smart Floor Device of an Exergame Platform for Elderly Fall Prevention

ABSTRACT. The high risk of falls in the elderly and their severe consequences make research to prevent them an important priority for public health. Technologies that enable exercise in the form of games (exergames) can improve both cognitive and physical functions, which is a prerequisite for reducing fall risk. However, the key question is whether such tools are easy for the elderly to use. In this work, the development of a smart floor device for exergames that can motivate the elderly to perform physical exercises is presented. The underlying hypothesis is that technologies that do not require a complex user interaction environment, such as the proposed smart floor, are more suitable for the elderly to use for improving their physical and cognitive functions. The design and development of the smart floor leverages features from the Internet of Things domain and follows the design principles for system composability, using artifact tiles as building blocks. The inherent ability to evaluate measures that can predict the risk of falling, such as the choice stepping reaction time, can promote the smart floor as a diagnostic tool.

GPU Based Implementation for the Pre-Processing of Radar-Based Human Activity Recognition

ABSTRACT. The correlation between age and the increase in falls is a real problem: given the growing proportion of elderly people in the population, new ways to monitor the elderly are needed. The confidentiality of radar data, coupled with its richness of information, can address one of the weaknesses of existing technologies; however, the huge amount of radar data to be processed then becomes a challenge for the detection speed necessary for the well-being of the elderly. We introduce a new architecture using the GPU that improves processing time margins. The radar used is a commercially available frequency-modulated continuous-wave (FMCW) Ancortek radar (SDR 980AD2). It is followed by a pre-processing chain consisting of a Fast Fourier Transform, a filter, and a Short-Time Fourier Transform (STFT) to obtain time-velocity maps, or spectrograms, from which characteristics of gait and human activities are extracted. An implementation with cuFFT on a Jetson Xavier allows us to keep up with the data flow in processing. This architecture increases the performance margin for the downstream processing chain, with an acceleration factor of 10.49 compared to a previously presented architecture [19]. Continuous monitoring of the subject can save lives, minimize injuries, reduce anxiety and prevent post-fall syndrome.
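The pre-processing chain described above (range FFT, clutter filter, STFT) can be modelled compactly in NumPy. The sketch below is illustrative, with arbitrary sizes and a synthetic signal, and stands in for the paper's cuFFT implementation:

```python
import numpy as np

def radar_spectrogram(iq, n_range=64, n_stft=32, hop=8):
    """Minimal FMCW pre-processing chain: a range FFT per chirp, a
    two-pulse moving-target-indication (MTI) filter that suppresses
    static clutter, then an STFT over chirps in the strongest range
    bin to obtain a time-velocity map (spectrogram)."""
    chirps = iq.reshape(-1, n_range)
    rng = np.fft.fft(chirps, axis=1)            # range FFT
    mti = rng[1:] - rng[:-1]                    # clutter filter
    cell = mti[:, np.argmax(np.abs(mti).sum(axis=0))]
    win = np.hanning(n_stft)
    frames = [cell[i:i + n_stft] * win
              for i in range(0, len(cell) - n_stft + 1, hop)]
    return np.abs(np.fft.fftshift(np.fft.fft(frames, axis=1), axes=1))

# Synthetic test signal: a single tone across 64 chirps of 64 samples.
t = np.arange(64 * 64)
spec = radar_spectrogram(np.exp(2j * np.pi * 0.1 * t))
```

On a GPU, each stage maps to a batched FFT plus elementwise work, which is why the chain parallelises so well with cuFFT.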

On the Validation of Multi-Level Personalised Health Condition Model
PRESENTER: Najma Taimoor

ABSTRACT. This paper presents a verification-based methodology to validate models of personalised health conditions by identifying the values that may result in unsafe, unreachable, non-exhaustive, or overlapping states, which can otherwise threaten the patient's life as a result of producing false alarms or accepting suspicious behaviour of the target health condition. Contemporary approaches to validating a model employ various testing, simulation and model-checking techniques to recognise such vulnerabilities. However, these approaches are neither systematic nor exhaustive and thus fail to identify the false values or computations that estimate the health condition at run-time based on the sensor or input data received from various IoT medical devices. We demonstrate our methodology by validating an example multi-level model that describes three different scenarios of the diabetes health condition.

CELR: Cloud Enhanced Local Reconstruction from Low-dose Sparse Scanning Electron Microscopy Images

ABSTRACT. Current Scanning Electron Microscopy (SEM) acquisition techniques are far too slow to capture large volumes in a feasible time. A solution is to employ low-dose and sparse imaging. By computational denoising and inpainting, an image of acceptable quality can be approximated. This approach, however, requires significant compute resources. Therefore, this paper proposes CELR, a framework that hides the computationally expensive workload of reconstructing low-dose sparse SEM images such that (delayed) live reconstruction is possible. Live reconstruction is made possible by using Convolutional Neural Networks (CNNs) that approximate a classical reconstruction algorithm like GOAL. Reconstruction by CNNs is done locally, while recurring training of the CNNs is done in the cloud, where the training labels are also generated by GOAL. Next to the framework, this paper evaluates and optimizes the CNN reconstruction throughput by employing Nvidia’s TensorRT. This paper also touches upon open research questions about on-the-fly CNN training. The combination of CELR and TensorRT enables large-volume acquisitions with a dwell time of 1 $\mu s$ and 10\% pixel coverage to be reconstructed on a single GPU.

12:00-13:00 Session 4: [Keynote] Dr. Heiko Koziolek

Title: Software Architecture Challenges in Industrial Process Automation: from Code Generation to Cloud-native Service Orchestration

Abstract: Large, distributed software systems with integrated embedded systems support production plant operators in controlling and supervising complex industrial processes, such as power generation, chemical refinement, or paper production. With several million lines of code these Operational Technology (OT) systems grow continuously more complex, while customers increasingly expect a higher degree of automation, easier customization, and faster time-to-market for new features. This has led to an ongoing adoption of modern Information Technology (IT) reference software architectures and approaches, e.g., middlewares, model-based development, and microservices. This talk presents illustrative examples of this trend from technology transfer projects at ABB Research, highlighting open issues and research challenges. These include information modeling in M2M middlewares for plug-and-play functionality, code generation from engineering requirements to speed up customization, as well as online updates of containerized control software on virtualized infrastructures.

13:00-14:30 Lunch Break

Buffet lunch at Lopesan Baobab Resort.

14:30-16:00 Session 5A: Approximate Computing
EARL: An Efficient Approximate HaRdware Framework for AcceLerating Fault Tree Analysis

ABSTRACT. This paper proposes an efficient hardware framework that utilizes approximate computing units to reduce fault tree (FT) simulation time while considering accuracy and energy. To do so, we first introduce two hybrid-precision computational units, which can operate in either accurate or approximate mode. Then, we utilize these computational units and an accuracy propagation technique to evaluate the desired accuracy level of each computational part. Finally, EARL, the proposed framework, taking the determined accuracy levels into account, predicts in an offline stage through a machine learning technique the best hardware platform (i.e., CPU, FPGA, or GPU) for executing each benchmark and provides a trade-off among simulation time, accuracy, and energy consumption. To demonstrate EARL's efficiency, we study the case of reliability calculation through fault tree analysis; however, EARL is applicable to top-down structured computational problems in general. Our evaluations show that the proposed framework improves the simulation time and energy consumption of fault tree analysis on average by 78.1% and 72.6%, respectively, with negligible accuracy loss.
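Fault tree analysis itself is a small bottom-up computation over AND/OR gates, which is what makes mixed-precision evaluation attractive. The sketch below (illustrative Python, not EARL) evaluates a toy tree in accurate and reduced precision to show the accuracy trade-off:

```python
import numpy as np

def fault_tree_prob(gate, precision=np.float64):
    """Bottom-up fault tree evaluation.  A gate is ('AND', ...) or
    ('OR', ...) over sub-gates; a leaf is a basic-event probability.
    AND multiplies child probabilities; OR combines them as
    1 - prod(1 - p).  `precision` mimics routing the computation to
    an accurate or approximate unit."""
    if not isinstance(gate, tuple):
        return precision(gate)
    op, *children = gate
    probs = [fault_tree_prob(c, precision) for c in children]
    if op == 'AND':
        return precision(np.prod(probs))
    return precision(1 - np.prod([1 - p for p in probs]))

tree = ('OR', ('AND', 0.01, 0.02), 0.001)
exact = fault_tree_prob(tree)                 # accurate mode (float64)
approx = fault_tree_prob(tree, np.float16)    # approximate mode
```

The gap between `exact` and `approx` is the per-subtree error an accuracy-propagation step would bound before deciding which parts may run approximately.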

ESAS: Exponent Series based Approximate Square Root Design

ABSTRACT. Approximate computing is an emerging methodology that offers hardware benefits compared with traditional computing design at the cost of accuracy. It is highly suitable for applications that do not require exact results. Many arithmetic designs have evolved over the years using approximate methodologies. The square root is a common yet complex hardware unit, often employed in image processing and communication system design applications; however, few hardware implementations of the square-root function exist. In this paper, a novel square-root design is proposed that offers better accuracy and improved hardware results compared to previous works. The proposed design utilizes the first two terms of the exponent series expansion and applies two levels of approximation to evolve not only hardware-efficient square-root designs but also improved error characteristics. The approximate square-root design was implemented in three data formats: integer, fixed point, and IEEE half-precision floating point. The proposed designs were validated on a Sobel edge detection algorithm and an envelope detector for communication design to provide accelerated performance.
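The general idea of a series-based square root can be illustrated as follows. Note that the paper's exact two-term formulation and its two approximation levels are not reproduced here, so this range-reduced sketch is only indicative of the approach and its error behaviour:

```python
import math

def approx_sqrt(x):
    """Series-based square root sketch: range-reduce x = m * 4**k with
    m in [1, 4), then approximate sqrt(m) = exp(ln(m) / 2) by the
    first two terms of the exponent series, exp(y) ~ 1 + y."""
    if x == 0:
        return 0.0
    m, k = x, 0
    while m >= 4:          # range reduction: divide out powers of 4
        m /= 4
        k += 1
    while m < 1:
        m *= 4
        k -= 1
    y = math.log(m) / 2
    return (1 + y) * 2 ** k   # two-term series, rescaled by 2**k
```

Range reduction keeps the series argument small, which is what bounds the relative error of a truncated expansion; in hardware, the powers of four become cheap shifts.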

An Approximate Carry Disregard Multiplier with Improved Mean Relative Error Distance and Probability of Correctness
PRESENTER: Nima Taherinejad

ABSTRACT. Nowadays, a wide range of applications can tolerate certain computational errors. Hence, approximate computing has become one of the most attractive topics in computer architecture. Reducing the accuracy of computations in a premeditated and appropriate manner reduces architectural complexity, and as a result, performance, power consumption, and area can improve significantly. This paper proposes a novel approximate multiplier design. The proposed design has been implemented using 45 nm CMOS technology and has been extensively evaluated. Compared to existing approximate architectures, the proposed approximate multiplier has higher accuracy. It also achieves better results in critical path delay, power consumption, and area, by up to 47.54%, 75.24%, and 92.49%, respectively. Compared to precise multipliers, our evaluations show that the critical path delay, power consumption, and area have been improved by 39%, 18%, and 6%, respectively.
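The effect of disregarding carries can be illustrated with a toy model that drops the low-order part of each partial product, together with the mean relative error distance (MRED) metric used to judge such designs. This is not the paper's architecture, only an illustration of the metric and the error/hardware trade-off:

```python
import random

def approx_multiply(a, b, cut=8):
    """Toy carry-disregard multiplier: each partial product is added
    with its lowest `cut` bits cleared, so carries from the low-order
    part never propagate upward (an illustration, not the paper's
    exact design)."""
    high_mask = ~((1 << cut) - 1)
    result = 0
    for i in range(16):                 # 16-bit second operand
        if (b >> i) & 1:
            result += (a << i) & high_mask
    return result

def mred(cut=8, trials=500, seed=1):
    """Mean relative error distance (MRED) over random operand pairs."""
    rnd = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        a = rnd.randrange(1, 1 << 16)
        b = rnd.randrange(1, 1 << 16)
        total += abs(a * b - approx_multiply(a, b, cut)) / (a * b)
    return total / trials
```

Raising `cut` removes more adder logic (the hardware win) while increasing MRED, which is exactly the accuracy/area trade-off the paper optimises.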

A Majority-based Approximate Adder for FPGAs

ABSTRACT. The most advanced ASIC-based approximate adders focus on gate- or transistor-level approximating structures. However, due to architectural differences between ASICs and FPGAs, comparable performance gains cannot be obtained for FPGA-based approximate adders using ASIC-based approximation techniques. In this paper, we propose a method for designing a low-error approximate adder that effectively exploits the modern FPGA structure. We introduce an FPGA-based approximate adder, named the Majority Approximate Adder (MAA), with less error than advanced approximate adders. MAA is constructed from an approximate part and an accurate part; the accurate part is based on a smaller carry chain than that of the corresponding accurate adder, while the approximate part is designed to use FPGA resources efficiently with a low mean error distance (MED). Experimental results based on Monte Carlo simulation demonstrate that a 16-bit MAA has a 49.92\% lower MED than the state-of-the-art FPGA-based approximate adder. MAA also occupies less area and consumes less power than other FPGA-based approximate adders in the literature.
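The accurate-part/approximate-part structure of such split adders, and the mean error distance (MED) metric, can be illustrated with the classic lower-part OR adder (LOA) as a stand-in; MAA's majority-function LUT mapping is not reproduced here:

```python
def loa_add(x, y, width=8, split=4):
    """Lower-part OR adder (LOA), sketching the accurate/approximate
    split: the low `split` bits are approximated bitwise by OR, and
    the high bits use an exact (and therefore shorter) carry chain
    seeded by an AND-based carry guess from the top low-part bit pair."""
    low_mask = (1 << split) - 1
    approx_low = (x | y) & low_mask
    carry_guess = (x >> (split - 1)) & (y >> (split - 1)) & 1
    return (((x >> split) + (y >> split) + carry_guess) << split) | approx_low

def mean_error_distance(width=8, split=4):
    """Exhaustive mean error distance (MED) over all operand pairs."""
    n = 1 << width
    total = sum(abs((x + y) - loa_add(x, y, width, split))
                for x in range(n) for y in range(n))
    return total / (n * n)
```

Shortening the exact carry chain is what buys delay and area on the FPGA; the MED quantifies the price, and designs like MAA aim to lower it for the same split.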

14:30-16:00 Session 5B: AAMTM-2
Breaking (and Fixing) Channel-based Cryptographic Key Generation: A Machine Learning Approach

ABSTRACT. Transportation systems are undergoing historic developments by relying on machine learning and on communication through Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I) protocols. Such connected, intelligent, and collaborative transportation systems represent a promising trend towards autonomous vehicles evolving in smart roads and cities. However, the safety-critical nature of these cyber-physical systems requires a systematic study of their security and privacy. In fact, security-sensitive information such as security alerts or payments could be transmitted between vehicles, or between vehicles and the infrastructure. Since asymmetric cryptography is costly to implement on embedded time-critical devices, and PKI-based solutions add further complexity, symmetric cryptography offers higher performance. However, cryptographic key generation and establishment in symmetric cryptosystems is a great challenge. Recent work proposed a key generation and establishment protocol based on the reciprocity and the high spatial and temporal variation of the vehicular communication channel.

This paper investigates the limitations of channel-based key generation protocols. Using a channel model together with a machine learning approach, we show that a passive eavesdropper can generate the same key in a practical manner, thereby undermining the security of such key establishment techniques. Moreover, we propose a defense based on adversarial machine learning to overcome this limitation.

Hardware-Software Codesign of a CNN Accelerator

ABSTRACT. The explosive growth of deep learning applications based on convolutional neural network (CNN) in embedded systems is spurring the development of a hardware CNN accelerator, called a neural processing unit (NPU). In this work, we present how the hardware-software codesign methodology could be applied to the design of a novel adder-type NPU. After devising a baseline datapath that enables fully-pipelined execution of layers, we define a high-level behavior model based on which a high-level compiler and a virtual prototyping system are built concurrently. Since it is easy to change the microarchitecture of an NPU by modifying the simulation models of the hardware modules, we could explore the design space of NPU microarchitecture easily. In addition, we could evaluate the effect of hardware extensions to support various types of non-convolutional operations that recent CNN models use widely. After the final datapath is determined, we design the control structure and low-level compiler and implement the NPU prototype. Implementation results on an FPGA prototype show the viability of the proposed methodology and its outcome.

Hardware Accelerator and Neural Network Co-Optimization for Ultra-Low-Power Audio Processing Device

ABSTRACT. The increasing spread of artificial neural networks does not stop at ultra-low-power edge devices. However, such networks very often have high computational demands and require specialized hardware accelerators to ensure the design meets power and performance constraints. The manual optimization of neural networks together with their hardware accelerators can be very challenging. This paper presents HANNAH (Hardware Accelerator and Neural Network seArcH), a framework for automated, combined hardware/software co-design of deep neural networks and hardware accelerators for resource- and power-constrained edge devices. The optimization approach uses an evolution-based search algorithm, a neural network template technique, and analytical KPI models for the configurable UltraTrail hardware accelerator template in order to find an optimized neural network and accelerator configuration. We demonstrate that HANNAH can find suitable neural networks with minimized power consumption and high accuracy for different audio classification tasks such as single-class wake-word detection, multi-class keyword detection, and voice activity detection, outperforming the related work.

SNAP: Selective NTV Heterogeneous Architectures for Power-Efficient Edge Computing

ABSTRACT. While there is a growing need to process ML inference on the edge for improved latency and extra security, general-purpose solutions alone cannot cope with the increasing performance demand under power restrictions. Considering that systolic arrays are a prominent, but also power-hungry solution, we propose a methodology to enable their use in edge devices. For that, we propose SNAP, a selective Near-Threshold Voltage (NTV) strategy to explore heterogeneous MPSoCs with two voltage islands, one at NTV, and another at nominal voltage. By adopting a dynamic programming approach, SNAP may selectively apply NTV to the systolic array and to an optimal subset of cores in RISC-V-based MPSoCs, enabling ML acceleration on the edge. Combined with a smart application mapping, the strategy increases performance by up to 18.9% over a nominal design within the same power limits.

MVSTT: A Multi-Value Computation-in-Memory based on Spin-Transfer Torque Memories

ABSTRACT. Analog Computation-in-Memory (CiM) with emerging non-volatile memories reduces data movement and increases power efficiency. Spin-Transfer Torque Magnetic Memory (STT-MRAM) is one of the promising technologies for CiM architectures. Although STT-MRAM has various benefits, it cannot be used directly in analog non-binary CiM operations due to its limited number of cell resistance states. In this paper, we propose a novel flexible multi-value design for STT-MRAM (MVSTT) with the potential to be used for multi-value CiM. The resolution of the operands can be adjusted at run time depending on the application's requirements. The benefits of the proposed scheme are quantified on representative applications such as multi-valued matrix multiplication, the basis of neural network applications. For multi-valued matrix multiplication, the energy and delay gains are up to 9.7× and 13.3×, respectively, over non-CiM vector-matrix multiplication. Also, for neural networks, the design allows up to a 10.67× reduction in STT-MRAM cells per crossbar while achieving the same inference accuracy as a binarized neural network.

14:30-16:00 Session 5C: AHSA-2
Efficient Modular Polynomial Multiplier for NTT Accelerator of Crystals-Kyber

ABSTRACT. This paper presents a hardware design that efficiently performs the number theoretic transform (NTT) for lattice-based cryptography. First, we propose an efficient modular multiplication method for lattice-based cryptography defined over Proth numbers. The proposed method is based on the K-RED technique specific to Proth numbers. In particular, we divide the intermediate result into the sign bit and the remaining absolute-value bits and handle them separately to significantly reduce implementation costs. Then, we present a butterfly-unit datapath for the NTT and inverse NTT (INTT) equipped with the proposed modular multiplier. We apply the proposed NTT accelerator to Crystals-Kyber, a lattice-based scheme, and evaluate its performance on a Xilinx Artix-7. The results show that the proposed NTT accelerator is about 1.3 times more efficient in terms of area-time product (ATP) in FFs and DSPs than existing methods.
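The K-RED idea for Proth primes can be sketched for Kyber's modulus q = 3329 = 13·2^8 + 1. The sketch below shows only the core congruence k·c ≡ k·c_low − c_high (mod q); the paper's sign-bit splitting and hardware bit-width tuning are not modeled, and the variable names are ours.

```python
Q = 3329        # Kyber modulus, a Proth-style prime: 13 * 2**8 + 1
K, M = 13, 8

def kred(c):
    """K-RED-style reduction: for p = k*2^m + 1 we have k*2^m ≡ -1 (mod p),
    so k*c = k*(c_high*2^m + c_low) ≡ k*c_low - c_high (mod p).
    The result carries an extra factor of K, which an NTT implementation
    folds into precomputed twiddle constants."""
    c_low = c & ((1 << M) - 1)
    c_high = c >> M
    return K * c_low - c_high

# sanity check: the result is congruent to 13*c modulo 3329
assert (kred(12345) - K * 12345) % Q == 0
```

The payoff is that the reduction needs only a shift, a mask, a small constant multiply, and a subtraction, with no division by q.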

Open Source Hardware Design and Hardware Reverse Engineering: A Security Analysis
PRESENTER: Johanna Baehr

ABSTRACT. Major industry-led initiatives such as RISC-V and OpenTitan strive for verified, customizable and standardized products, based on a combination of Open Source Hardware (OSHW) and custom intellectual property (IP), to be used in safety and security-critical systems. The protection of these products against reverse-engineering-based threats such as IP Theft and IP Piracy, hardware Trojan (HT) insertion, and physical attacks is of equal importance as for closed source designs. OSHW generates novel threats to the security of a design and the protection of IP. This paper discusses to what extent OSHW reduces the difficulty of attacking a product. An analysis of the reverse engineering process shows that OSHW lowers the effort to retrieve broad knowledge about a product and decreases the success of related countermeasures. In a case study on a RISC-V core and an AES design, the red team uses knowledge about OSHW to circumvent logic locking protection and successfully identify the functionality and the used locking key. The paper concludes with an outlook on the secure protection of OSHW.

Implementation of the Rainbow signature scheme on SoC FPGA

ABSTRACT. Thanks to research progress, quantum computers are slowly becoming a reality, and some companies already have working prototypes. While this is great news for some, it also means that some of the encryption algorithms used today will be rendered unsafe and obsolete. For this reason, NIST (the US National Institute of Standards and Technology) has been running a standardization process for quantum-resistant key exchange algorithms and digital signatures. One of the candidates is Rainbow, a signature scheme based on the fact that solving a random system of multivariate quadratic equations is NP-hard.

This work develops an AXI-connected accelerator for the Rainbow signature scheme, specifically the Ia variant. The accelerator is highly parameterizable, allowing the data bus width to be chosen, which directly affects the FPGA area used. It is also possible to swap components to adapt the design to other variants of Rainbow. This allows for a comprehensive experimental evaluation of our design.

The developed accelerator provides significant speedup compared to CPU-based computation. This paper includes detailed documentation of the design as well as performance and resource utilisation evaluation.

Be My Guess: Guessing Entropy vs. Success Rate for Evaluating Side-Channel Attacks of Secure Chips

ABSTRACT. In the theoretical context of side-channel attacks, optimal bounds between success rate and guessing entropy are derived with a simple majorization (Schur-concavity) argument. They are further refined for different versions of the classical Hamming weight leakage model, in particular assuming a priori equiprobable secret keys and additive white Gaussian measurement noise. Closed-form expressions and numerical computations are given. A study of the impact of the choice of substitution box on side-channel resistance reveals that its nonlinearity tends to homogenize the expressivity of success rate and guessing entropy. The intriguing approximate relation GE = 1/SR is observed in the case of 8-bit bytes and low noise.
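For reference, the two metrics being related are usually defined as follows (this is the standard formulation from the side-channel literature, not text taken from the paper):

```latex
% Let r(k^*) be the rank of the correct key k^* among the 2^n
% candidates, sorted by decreasing attack score.
\mathrm{SR} = \Pr\!\left[ r(k^*) = 1 \right],
\qquad
\mathrm{GE} = \mathbb{E}\!\left[ r(k^*) \right]
            = \sum_{i=1}^{2^n} i \cdot \Pr\!\left[ r(k^*) = i \right].
```

Under these definitions, the paper's low-noise observation for n = 8 reads GE ≈ 1/SR: as the attack's first-rank probability grows, the expected rank of the correct key shrinks roughly reciprocally.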

Electromagnetic Leakage Assessment of a Proven Higher-Order Masking of AES S-Box

ABSTRACT. Many digital systems need to provide cryptographic capabilities. A large part of these devices is easily accessible to a malicious user and may be vulnerable to side-channel attacks such as power or electromagnetic analysis. On one side, the designer has to protect the architecture with proven countermeasures; on the other, the actual implementation must be validated in order to prove the absence of undesired leakages. In this paper, we present an implementation of two optimized and proven masking schemes of order 3 and 7 for an embedded software AES, and prove its robustness by showing the absence of significant leakage in the nonlinear layer.

14:30-16:00 Session 5D: DTFT
Towards Resilient QDI Pipeline Implementations
PRESENTER: Zaheer Tabassam

ABSTRACT. QDI circuits are robust to timing issues, but this elasticity makes them vulnerable in value-domain fault scenarios: data-accepting windows are flexibly defined by the handshakes, and during these windows any data transition gets latched, even transitions originating from single-event transients. Locking the data-accepting windows after the first transition contributes to robustness, but still requires careful consideration. We examine two WCHB variants, Interlocking-WCHB and Input/Output-Interlocking-WCHB, in this respect. To highlight the relevant error-triggering conditions, we chose two target circuits, a FIFO and a pipelined multiplier, and investigate their behavior in detail. Based on the experimental results, we analyze the observed errors to understand the main causes of their generation and propagation. We highlight the problematic scenarios and propose modifications to the buffer styles that resolve most of them while limiting the area overhead to 50%.

Verifying Liveness and Real-Time of OS-Based Embedded Software

ABSTRACT. Embedded devices are fundamental to a huge variety of application areas, with a wide range of complexity and criticality. Therefore, they must satisfy a variety of (non-)functional requirements, and reliable strategies are necessary to guarantee that these requirements are met. As an additional complication, modern software systems are composed of various modules that interact and interfere at runtime. Operating systems are then used to manage the concurrency, but introduce additional complexity and runtime effects. Where testing is no longer sufficient to assess software correctness, formal methods are becoming increasingly popular. Respecting the typical layering of embedded software, we propose a generic formal modeling scheme for OS-based application tasks and the formal verification of their liveness and real-time requirements. We use UPPAAL to model the software composition as the conjunction of application tasks and the OS, taking the interaction and interference of tasks through the OS into account. Although this paper focuses on only two requirements, the modeling strategy is generic and extensible, meaning that additional requirements can be modeled and verified in a similar manner. An evaluation shows the general benefit of the approach as well as the impact of various factors on verification complexity and scalability.

IMMizer: An Innovative Cost-Effective Method for Minimizing Assertion Sets

ABSTRACT. Assertion-based verification is one of the viable solutions for the verification of computer systems. Assertions can be generated automatically by assertion miners; however, these miners typically generate a high number of possibly redundant assertions. In turn, this results in higher costs and overheads in the verification process. Furthermore, these assertions often have low readability due to the high number of propositions they contain. In this paper, an innovative cost-effective method for minimizing assertion sets (IMMizer) is proposed. IMMizer works by identifying Contradictory Terms, which represent the behaviors of the design under verification that are not specified by the initial assertion set. Subsequently, a new assertion set is extracted based on the identified Contradictory Terms. Contrary to data-mining approaches, which cannot minimize the initial assertion set but only rank it according to data-mining measures, or mutant-analysis approaches, which require a long execution time, IMMizer minimizes the initial assertion set in a very short execution time. Experimental results show that, in the best case, the method reduces the number of assertions by 93% and the memory overhead imposed on the system by 87%, without any reduction in the detection of injected mutants.

Nonlinear Compression Block Codes Search Strategy

ABSTRACT. This paper deals with extending linear compression codes by nonlinear check bits that improve the usability of decompressed patterns for testing circuits with more inputs. Earlier works used a purely random or partially random search of the nonlinear check-bit truth tables to construct the first nonlinear structures. Here, we derive deterministic rules that characterize the relationship among the nonlinear code check bits. The efficiency of the rules is demonstrated on different codes with the number of specified bits equal to three. The code parameters obtained after applying the rules outperform those of the linear codes. Keeping the restrictions makes the search for the check-bit truth tables faster and more efficient than a simple random search. The resulting nonlinear block code (136,5,3) is the most efficient code among other loose compression codes.

Verification of Calculations of Non-Homogeneous Markov Chains Using Monte Carlo Simulation
PRESENTER: Hana Kubatova

ABSTRACT. Dependability models allow calculating the rate of events leading to a hazard state – a situation in which the safety of the modeled dependable system is violated, so that the system may cause material loss, serious injuries, or casualties. The calculation of the hazard rate of complex non-homogeneous Markov chains is time-consuming, and the accuracy of the results is questionable. In previous papers, we presented two methods able to calculate the hazard rate of complex non-homogeneous Markov chains. Both methods achieved very accurate results; in this paper, we therefore compare four Monte Carlo-based simulation methods with our methods in terms of both accuracy and computation time. A simple Triple Modular Redundancy (TMR) model is used, since its hazard rate can be calculated analytically.
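The kind of analytic-versus-Monte-Carlo cross-check described above can be sketched for the simplest TMR quantity: the probability that a 2-out-of-3 voter system works when each module works independently with probability r. This is our own illustration of the validation idea, not the paper's Markov-chain models.

```python
import random

def tmr_reliability_analytic(r):
    """Closed-form 2-out-of-3 reliability: either exactly two modules
    work, 3*r^2*(1-r), or all three do, r^3 — which sums to 3r^2 - 2r^3."""
    return 3 * r**2 - 2 * r**3

def tmr_reliability_mc(r, trials=200_000, seed=1):
    """Monte Carlo estimate of the same quantity: draw three independent
    module outcomes per trial and count trials with at least two successes."""
    rng = random.Random(seed)
    ok = 0
    for _ in range(trials):
        working = sum(rng.random() < r for _ in range(3))
        ok += working >= 2
    return ok / trials

r = 0.9
analytic = tmr_reliability_analytic(r)   # 3*0.81 - 2*0.729 = 0.972
estimate = tmr_reliability_mc(r)
assert abs(analytic - estimate) < 0.01
```

With 200,000 trials the standard error of the estimate is well below 0.001, so agreement to two decimal places is a meaningful (if simple) validation of the simulator.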

16:00-16:30Coffee Break

Expomeloneras's Hall

16:30-17:00 Session 6B: Virtual Poster Session
Is the Whole lesser than its Parts? Breaking an Aggregation based Privacy aware Metering Algorithm

ABSTRACT. Smart metering is a mechanism through which fine-grained electricity usage data of consumers is collected periodically in a smart grid. A growing concern in this regard is that the leakage of consumers' consumption data may reveal their daily life patterns, as state-of-the-art metering strategies lack adequate security and privacy measures. Many proposed solutions have demonstrated how aggregated metering information can be transformed to obscure individual consumption patterns without affecting the intended semantics of smart grid operations. In this paper, we expose a complete break of one such privacy-preserving metering scheme [8] by efficiently determining individual consumption patterns, thus compromising its privacy guarantees. The underlying methodology of this scheme allows us to i) retrieve the lower bounds of the privacy parameters and ii) establish a relationship between the privacy-preserved output readings and the initial input readings. Subsequently, we present a rigorous experimental validation of the proposed attack methodology on a real-life dataset to highlight its efficacy. In summary, the present paper asks "Is the Whole lesser than its Parts?" for privacy-aware metering algorithms that attempt to reduce the information leakage of individuals' aggregated consumption patterns.

A Hybrid Scheduling Mechanism for Multi-programming in Mixed-Criticality Systems
PRESENTER: Mohammed Bawatna

ABSTRACT. In the last decade, the rapid evolution of Commercial-Off-The-Shelf (COTS) platforms has led safety-critical systems towards integrating tasks and applications with different criticality levels on a shared hardware platform, i.e., Mixed-Criticality Systems (MCSs). Therefore, several scheduling algorithms and approaches have been proposed based on a commonly used model, Vestal's model. However, consolidating software functions onto shared processors cannot be implemented directly in real-life applications and industrial systems while complying with certification requirements. Existing scheduling approaches do not provide a simple solution for eliminating the interference among tasks with different criticality levels on shared processing resources. Moreover, the system mode switch guarantees the timing constraints of the high-criticality tasks through the termination of the low-criticality tasks. In this paper, we develop a new scheduling algorithm based on the round-robin technique that addresses these challenges and improves overall schedulability. We evaluate the proposed algorithm against existing scheduling algorithms from both academia and industry through extensive experiments. Our results show schedulability improvements from 0.8% to 5.3% and from 2.7% to 10.7% over the conventional Earliest Deadline First with Virtual Deadline (EDF-VD) and Fixed Priority Preemptive (FPP) scheduling approaches, respectively.

A Low-complexity FPGA TDC based on a DSP Delay Line and a Wave Union Launcher

ABSTRACT. High-precision time-to-digital converters (TDCs) are key components for controlling quantum systems, and FPGAs have gained popularity for this task thanks to their low cost and flexibility compared with Application-Specific Integrated Circuits (ASICs). This paper investigates a novel FPGA architecture that combines a wave union launcher with DSP-based delay lines. The configuration achieves an 8.07 ps RMS resolution on a low-cost Zynq FPGA with a power usage of only 0.628 W. The low power consumption results from an operating frequency and logic resource usage that are lower than those of other methods, such as multi-chain DSP-based and multi-chain CARRY4-based TDCs.

How are Industry 4.0 Reference Architectures Used in CPPS Development?

ABSTRACT. Industry 4.0 (I4.0) reference architectures aim to facilitate the development of cyber-physical production systems (CPPS) by providing uniform conceptual structures and terminologies. However, their use in CPPS development has not been examined to date. The goal of this paper is to analyze existing approaches that integrate I4.0 reference architectures into the development process. The literature is categorized into three types, distinguished by the ways in which they make use of a reference architecture. Observations and conclusions focus on the specific reference architectures used and their architectural layers.

FP-SLIC: A Fully-Pipelined FPGA Implementation of Superpixel Image Segmentation

ABSTRACT. A superpixel segment is a group of pixels that carry similar information. The Simple Linear Iterative Clustering (SLIC) is a well-known algorithm for generating superpixels that offers a good balance between accuracy and efficiency. Nevertheless, due to its high computational requirements, the algorithm does not meet the demands of real-time embedded applications in terms of speed and resources. This paper proposes a fully-pipelined FPGA architecture of SLIC, dubbed FP-SLIC, that exhibits 1) a simplified and efficient algorithm of reduced computational complexity that facilitates algorithm development for FPGAs, 2) a fully pipelined FPGA design operating at 40MHz with a throughput of one pixel per cycle, and 3) a memory-efficient architecture that eliminates the requirement for external memory. Implementation results achieve 259 fps on the BSDS500 dataset, which is ≈ 8.6× more than the requirement for real-time performance (30 frames per second).