DSD2021: EUROMICRO CONFERENCE ON DIGITAL SYSTEMS DESIGN 2021
PROGRAM FOR WEDNESDAY, SEPTEMBER 1ST

09:00-10:00 Session 1: KEYNOTE 1 DSD

Keynote 1 - Prof. Dr. Mehrdad Dianati - University of Warwick (England) - Enabling and harvesting the benefits of cooperation among connected automated vehicles

10:00-11:30 Session 2A: ACCELERATORS

FPGA ACCELERATORS

10:00
A Framework for Hardware-Accelerated Design Space Exploration for Approximate Computing on FPGA

ABSTRACT. The demands for both increased performance and low power consumption on computing devices are outpacing technological improvements. Approximate computing is a design paradigm to leverage inherent error resilience of applications and trades in quality to reduce resource usage. Numerous approaches for approximation on FPGAs have been proposed in recent years and combining different methods can increase the resulting benefits in complex systems. Interactions between system components and error propagation necessitate a global design space exploration for the optimization of the approximation parameters. The loss of quality can be assessed by employing application-specific reference error metrics like PSNR or CIELAB ∆E which are well understood by designers and take error propagation implicitly into account. However, using a reference error metric can be very time-consuming, slowing down the design space exploration. To overcome this problem, we propose a framework for fast design space exploration of approximated FPGA designs in which the quality estimation is offloaded to an FPGA-based accelerator while the rest of the design space exploration is handled by a workstation PC. We evaluate the proposed framework on an image processing pipeline which is used to adapt image colors to be displayed correctly on a monitor. Our experiments show that using the accelerator yields similar results compared to a software-only setup and can speed up the exploration by a factor of over 200x.
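The abstract above names PSNR as one reference error metric used to judge approximation quality. As a rough illustration only, the following sketch computes PSNR between a reference and an approximated image in plain software. The function name `psnr`, the flat pixel-list representation, and the 8-bit `max_value` default are assumptions made for this example; they are not taken from the paper, whose accelerator evaluates the metric in FPGA hardware.

```python
import math


def psnr(reference, approximate, max_value=255):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences.

    Assumption for this sketch: images are flat sequences of integer samples
    in the range 0..max_value.
    """
    if len(reference) != len(approximate):
        raise ValueError("images must have the same number of pixels")
    # Mean squared error between reference and approximated output.
    mse = sum((r - a) ** 2 for r, a in zip(reference, approximate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: no error at all
    return 10 * math.log10(max_value ** 2 / mse)
```

A design space exploration loop would call such a metric once per candidate configuration, which is why its cost dominates when images are large.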

10:30
A RISC-V-based FPGA Overlay to Simplify Embedded Accelerator Deployment

ABSTRACT. Modern cyber-physical systems (CPS) are increasingly adopting heterogeneous systems-on-chip (HeSoCs) as a compute platform, to satisfy the demands of their sophisticated workloads. FPGA-based HeSoCs can reach high performance and energy efficiency, at the cost of increased design complexity. While High-Level Synthesis (HLS) is nowadays widely adopted to ease hardware IP design, automated tools still lack the required maturity to efficiently and easily tackle system-level integration of the many hardware and software blocks included in a modern CPS. In this paper, we present an innovative hardware overlay that efficiently abstracts the FPGA hardware details, offering a simpler methodology for the deployment of application-specific accelerators and the programmability of the resulting HW/SW platform. Our overlay is designed to allow plug-and-play integration of HLS-compiled or handcrafted acceleration IPs thanks to a customizable wrapper attached to the overlay interconnect and providing shared-memory communication to the overlay cores. The latter are based on the open RISC-V ISA, and offer a simplified software management of the acceleration IP. We provide a detailed characterization of the proposed overlay, highlighting the costs of our design, and the simplified deployment/programming methodology. Our experiments demonstrate that the overlay can reach competitive performance levels, even if running at a relatively low operating frequency, as well as the ease of integration of custom IPs.

11:00
A Power-Efficient Parameter Quantization Technique for CNN Accelerators

ABSTRACT. Quantization techniques are widely used in CNN inference to reduce the cost of hardware at the expense of small accuracy losses. However, after the quantization, there is still a multiplication cost for the fixed-point quantized CNN weights. Therefore, a novel CNN quantization technique is introduced, which can be implemented without using any multiplier. We evaluated our quantization technique using VGG-16 and Alexnet networks, and the Tiny ImageNet dataset. The quantization technique causes 0.39% and 0.98% accuracy losses for the 8-bit CNN weights compared to floating-point implementations of VGG-16 and Alexnet, respectively. Next, a fine-tuning method for our quantization is introduced, which further reduces the accuracy loss. The fine-tuning reduced the accuracy losses on 8-bit quantized VGG-16 and Alexnet to 0.24% and 0.39%, respectively. Two different processing element (PE) architectures, which do not include any multiplier hardware, are designed to perform multiply-accumulate (MAC) operations of CNN models quantized by our technique. Two different systolic array prototypes are designed employing the two PE architectures to compare with the traditional fixed-point MAC implementation. The systolic array architectures containing our processing element designs reduced the power consumption of the systolic array by up to 14.2% and 21.6%.

10:00-11:30 Session 2B: VIDEO PROCESSING

VIDEO PROCESSING

10:00
A Connected Component Labelling algorithm for multi-pixel per clock cycle video stream

ABSTRACT. This work describes the hardware implementation of a connected component labelling (CCL) module in reprogrammable logic. The main novelty of the design is the "full", i.e. without any simplifications, support of a 4 pixels per clock format (4 ppc) and real-time processing of a 4K/UltraHD video stream (3840 x 2160 pixels) at 60 frames per second. To achieve this, a special labelling method was designed, together with a mechanism that stalls the input data stream in order to process pixel groups which require writing more than one merger into the equivalence table. The proposed module was verified in simulation and in hardware on the Xilinx Zynq Ultrascale+ MPSoC chip on the ZCU104 evaluation board.
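For readers unfamiliar with the equivalence table mentioned in this abstract, the sketch below shows the classical software two-pass labelling scheme, processing one pixel at a time with 4-connectivity. It is not the 4 ppc hardware method of the paper; the function name `label_components`, the list-of-lists image format, and the choice of 4-connectivity are assumptions made purely for illustration.

```python
def label_components(image):
    """Two-pass 4-connected labelling of a binary image (list of rows of 0/1).

    Illustrative software baseline only; label 0 is reserved for background.
    """
    parent = [0]  # equivalence table: parent[label] points towards its root

    def find(label):
        # Follow the table to the representative label, halving paths on the way.
        while parent[label] != label:
            parent[label] = parent[parent[label]]
            label = parent[label]
        return label

    def union(a, b):
        # Record a merger: the larger root is redirected to the smaller one.
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]

    # First pass: assign provisional labels and record mergers.
    for y in range(rows):
        for x in range(cols):
            if not image[y][x]:
                continue
            up = labels[y - 1][x] if y > 0 else 0
            left = labels[y][x - 1] if x > 0 else 0
            if up and left:
                labels[y][x] = min(up, left)
                union(up, left)
            elif up or left:
                labels[y][x] = up or left
            else:
                parent.append(len(parent))  # new label is its own root
                labels[y][x] = len(parent) - 1

    # Second pass: replace every provisional label by its representative.
    for y in range(rows):
        for x in range(cols):
            if labels[y][x]:
                labels[y][x] = find(labels[y][x])
    return labels
```

With one pixel per step, at most one merger is recorded per pixel. When several pixels arrive in the same clock cycle, a single group can require more than one merger, which is the situation the abstract handles by stalling the input stream.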

10:30
An adaptive pixel accumulation algorithm for a 1D micro-scanning LiDAR

ABSTRACT. In advanced driver-assistance systems, LiDAR data are used for range detection and obstacle avoidance in combination with other sensors. The frame rate of a LiDAR sensor corresponds to the data availability that is crucial for efficient data fusion. In 1D micro-scanning LiDAR, pixel accumulation is introduced to increase the data signal-to-noise ratio and is typically performed a fixed number of times, which directly affects pixel acquisition time and frame rate. In this paper, we present an adaptive pixel accumulation algorithm that not only reduces the required on-chip memory array by compressing LiDAR raw data, but also increases data availability for occupancy grid computation by enabling early peak detection and eliminating unnecessary accumulation cycles whenever possible. Furthermore, we implemented this concept on an FPGA and compared its efficiency with a state-of-the-art approach.

11:00
BarMan: Managing the Resource Continuum in a Real Video Surveillance Scenario

ABSTRACT. Over the last years, the number of IoT devices has grown exponentially, highlighting the current Cloud infrastructure limitations. In this regard, Fog and Edge computing began to move part of the computation closer to data sources by exploiting interconnected devices as part of a single heterogeneous and distributed system in a computing continuum viewpoint. Since these devices are typically heterogeneous in terms of performance, features, and capabilities, this perspective should encompass programming models and run-time management layers. This work presents and evaluates the BarMan open-source framework by implementing a Fog video surveillance use-case. BarMan leverages a task-based programming model combined with a run-time resource manager and the novel BeeR framework to deploy the application's tasks transparently. Moreover, we developed a task allocation policy to maximize application performance, considering run-time aspects, such as load and connectivity, of the time-varying available devices. Through an experimental evaluation performed on a real cluster equipped with heterogeneous embedded boards, we evaluated different execution scenarios to show the framework's functionality and the benefit of a distributed approach, leading up to an improvement of 66% on the frame processing latency.

11:30-13:00 Session 3A: COPROCESSORS

COPROCESSORS

11:30
An efficient FPGA-based co-processor for sparse optical flow calculation in drone agents

ABSTRACT. The use of mobile agents is propagating throughout various industries. Nevertheless, the success of novel applications relies on the utilization of novel computing platforms and algorithms, including acceleration technology and onboard localization. We propose an FPGA-based sparse optical flow computing accelerator based on the FAST feature detection and BRIEF feature descriptor. The correspondences are found by splitting the image into static regions, where for each region, the feature points are tracked between frames. The accelerator is fully pipelined and achieves a performance of 300 fps with VGA resolution images. Experiments with the default configuration of the accelerator show that it supports reliable measurement of frame-to-frame image plane rotation of 9 degrees and translation of 24 pixels, with the total error below 0.4 degrees and 0.16 pixels.

12:00
Vector Processing Unit: A RISC-V based SIMD Co-processor for Embedded Processing

ABSTRACT. The computational intensity in embedded processing applications is increasing. This requires domain-specific embedded platforms in order to achieve maximum performance per watt of the system. With the arrival of open-source instruction set architectures, e.g. RISC-V, and different domain-specific architecture development toolchains, the trend of application-specific architectures is increasing. In this paper, a parameterizable Vector Processing Unit (VPU) is presented based on a subset of the V-extension from the RISC-V instruction set architecture for embedded processing. Two key configurable parameters for the proposed VPU are vector length (VLEN) and the number of lanes (execution functional units). These parameters allow design space exploration for the VPU for different configurations and help to understand which application scenarios would fit certain configurations. The proposed VPU was integrated into a 32-bit RISC-V processor. For the maximum parallelization configuration, 2.3x fewer cycles per instruction were achieved as compared to a RISC-V processor. Moreover, a relative cycle gain of 33-73% was achieved for different configurations as compared with the RISC-V processor.

12:20
An Open-Source Framework for the Generation of RISC-V Processor + CGRA Accelerator Systems

ABSTRACT. We describe a framework for automated generation of hybrid processor/accelerator systems comprising a RISC-V processor, and a coarse-grained reconfigurable array (CGRA) for realizing compute-kernel acceleration. CGRAs are programmable hardware platforms having an array of coarse ALU-like processing elements, and word-wide programmable interconnect. The proposed framework integrates CGRAs generated by the open-source CGRA-ME tool, with the RISC-V processor from the PULP project. In an experimental study, we use the framework to generate RISC-V+CGRA systems that provide an order-of-magnitude speedup vs. software and considerable speedup vs. a vector processor on several applications by leveraging the CGRA spatial and pipeline parallelism. As CGRA-ME permits a variety of different CGRAs to be modelled and mapped to, we believe the proposed framework represents a powerful open-source platform, enabling a variety of new research on processor/CGRA system architectures and programming models.

11:30-13:00 Session 3B: SYNTHESIS

SYNTHESIS

11:30
A Boolean Heuristic for Disjoint SOP Synthesis

ABSTRACT. We propose a new heuristic algorithm for Disjoint Sum-of-Products (DSOP) minimization of a Boolean function f, based on a new algebraic criterion for product selection. The basic idea behind the new algorithm is to transform a given irredundant Sum-of-Products (SOP), i.e., a set of products covering the on-set minterms of f, into a disjoint SOP by repeated applications of two transformations. The first transformation selects pairs of suitable overlapping products in the initial SOP and replaces them with pairs of non-overlapping products covering the same minterms. By this step, some products are made disjoint, while keeping the overall number of products in the SOP unchanged. Next, a second transformation returns a completely disjoint SOP. By this second step, the number of products will increase. A set of experiments on a standard collection of combinational benchmarks shows that this new method is efficient and produces better results compared to the current best heuristic, achieving a 34.4% average cost reduction in about 46% of the benchmarks, with less computation time.
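To make the notion of a disjoint SOP concrete, here is a minimal sketch of a naive baseline that turns an SOP into a disjoint one using the textbook disjoint-sharp operation. It is not the heuristic proposed in the paper, which relies on an algebraic product-selection criterion that the abstract does not detail. The cube encoding (strings over '0', '1', '-') and the names `overlap`, `disjoint_sharp`, and `make_disjoint` are assumptions for this example.

```python
def overlap(a, b):
    """Two cubes share a minterm unless some variable is 0 in one and 1 in the other."""
    return all(x == y or x == "-" or y == "-" for x, y in zip(a, b))


def disjoint_sharp(b, a):
    """Return pairwise disjoint cubes covering the minterms of b that are not in a."""
    if not overlap(a, b):
        return [b]
    pieces = []
    current = list(b)
    for i, (x, y) in enumerate(zip(a, b)):
        if x != "-" and y == "-":
            piece = current.copy()
            piece[i] = "1" if x == "0" else "0"  # step outside a on variable i
            pieces.append("".join(piece))
            current[i] = x  # later pieces stay inside a on variable i
    return pieces


def make_disjoint(cubes):
    """Naive DSOP: subtract every already kept cube from each new cube."""
    result = []
    for cube in cubes:
        pieces = [cube]
        for kept in result:
            pieces = [q for p in pieces for q in disjoint_sharp(p, kept)]
        result.extend(pieces)
    return result
```

For f = x1 + x2, written as the cubes "1-" and "-1", `make_disjoint` returns "1-" and "01". The result depends on the order in which products are processed, and the number of products can grow, which is the cost that a DSOP minimization heuristic tries to keep low.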

12:00
Resynthesis of logic circuits using machine learning and reconvergent paths

ABSTRACT. Boolean network scoping represents a common approach incorporated in conventional synthesis tools for maintaining the good scalability of the synthesis process. Recently, an approach to local resynthesis based on a combination of evolutionary optimization with the principle of Boolean network scoping has been proposed. Local resynthesis is an iterative process based on the extraction of smaller sub-circuits from a complex circuit that are optimized locally and implanted back into the original circuit. The main advantage of local resynthesis is that it can mitigate the problem of scalability of representation which is typical of evolutionary algorithms, as the efficiency of the evolutionary optimization applied at the global level deteriorates with increasing circuit complexity. Unfortunately, the efficiency of local resynthesis is highly influenced by the efficiency of the sub-circuit extraction process. We propose a new method, based on the identification of reconvergent paths. The evaluation is done on a set of highly optimized complex benchmark problems representing various real-world controllers, logic and arithmetic circuits. It provides better results compared to the state-of-the-art logic synthesis tool and both locally and globally operating evolutionary optimizations presented earlier. A substantially higher number of redundant gates was removed in more than 70% of cases, while keeping the computational effort at the same level. A huge improvement was achieved especially for the controllers. On average, the proposed method was able to remove more than 14.3% of gates. The highest achieved gate reduction was more than 45% of gates.

12:30
Decomposition of transition systems into sets of synchronizing state machines

ABSTRACT. Transition systems (TS) and Petri nets (PN) are important models of computation ubiquitous in formal methods for modeling systems. An important problem is how to extract from a given TS a PN whose reachability graph is equivalent (with a suitable notion of equivalence) to the original TS.

This paper addresses the decomposition of transition systems into synchronizing state machines (SMs), which are a class of Petri nets where each transition has one incoming and one outgoing arc and all markings have exactly one token. This is an important case of the general problem of extracting a PN from a TS. The decomposition is based on the theory of regions, and it is shown that a property of regions called excitation-closure is a sufficient condition to guarantee the equivalence between the original TS and a decomposition into SMs.

An efficient algorithm is provided which solves the problem by reducing its critical steps to the maximal independent set problem (to compute a minimal set of irredundant SMs) or to satisfiability (to merge the SMs). We report experimental results that show a good trade-off between quality of results vs. computation time.

12:50
Efficient Implementation of Heterogeneous Dataflow Models using Synchronous IO Patterns

ABSTRACT. The synthesis of distributed embedded systems based on desynchronization starts from a synchronous model and keeps its functional behavior while generating a corresponding dataflow process network (DPN). This method is attractive since it supports dynamic behavior modeling while avoiding the verification of the absence of problems like deadlocks and buffer overflows in DPNs which is in general not decidable. However a DPN can be heterogeneous in the sense that different parts may possess either static or dynamic behaviors. A smarter synthesis should automatically generate efficient implementations by exploiting this heterogeneity.

In this paper we improve the desynchronization process by introducing synchronous components with various input/output (IO) patterns which can then be desynchronized to a heterogeneous DPN where each node can be scheduled and executed accordingly. We further designed a synthesis tool chain that automatically synthesizes the heterogeneous DPN to the open computing language (OpenCL) based implementation which is platform-independent and can be deployed on various commercial off-the-shelf (COTS) target platforms.

We demonstrate the effective use of the proposed method by a case study of a smart home automation system (SAS), where different versions based on homogeneous as well as heterogeneous DPNs are generated and evaluated for end-to-end performance.

14:30-15:30 Session 4: KEYNOTE 2 SEAA

Keynote 2 - Dott. Marco D’Ambros - CodeLounge (Switzerland) - Research + Industry = R&D - Reflections from a research-industry roundtrip

15:30-16:30 Session 5A: EUROPEAN PROJECTS IN DIGITAL SYSTEMS DESIGN 1
15:30
Programmable Systems for Intelligence in Automobiles (PRYSTINE): Final results after Year 3

ABSTRACT. Autonomous driving has the potential to disruptively change the automotive industry as we know it today. For this, fail-operational behavior is essential in the sense, plan, and act stages of the automation chain in order to handle safety-critical situations on its own, which currently is not reached with state-of-the-art approaches.

The European ECSEL research project PRYSTINE realizes Fail-operational Urban Surround perceptION (FUSION) based on robust Radar and LiDAR sensor fusion and control functions in order to enable safe automated driving in urban and rural environments. This paper showcases some of the key results (e.g., novel Radar sensors, innovative embedded control and E/E architectures, pioneering sensor fusion approaches, AI controlled vehicle demonstrators) achieved until its final year 3.

16:00
Building Blocks and Interaction Patterns of Unmanned Aerial Systems

ABSTRACT. Drones/UAVs can be used to perform missions that are very tedious or dangerous for humans. Such drones are the air segment of an unmanned aircraft system (UAS), where the UAS also includes a ground segment to control and supervise the drones. Furthermore, the ground and air segments interact with external services that support the missions’ execution. In recent years, a number of UAS architectures have been proposed. However, these architectures do not consider all features (functions) of the system and their interactions, which hinders the development of drone systems. Therefore, in this paper, we identify the different functions/blocks of the UAS and their interaction patterns to ease its development tasks. The building blocks include functions for flying a drone, enabling safe and efficient drone flight, drone mission management, and their support (e.g., communication, perception, etc.).

15:30-16:30 Session 5B: Architecture and Hardware for Security Applications (AHSA) 1
15:30
Protected Extension of Cryptographic Algorithms on RISC-V

ABSTRACT. Embedded technologies such as IoT devices, connected cars or medical equipment are often executed in constrained environments with limited resources. The high demand for security makes cryptography essential. Moreover, the security must consider physical attacks as these objects are physically accessible and can be tampered with. Lightweight Cryptography (LWC) proposes interesting candidates for securing the communications in constrained environments. As many lightweight cryptographic algorithms have been proposed with closed architectures, the features of agility and genericity could be considered. Moreover, a high robustness against side-channel analysis (SCA) is required when the connected object executes sensitive applications or manipulates private data. In this work, we propose the use of the Rotating S-Box Masking (RSM) protection as a generic protection that would fit most lightweight block ciphers, more specifically those using 4×4 substitution boxes. This protection is developed as an extension to the RISC-V ISA through the use of additional generic instructions. This specific instruction set was implemented on the VexRisc core processor and tested with a protected implementation of the PRESENT cipher. It is easily portable to most nibble-based LWC cipher types. The security analysis of this secure RISC-V processor showed that SCA was impossible with up to 1 million traces.

16:00
Secure and dependable: Area-efficient masked and fault-tolerant architectures

ABSTRACT. Masking is a powerful instrument for protecting cryptographic devices against side-channel analysis. Multiple masking schemes were introduced providing provable security against attacks of arbitrary order even in the presence of glitches. When a device is a part of some safety-critical system, it needs to meet dependability requirements; therefore, it should be protected against spontaneously occurring faults. Existing commonly used fault-tolerance architectures involve high area overhead, as do the masking schemes. In this paper, we propose architectures meeting the dependability properties of simple modular-redundancy schemes and the SCA resistance of masking schemes, but decreasing the area overhead by utilizing the randomness involved in the masking schemes.

We compare our Masked Duplex architecture with Triple Modular Redundancy (TMR). While using one less redundant module, our architecture saves around 20% of the area in comparison with TMR in the case of Threshold Implementation of the PRESENT cipher, promising more savings for more complex cryptographic schemes.

16:30-17:30 Session 6A: EUROPEAN PROJECTS IN DIGITAL SYSTEMS DESIGN 2
16:30
TEXTAROSSA: Towards EXtreme scale Technologies and Accelerators for euROhpc hw/Sw Supercomputing Applications for exascale

ABSTRACT. To achieve high performance and high energy efficiency on near-future exascale computing systems, a technology gap needs to be bridged: increase efficiency of computation with extreme efficiency in HW and new arithmetics, as well as providing methods and tools for seamless integration of reconfigurable accelerators in heterogeneous HPC multi-node platforms. TEXTAROSSA aims at tackling this gap through a co-design approach to heterogeneous HPC solutions, supported by the integration and extension of HW and SW IPs, programming models and tools derived from European research.

17:00
Going to the Edge - Bringing Internet of Things and Artificial Intelligence Together

ABSTRACT. Artificial Intelligence of Things (AIoT) is the natural evolution for both Artificial Intelligence (AI) and Internet of Things (IoT) because they are mutually beneficial. AI increases the value of the IoT through machine learning by transforming the data into useful information, while the IoT increases the value of AI through connectivity and data exchange. Therefore, BLINDED3 – BLINDED BLINDED BLINDED, a pan-European effort with XX key partners from YY countries (EU and ZZ), will provide intelligent, secure and trustworthy systems for industrial applications to provide comprehensive cost-efficient solutions of intelligent, end-to-end secure, trustworthy connectivity and interoperability to bring the Internet of Things and Artificial Intelligence together. BLINDED3 aims at creating trust in AI-based intelligent systems and solutions as a major part of the AIoT. This paper provides an overview of the concept and ideas behind BLINDED3 and introduces the BLINDED3 reference architecture for infrastructure organization of AIoT use cases.

16:30-17:30 Session 6B: Architecture and Hardware for Security Applications (AHSA) 2
16:30
Studying OpenCL-based Number Theoretic Transform for heterogeneous platforms

ABSTRACT. Lattice-based cryptography can be considered a candidate alternative for post-quantum cryptosystems offering key exchange, digital signature and encryption functionality. Number Theoretic Transform (NTT) can be utilized to achieve better performance for these functionalities, where polynomials need to be multiplied. NTT reduces the multiplication overhead, allowing point-wise multiplication by transforming the polynomials into the spectral domain and then transforming the result back to the original domain. It is important to optimize this technique, which is used in a wide range of computing systems. In this paper we study the feasibility of using OpenCL, a portable framework, to implement a parallelized version of NTT which allows deployment on heterogeneous platforms, such as Graphic Processing Units (GPU) and Field Programmable Gate Arrays. We measure the performance of our implementation on a GPU and evaluate when and where such a deployment is beneficial. Our results showed that the proposed parallel implementation is a viable acceleration approach for these algorithms for lattice-based cryptography solutions.
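The transform, point-wise multiply, inverse-transform flow described in this abstract can be sketched sequentially as follows. This is a toy illustration, not the OpenCL implementation studied in the paper: the modulus p = 17, the transform size n = 8, the root of unity omega = 9, and the function names `ntt` and `poly_mul` are all assumptions chosen so the arithmetic is easy to follow. The sketch also uses zero-padded cyclic convolution, whereas lattice schemes typically work modulo x^n + 1.

```python
def ntt(a, omega, p):
    """Recursive radix-2 number theoretic transform of a (length a power of two).

    omega must be a primitive len(a)-th root of unity modulo the prime p.
    """
    n = len(a)
    if n == 1:
        return a[:]
    omega_sq = omega * omega % p
    even = ntt(a[0::2], omega_sq, p)
    odd = ntt(a[1::2], omega_sq, p)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % p
        out[k] = (even[k] + t) % p           # butterfly, upper half
        out[k + n // 2] = (even[k] - t) % p  # butterfly, lower half
        w = w * omega % p
    return out


def poly_mul(a, b, n=8, p=17, omega=9):
    """Multiply two polynomials modulo p via NTT (toy parameters, assumed).

    len(a) + len(b) - 1 must not exceed n, otherwise coefficients wrap around.
    """
    fa = ntt(a + [0] * (n - len(a)), omega, p)
    fb = ntt(b + [0] * (n - len(b)), omega, p)
    fc = [x * y % p for x, y in zip(fa, fb)]  # point-wise multiplication
    inv_omega = pow(omega, p - 2, p)          # modular inverses via Fermat
    inv_n = pow(n, p - 2, p)
    c = ntt(fc, inv_omega, p)                 # inverse transform
    return [x * inv_n % p for x in c]
```

For example, `poly_mul([1, 2], [3, 4])` computes (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2 modulo 17. The butterflies inside each recursion level are independent of each other, which is what makes the transform a natural target for parallel execution.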

17:00
Novel non-cryptographic hash functions for networking and security applications on FPGA

ABSTRACT. This paper proposes the design and FPGA implementation of five novel non-cryptographic hash functions, that are suitable to be used in networking and security applications that require fast lookup and/or counting architectures. Our approach is inspired by the design of the existing non-cryptographic hash function Xoodoo-NC, which is constructed through the concatenation of several Xoodoo permutations. We similarly construct non-cryptographic hash functions based on the concatenation of several rounds of symmetric-key ciphers. The goal is to achieve high performance in combination with good avalanche properties, which are required in order to have a significant change in the output value as a result of a limited change in the input value. We simulate how many rounds are needed to achieve satisfactory avalanche scores and we implement the corresponding non-cryptographic hash functions on an FPGA to evaluate the occupied resources and the performance. One of the proposed non-cryptographic hash functions, namely GIFT-NC, outperforms all previously proposed non-cryptographic hash functions in terms of throughput and latency, in exchange for an acceptable increase in FPGA resources.
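The avalanche property mentioned above can be estimated in software by flipping single input bits and counting how many output bits change, with a score close to 0.5 being the usual target. The sketch below assumes a 32-bit input and output and uses a well-known 32-bit integer mixing step (the finalizer from MurmurHash3) as a stand-in hash. It is not one of the paper's ciphers-based functions, and the names `mix32` and `avalanche_score` are hypothetical.

```python
import random

MASK32 = 0xFFFFFFFF


def mix32(h):
    """Stand-in 32-bit mixing function (MurmurHash3 finalizer), for illustration."""
    h &= MASK32
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def avalanche_score(hash_fn, bits=32, samples=200, seed=0):
    """Average fraction of output bits that flip when one input bit is flipped.

    Assumes hash_fn maps bits-wide integers to bits-wide integers.
    """
    rng = random.Random(seed)
    flipped = 0
    for _ in range(samples):
        x = rng.getrandbits(bits)
        base = hash_fn(x)
        for i in range(bits):
            diff = base ^ hash_fn(x ^ (1 << i))
            flipped += bin(diff).count("1")  # number of changed output bits
    return flipped / (samples * bits * bits)
```

Calling `avalanche_score(mix32)` gives a value near 0.5, while the identity function scores exactly 1/32, since each flipped input bit changes only one output bit. Repeating such a measurement for an increasing number of cipher rounds is one way to decide how many rounds are enough.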

17:30-19:00 Session 7A: EUROPEAN PROJECTS IN DIGITAL SYSTEMS DESIGN 3
17:30
AIDOaRt: AI-augmented automation for DevOps, a model-based framework for continuous development At RunTime in cyber-physical systems

ABSTRACT. With the emergence of Cyber-Physical Systems (CPS), the increasing complexity in development and operation demands an efficient engineering process. In recent years, DevOps has promoted closer continuous integration of system development and its operational deployment perspectives. In this context, the use of Artificial Intelligence (AI) is beneficial to improve the system design and integration activities; however, it is still limited despite its high potential. AIDOaRt is a 3-year H2020-ECSEL European project involving 32 organizations, grouped in clusters from 7 different countries, focusing on AI-augmented automation supporting modelling, coding, testing, monitoring and continuous development of Cyber-Physical Systems (CPS). The project proposes to apply Model-Driven Engineering (MDE) principles and techniques to provide a framework offering proper AI-enhanced methods and related tooling for building trustable CPSs. The framework is intended to work within the DevOps practices combining software development and information technology (IT) operations. In this regard, the project aims at enabling AI for IT operations (AIOps) to automate the decision-making process and complete system development tasks. This paper presents an overview of the project with the aim of discussing the context, objectives and the proposed approach.

18:00
The H2020-ECSEL Project “iRel40” (Intelligent Reliability 4.0)

ABSTRACT. Building on many discoveries and inventions, electronics started affecting people’s everyday lives in a significant fashion following the invention of the first solid state transistor in the late 1940s. Miniaturization paved the way for mass electronics production and later the digital revolution, the outcomes of which are visible to all members of the public today. After about a two-decade-long swing around the 2000s from hardware towards software regarding what affects lives more, a point has now been reached where electronics is more important to all and its use is more ubiquitous and crucial than ever before. In most if not all end-user or industrial applications, the capability and quality of electronics hardware are the key determining factors. The European electronics components and systems (ECS) industry has traditionally had a high baseline for electronics innovation. However, the industry is now compelled, partly due to competition and partly due to customer demand, to manufacture even more reliable electronics products than before. Guaranteeing the reliability of electronics hardware requires the entire ECS value chain to undergo a paradigm shift to holistically address reliability as a key issue. The European ECS industry previously adopted overseas outsourcing considerably; however, it is now taking steps to reshape itself into a more coherent value chain with the aim of having not only the electronics designs but also the electronics manufacturing made in Europe. The H2020-ECSEL programme successfully funds highly competitive projects in the area of electronics components and systems. We present here a prologue to a similarly funded project entitled Intelligent Reliability 4.0 (“iRel40”), by providing a background to the topic of ECS, project objectives and the methodologies and implementations we plan to undertake during the 36-month period of this ongoing project.

18:30
Pre-Integrated Architectures for sustainable complex Cyber-Physical Systems

ABSTRACT. The paradigm of Cyber-Physical Systems is spreading widely across several industrial domains such as Automotive, Construction, Energy, Health, Manufacturing, Smart Cities. But the system architectures, processes and operations related to these Cyber-Physical Systems are reaching a high level of global complexity, which is difficult to sustain by the different stakeholders. In addition, new ambitious constraints are being added to the list of requirements that these Cyber-Physical Systems must comply with. The purpose of this paper is to propose the concept of pre-integrated architectures as solutions to improve the development and operational processes of these complex Cyber-Physical Systems. An outlook of the practical implementations and impacts in four industrial domains will be provided, in relationship with the developments performed in the CPS4EU project.

17:30-18:30 Session 7B: Architecture and Hardware for Security Applications (AHSA) 3
17:30
MaDMAN: Detection of Software Attacks Targeting Hardware Vulnerabilities

ABSTRACT. The increasing complexity of modern microprocessors created new attack areas. Attackers exploit these areas using Software Attacks Targeting Hardware Vulnerabilities (SATHV) such as Cache Side-Channel, Spectre, and Rowhammer attacks. These attacks target the microarchitecture to extract privileged information. As their target is the hardware, antivirus programs cannot detect them. However, they modify the normal behavior of the microarchitecture. Modern systems are equipped with hardware performance counters (HPCs), which measure events related to hardware components. Designers can take advantage of these counters to monitor and protect the system. In the literature, there exist many solutions that use HPCs to detect SATHV. However, due to the limited number of counters, proposed solutions only protect the microprocessor against a limited set of SATHV. In contrast, we propose MaDMAN, a Malware Detector, which gathers information from HPCs to detect a large set of SATHV. MaDMAN uses a Logistic Regression classifier. In our threat model, we include Cache Side-Channel, Rowhammer, and Spectre SATHV. Our detection mechanism succeeds in detecting these attacks with 98.96% accuracy, 96.3% F-score, and 0% false positive rate. In addition, MaDMAN works in noisy environments and can successfully detect evasive malware.

18:00
Analysis of a Laser-induced Instructions Replay Fault Model in a 32-bit Microcontroller

ABSTRACT. In this paper, we present a method to obtain a new Laser Fault Injection (LFI) induced fault model: replay of instructions on a 32-bit Microcontroller (MCU). This method allows a potential adversary to replay a block of two or four instructions with a fault rate up to 100%. These faults are induced by laser pulses and cause the instruction update process of a Flash buffer to fail. As a result, since the new instructions fail to be stored in the Flash buffer, the previous ones are replayed. We studied the properties of this replay fault model in depth through many laser fault injection experiments. We have notably shown that the sensitivity window is proportional to the laser Pulse Width (PW), and that up to 20 instructions in a row could be overwritten by replaying the block of four instructions five times. The effects of the laser power and cache status (enabled or disabled) are also presented. Finally, we proposed and assessed a simple method to detect the LFI-induced replay faults using a hardware counter with different increments. Our results extend the ability of LFI on MCU, illustrating the accuracy and reproducibility of LFI.