DSD 2024: 2024 EUROMICRO DIGITAL SYSTEM DESIGN CONFERENCE
PROGRAM FOR WEDNESDAY, AUGUST 28TH

10:45-12:15 Session 1A: Emerging Technologies and AI
Location: Room 109
10:45
Parameter Space Exploration of Neural Network Inference Using Ferroelectric Tunnel Junctions for Processing-In-Memory

ABSTRACT. This paper explores CMOS-compatible Ferroelectric Tunnel Junctions (FTJs) for processing-in-memory (PIM) to address the 'memory wall' in traditional computing. A novel FTJ noise model was developed, and hardware-calibrated devices were modeled utilizing IBM's Analog In-Memory Hardware Acceleration Toolkit (AIHWKit). We simulate FTJ-based neural networks (NNs) for inference only, focusing on mitigating non-idealities such as conductance drift, programming noise, and 1/f read noise. To mitigate the FTJ non-idealities that affect the accuracy of different NNs, we developed various hardware-aware (HWA) training techniques including, but not limited to, different weight redistribution, noise resiliency, and custom activation functions. Applying these HWA techniques, together with different weight-to-conductance mappings and enlarged conductance and weight ranges, yields an accuracy improvement in two case studies: a full adder implementation and the MNIST benchmark, with best-case accuracies of 96.96% and 86.92%, respectively, after four months of inference. Furthermore, scalability, accuracy improvement, and experimental validation confirm FTJs' potential for PIM, providing valuable insights for future device and circuit design optimizations to enhance performance and reliability.

11:15
Design Objectives for Synthesis of Graphene PN Junction Circuits based on Two-level Representation

ABSTRACT. The development of electrostatically doped graphene PN-junctions shows promise for creating efficient low-power, high-speed circuits. In recent years, there has been a considerable interest in the synthesis of graphene PN junction logic circuits. However, existing synthesis methods lack assessment based on technology-specific cost metrics (e.g., number of graphene PN junction gates, constant inputs), leading to insufficiently addressed design objectives. In this paper, we introduce synthesis approaches for graphene PN-junction circuits based on Sum-of-Products (SoP) and Exclusive Sum-of-Products (ESoP) function representations. Experimental results indicate that ESoP-based synthesis significantly reduces the number of graphene PN junction gates, constant inputs, and switching activity compared to SoP-based approaches. Overall, ESoP-based synthesis is deemed more suitable than SoP-based methods for designing graphene PN-junction logic circuits.

11:30
Exploiting PSOP Decomposition for Quantum Synthesis – WiP

ABSTRACT. The synthesis strategy for quantum oracles is based on a reversible logic synthesis and a quantum compilation step. In reversible logic synthesis it is important to obtain a compact reversible circuit in order to minimize the size of the final quantum circuit. Projected Sum Of Product, PSOP, decomposition is an EXOR based technique that can be applied to any Boolean function as a very fast pre-processing step for further minimizing the circuit area in standard logic synthesis. In this paper, we exploit PSOP decomposition in quantum synthesis. In particular, we describe a new technique for the quantum synthesis of PSOP decomposed functions. The experimental results validate the proposed pre-processing method in quantum synthesis, showing an interesting gain in area, within the same time limit.

11:45
Hardware Acceleration of Capsule Networks for Real-time Applications

ABSTRACT. Capsule networks (CapsNets) are a category of deep learning neural networks (DNNs) that address one of the main deficiencies of convolutional neural networks (CNNs): the loss of spatial information in pooling layers. However, the main concern with CapsNets is their compute-intensive nature, which is mainly related to the vector-based calculations during dynamic routing and acts as a barrier to their deployment in real-time applications. To address the specific computing requirements of dynamic routing in the capsule layers of CapsNets, we develop a hardware accelerator for dynamic routing using Vitis HLS. This paper presents a hardware acceleration solution for capsule networks by integrating the AMD Xilinx deep processing unit (DPU) and a custom accelerator for the capsule layer. Our results show a significant improvement in throughput compared with the baseline implementation when CapsNet is implemented on a Zynq UltraScale+ MPSoC ZCU102 using Vitis AI DPUs.

10:45-12:15 Session 1B: Architectures and Hardware for Security Applications - 1
Location: Room 106
10:45
Counter power leakage for frequency extraction of ring oscillators in ROPUF

ABSTRACT. This paper deals with power side-channel analysis to extract the frequencies of ring oscillators (ROs) to enable an attack on a ring oscillator based physical unclonable function (ROPUF). Side-channel attacks against ROPUFs exist, though they require costly equipment and are difficult to carry out. In this paper, we devise a method that uses a power side-channel to extract RO frequencies through counter leakage using fewer resources. This method makes it possible to derive the PUF response of some ROPUF constructions, thus defeating them. We show the side-channel leakage on a minimal design consisting of a single RO and a counter. Then we explore the dependence of the leakage on FPGA routing and counter type and demonstrate the attack method on a ROPUF implemented on a Xilinx Artix-7 FPGA.

11:15
A Runtime-Accessible True Random Number Generator Based on Commercial Off-The-Shelf Resistive Random Access Memory Modules

ABSTRACT. In this work, we propose a novel runtime-accessible True Random Number Generator (TRNG) that is implemented on Commercial Off-The-Shelf (COTS) Resistive Random Access Memory (ReRAM) modules. Modules of two different manufacturers (Adesto Technologies and Fujitsu) have been tested, providing almost equally good results. In particular, the proposed TRNG can successfully pass all the tests of the NIST SP800-22 Statistical Test Suite at a wide range of temperatures. At the same time, the TRNG achieves a throughput of at least 28 bits per second under adverse temperature conditions and approximately 51 bits per second at room temperature in the worst case, which is sufficient for many practical applications such as security protocols for the Internet of Things (IoT) and in-vehicle networks.

11:30
COSOI: True Random Number Generator Based on Coherent Sampling using the FD-SOI technology - WiP

ABSTRACT. This work presents a proof of concept of the implementation of a Coherent Sampling Ring Oscillator TRNG (COSO-TRNG) using the Fully Depleted Silicon On Insulator (FD-SOI) technology. COSO-TRNG appears as one of the best structures optimizing the throughput per area trade-off and having a model for its entropy source. The back-biasing capability of the FD-SOI technology proved here to be a very simple and efficient technique for the ring oscillator frequency calibration needed for the coherent sampling method. This is the first demonstration of feasibility of COSO-TRNG validated on ASIC FD-SOI 22nm. A throughput of 3.36 Mbits/s was obtained, equivalent to results in the literature.

11:45
Exploring Fault Injection Attacks on CVA6 PMP Configuration Flow

ABSTRACT. Fault-Injection Attacks (FIA) pose significant threats to the security and reliability of embedded systems. FIAs can be used to target an embedded processor by manipulating its clock signal or power supply or by using electromagnetic pulses. In this study, we analyse FIA on the Physical Memory Protection (PMP) configuration flow of a CVA6 RISC-V core. Fault injection campaigns targeting an FPGA on an ARTY A7-100T board are performed to characterize the fault effects. For that purpose, we rely on clock glitches. Moreover, in order to further characterize the induced faults, Error-Correction Code (ECC) is considered. We extend each pipeline stage with hardware modules to detect and eventually correct faults using Hamming code.

Experimental results demonstrate that FIA has multiple effects on the PMP configuration registers. By classifying these effects with regard to the injection parameters, we highlight that a given effect can be obtained with high probability by an attacker. Furthermore, thanks to integrated ECC modules used as probes, we confirm that the single bit-flip is a prevalent effect in our experiments. In particular, the results demonstrate that numerous fault effects observed in the PMP configuration registers are caused by single bit-flips in the ID stage of the CVA6 core.
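The abstract above mentions Hamming code used to detect and correct faults, with single bit-flips as the prevalent effect. As a rough illustration of the principle only, here is a minimal software sketch of a textbook Hamming(7,4) code in Python. It is not the authors' hardware modules: the code length, the bit ordering, and the function names `hamming74_encode` and `hamming74_correct` are assumptions chosen for this example.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword laid out as positions 1..7:
    p1, p2, d1, p4, d2, d3, d4 (textbook layout, assumed for illustration)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (corrected_codeword, error_position).

    error_position is the 1-based index of the flipped bit, or 0 when the
    syndrome is zero (no single-bit error detected).
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4  # the syndrome encodes the faulty position
    if position:
        c[position - 1] ^= 1  # flip the faulty bit back
    return c, position


# Usage: inject a single bit-flip and observe that it is located and repaired.
clean = hamming74_encode([1, 0, 1, 1])
faulty = list(clean)
faulty[4] ^= 1  # flip position 5
repaired, where = hamming74_correct(faulty)
```

Because the syndrome directly names the flipped position, a decoder of this kind can also serve as a probe that reports where a fault landed, which is in the spirit of the role the abstract describes for its ECC modules.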

12:00
Impact of Compiler Optimization Flags on Side-Channel Information Leakage of SipHash algorithm

ABSTRACT. This work presents an experimental evaluation of the influence of compiler optimization flags on side-channel information leakage. SipHash, an ARX-based pseudorandom function optimized for short inputs, was used as the reference algorithm. A ChipWhisperer CW308 with various targets was used for the evaluation, based on the guessing entropy of CPA and Welch's t-test. The main contribution of this paper is an analysis of the impact of each flag and its suitability for implementations minimizing side-channel leakage.

10:45-12:15 Session 1C: Dependability, Testing and Fault Tolerance in Digital Systems
Location: Room 116
10:45
SAT can Ensure Polynomial Bounds for the Verification of Circuits with Limited Cutwidth

ABSTRACT. As hardware designs are getting more complex, verification becomes ever more important to prevent producing chips which do not behave according to their specification. This increasing complexity also impacts the verification process, resulting in a longer time-to-market. Ensuring that the verification itself can be conducted efficiently helps to face these challenges. Additionally, taking efficient verification into consideration during the design phase further enables the optimization of the whole process. In this paper, we present a SAT-based verification flow and show how it can ensure polynomial bounds for the verification of circuits with limited cutwidth. To demonstrate our approach, the flow is applied to three different adder architectures. Addition is one of the most essential operations in digital computations, and the simplicity of its circuit realizations makes it a good starting point to explore their efficient verification using SAT. We provide theoretical proofs that SAT can be used for Polynomial Formal Verification (PFV) of circuits with limited cutwidth. We then show that for the considered adder circuits, a linear time complexity of the verification process can be ensured, and we confirm our findings by experimental evaluation with our own SAT solver.

11:15
Studying the Degradation of Propagation Delay on FPGAs at the European XFEL

ABSTRACT. An increasing number of unhardened commercial-off-the-shelf embedded devices are deployed under harsh operating conditions and in highly-dependable systems. Due to the mechanisms of hardware degradation that affect these devices, ageing detection and monitoring are crucial to prevent critical failures. In this paper, we empirically study the propagation delay of 298 naturally-aged FPGA devices that are deployed in the European XFEL particle accelerator. Based on in-field measurements, we find that operational devices show significantly lower switching frequencies than unused chips, and that increased gamma and neutron radiation doses correlate with increased hardware degradation. Furthermore, we demonstrate the feasibility of developing machine learning models that estimate the switching frequencies of the devices based on historical and environmental data.

11:45
Influence of Structural Units on Vulnerability of Systems with Distinct Protection Approaches

ABSTRACT. Mission/safety-critical applications rely on the dependability of their semiconductor control systems. Lockstepping is the state of the art when it comes to their protection. This paper compares dual/triple lockstep systems with the Hardisc, which features microarchitecture-level protection. For a fair comparison, each system is based on the same architecture and integrates protection against bit-flips in memory and transient faults in the bus. We analyse how individual structures of the system affect its vulnerability to faults. Our fault injection methodology combines pre-synthesis simulation with synthesis data to scale the fault rate of the structures according to their size. The results show that the Hardisc can withstand fault rates orders of magnitude higher than the dual-core lockstep system while preserving the same area and power consumption. It comes with a 5% frequency penalty. The pipeline logic has the most significant impact on the vulnerability of the lockstep system. On the other hand, the ECC-protected register file constrains the maximum failure rate the Hardisc can withstand due to fault accumulation during specific program sections.

12:00
A Reconfigurable Approximate Computing RISC-V Platform for Fault-Tolerant Applications

ABSTRACT. The demand for energy-efficient and high-performance embedded systems drives the evolution of new hardware architectures, including concepts like approximate computing. This paper presents a novel reconfigurable embedded platform named "phoeniX", which uses the standard RISC-V ISA and maximizes energy efficiency while maintaining acceptable application-level accuracy. The platform enables the integration of approximate circuits at the core level with diverse structures, accuracies, and timings without requiring modifications to the core, particularly in the control logic. The platform introduces novel control features, allowing configurable trade-offs between accuracy and energy consumption based on specific application requirements. To evaluate the effectiveness of the platform, experiments were conducted on a set of applications, such as image processing and the Dhrystone benchmark. The core with its original execution engine occupies 0.024 mm² of area, with an average power consumption of 4.23 mW at an operating voltage of 1.1 V and an average energy efficiency of 7.85 pJ per operation at a frequency of 620 MHz in 45 nm CMOS technology. The configurable platform, with a highly optimized 3-stage pipelined RV32I(E)M architecture, achieves 1.89 DMIPS/MHz and a CPI of 1.13, showcasing remarkable capabilities for an embedded processor.

14:45-16:15 Session 2A: Networking
Location: Room 109
14:45
APEnetX: a custom NIC for cluster interconnects

ABSTRACT. The APEnet project is an established initiative aimed at designing interconnection boards based on FPGAs and tailored for use in HPC clusters whose nodes are arranged in a 3D toroidal network topology. Based on a custom communication protocol and network IPs, deployments of an APEnet NIC can be configured for different operating environments. In this work, we describe APEnetX, the latest version of the network architecture in the APEnet family, developed on Alveo U200 boards which leverage the 16nm technology devices in the UltraScale+ line by Xilinx. APEnetX features a PCIe Gen3 x16 interface driven by the QDMA Xilinx IP and adheres to the RDMA semantics, ensuring efficient high bandwidth data transfer without the involvement of the host operating system. The communication between adjacent nodes is handled through a proprietary lightweight protocol and enabled by the high-speed serial embedded transceivers of the FPGA, while a custom full crossbar switch and an optimized router allow for the implementation of a direct torus topology. Together with a proprietary software driver, a low-level communication library and a dedicated MPI library, APEnetX lines up with current application standards in HPC systems. In addition, the release of an APEnetX network simulator allows for testing network functionalities at large scale. The achievement of this work is twofold: we enhance the performance of the previous APEnet cards, reaching a microsecond user-level latency over the PCIe for small packet transmission between topologically adjacent host nodes; and we define the requirements of the communication generated by a reference spiking neural network simulator (NEST) to drive the co-design of future APEnet generations.

15:15
FlexCross: High-Speed and Flexible Packet Processing via a Crosspoint-Queued Crossbar

ABSTRACT. The fast pace at which new online services emerge leads to a rapid surge in the volume of network traffic. A recent approach that the research community has proposed to tackle this issue is in-network computing, which means that network devices perform more computations than before. As a result, processing demands become more varied, creating the need for flexible packet-processing architectures. State-of-the-art approaches provide a high degree of flexibility at the expense of performance for complex applications, or they ensure high performance but only for specific use cases. In order to address these limitations, we propose FlexCross. This flexible packet-processing design can process network traffic with diverse processing requirements at over 100 Gbit/s on FPGAs. Our design contains a crosspoint-queued crossbar that enables the execution of complex applications by forwarding incoming packets to the required processing engines in the specified sequence. The crossbar consists of distributed logic blocks that route incoming packets to the specified targets and resolve contentions for shared resources, as well as memory blocks for packet buffering. We implemented a prototype of FlexCross in Verilog and evaluated it via cycle-accurate register-transfer level simulations. We also conducted test runs with real-world network traffic on an FPGA. The evaluation results demonstrate that FlexCross outperforms state-of-the-art flexible packet-processing designs for different traffic loads and scenarios. The synthesis results show that our prototype consumes roughly 21% of the resources on a Virtex XCU55 UltraScale+ FPGA.

15:30
Achieving High-Throughput with a Trainable Neural-Network-Based Equalizer for Communications on FPGA

ABSTRACT. The ever-increasing data rates of modern communication systems lead to severe distortions of the communication signal, imposing great challenges to state-of-the-art signal processing algorithms. In this context, neural network (NN)-based equalizers are a promising concept since they can compensate for impairments introduced by the channel. However, due to the large computational complexity, efficient hardware implementation of NNs is challenging. In particular, the backpropagation algorithm, required to adapt the NN’s parameters to varying channel conditions, is highly complex, limiting the throughput on resource-constrained devices like field programmable gate arrays (FPGAs). In this work, we present an FPGA architecture of an NN-based equalizer that exploits batch-level parallelism of the convolutional layer to enable a custom mapping scheme of two multiplications to a single digital signal processor (DSP). Our implementation achieves a throughput of up to 20 GBd, which enables the equalization of high-data-rate nonlinear optical fiber channels while providing adaptation capabilities by retraining the NN using backpropagation. As a result, our FPGA implementation outperforms an embedded graphics processing unit (GPU) in terms of throughput by two orders of magnitude. Further, we achieve a higher energy efficiency and throughput than state-of-the-art NN training FPGA implementations. Thus, this work fills the gap of high-throughput NN-based equalization while enabling adaptability by NN training on the edge FPGA.

15:45
A hardware accelerator for quantile estimation of network packet attributes

ABSTRACT. Measuring statistical properties of network traffic can improve our understanding of traffic distribution and help us detect short and long-term anomalies. However, computing the exact value of these properties requires significant storage and computation, which limits their application in high-speed networks. Hardware accelerators provide the computational power to process a large sequence of network packets with high throughput and low latency, but their performance is ultimately limited by the amount of on-chip memory available on the device. Consequently, researchers have proposed sketch-based algorithms to estimate properties of a data stream with sublinear memory and theoretical estimation error bounds. In this paper, we present a streaming algorithm and hardware accelerator for quantile estimation, which is based on the architecture of the KLL sketch. Implemented on an AMD Virtex XCU55 UltraScale+ FPGA, the accelerator operates at a clock frequency of 356 MHz, thereby achieving a minimum line rate of 182 Gbps. When processing a set of 10 real traffic traces of up to 123 million packets, the accelerator estimates 1000 packet-size quantiles per trace with a median error of 0.39% or less, and a maximum error of 1.3% or less across all traces.

16:00
Agile Design-Space Exploration of Dynamic Layer-skipping in Neural Receivers

ABSTRACT. Dynamic Neural Networks (DyNN) adapt their structure during runtime for improved performance and/or lower power consumption. DyNNs seem to benefit neural-network-based wireless receivers, or in short neural receivers, that adapt their performance under varying channel conditions. However, DyNN hardware architectures are not yet explored sufficiently in the literature as this would require a framework for simultaneously evaluating system-level performance and accurate hardware Power-Performance-Area (PPA) under different dynamic scenarios. This paper presents two main contributions: (1) An automated framework that merges a system-level performance evaluation model of a wireless system and a High-Level Synthesis (HLS) tool for agile Design-Space Exploration (DSE) of DyNNs. (2) A novel DyNN architecture for neural receivers in wireless systems that skips a variable number of layers based on varying channel conditions. Our proposed neural receiver architecture with layer-skipping for a Single Input Multi Output (SIMO) wireless system shows power consumption savings of up to 59.2% in 22 nm FD-SOI technology compared to a static network.

14:45-16:15 Session 2B: Architectures and Hardware for Security Applications - 2
Location: Room 106
14:45
How Primitive but How Effective: Fault-Injection Attack on Cryptographic Accelerator of Microchip CEC 1702 Microcontroller

ABSTRACT. Fault injection attacks pose a substantial threat to the security of digital systems, compromising integrity and exposing vulnerabilities. This article explores fault injection techniques, specifically voltage glitching, within the context of Lenstra’s attack, which is an extension of the Bellcore attack, on RSA-CRT, which has existed for decades. The focus is on the cryptographic accelerator of the Microchip CEC 1702 microcontroller. The study employs the ChipWhisperer toolkit to perform fault injection attacks on both software and hardware implementations of RSA-CRT. Results reveal vulnerabilities in the commercially produced Microchip CEC 1702 microcontroller, highlighting potential security risks associated with fault injection attacks.
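The abstract above refers to Lenstra's extension of the Bellcore attack on RSA-CRT. The following is a toy Python sketch of why a single faulty CRT signature is dangerous. It models the fault in pure software with deliberately tiny primes; the key values, the fault model (adding 1 to the mod-p half), and the function name `sign_crt` are assumptions made for this illustration and have nothing to do with the CEC 1702 hardware or the voltage-glitching setup used in the paper.

```python
from math import gcd

# Toy RSA key (assumed values, far too small for real use).
p, q = 1009, 1013
N, e = p * q, 65537
d = pow(e, -1, (p - 1) * (q - 1))  # private exponent (Python 3.8+)
dp, dq = d % (p - 1), d % (q - 1)  # CRT exponents
q_inv = pow(q, -1, p)


def sign_crt(m, fault=False):
    """RSA-CRT signature; fault=True corrupts the mod-p half-exponentiation."""
    sp = pow(m, dp, p)
    sq = pow(m, dq, q)
    if fault:
        sp = (sp + 1) % p  # software stand-in for a glitch-induced error
    h = (q_inv * (sp - sq)) % p  # Garner recombination
    return (sq + h * q) % N


m = 123456
s_bad = sign_crt(m, fault=True)

# Lenstra's observation: the faulty signature is still correct modulo q,
# so s_bad^e - m is a multiple of q but not of p, and the gcd reveals q.
recovered = gcd((pow(s_bad, e, N) - m) % N, N)
```

Unlike the original Bellcore formulation, which compares a correct and a faulty signature of the same message, this variant needs only the faulty signature and the known message.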

15:15
Securing Elapsed Time for Blockchain: Proof of Hardware Time and some of its Physical Threats

ABSTRACT. Blockchain technology enables the creation of a time-stamped, shared, and replicated history of events among participants who do not trust each other. To agree on the shared history, the blockchain uses a consensus protocol, such as Nakamoto's protocol in Bitcoin. This protocol relies on a proof that ensures the elapsed time between two blocks by design with the Proof of Work mechanism. This paper focuses on the use of hardware security components to guarantee the time elapsed between two events by design and at low power. It introduces a Proof of Hardware Time (PoHT), a hardware-based proof of elapsed time compatible with embedded systems and based on a System on Module (SoM) that features an ARM Cortex-A7 processor with a TrustZone and a Trusted Platform Module. The issue of modifying the digital elapsed time measured by hardware components is examined in relation to real elapsed time. The paper considers whether it is possible to alter this measurement by carrying out hardware attacks on the target. Two main dreaded events are considered: compression and elongation of the digital value of the elapsed time. Experimental attacks on the SoM are performed in terms of temperature and supply voltage, targeting clock oscillators and time measurement functions, after which a security analysis and a discussion of possible countermeasures are presented.

15:45
Automatic Generation of modular multipliers upon pseudo Mersenne primes using DSP blocks on FPGAs

ABSTRACT. Modular multiplication plays a crucial role and often forms the critical path in the execution of complex cryptographic algorithms such as Elliptic Curve Cryptography (ECC) and Homomorphic Encryption. Execution in ECC primarily requires multiplication and addition in $GF(p)$. On the other hand, homomorphic encryption, a cryptographic technique that allows arithmetic operations over encrypted data, requires addition and multiplication in the ring $R_q=Z_q[x]/(x^n+1)$, which in turn requires coefficient multiplication over the integer $q$. Efforts to enhance the efficiency of modular multipliers have been extensively pursued due to the pivotal role they play in the critical path of various cryptographic operations. In recent times, FPGAs containing DSP blocks have been found capable of efficiently performing various mathematical operations, including the modular multiplications of different cryptographic algorithms. However, designing multipliers on FPGAs is tricky because the operand sizes supported by the DSPs are asymmetric. In this paper, we have created a tool that can autonomously generate efficient hardware designs to support the execution of modular arithmetic upon pseudo-Mersenne primes. Using the tool, we generated modular multiplier designs for various primes that support several elliptic curves and also for several Residue Number Systems (RNS), supporting a large prime. The hardware implementations generated by our tool are significantly more efficient than existing implementations in terms of critical path delay and resource utilisation.
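Pseudo-Mersenne primes, as targeted by the tool described above, have the form $p = 2^k - c$ with a small $c$, which lets the reduction step replace division by shifts, a small multiplication, and additions. The Python sketch below shows only this generic arithmetic idea; it says nothing about how the authors' tool maps operations onto DSP blocks. The function names `reduce_pseudo_mersenne` and `modmul` are hypothetical, and the example prime $2^{255} - 19$ is chosen here as a familiar instance rather than taken from the paper.

```python
def reduce_pseudo_mersenne(x, k, c):
    """Reduce a non-negative integer x modulo p = 2**k - c (c small, c > 0).

    Uses the congruence 2**k = c (mod p): split x into high and low parts
    at bit k and fold the high part back in, multiplied by c.
    """
    p = (1 << k) - c
    mask = (1 << k) - 1
    while x >> k:  # fold until x fits in k bits
        x = (x >> k) * c + (x & mask)
    while x >= p:  # final conditional subtraction
        x -= p
    return x


def modmul(a, b, k, c):
    """Modular multiplication: full product followed by the folding reduction."""
    return reduce_pseudo_mersenne(a * b, k, c)


# Usage with p = 2**255 - 19.
K, C = 255, 19
P = (1 << K) - C
a = 0x1234567890ABCDEF1234567890ABCDEF1234567890ABCDEF1234567890ABCDEF
b = P - 5
product = modmul(a, b, K, C)
```

Each fold strictly shrinks the operand, so for a double-width product only a couple of folds are needed before the final subtraction.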

16:00
Design and Evaluation of Combined Hardware FIA and SCA Countermeasures for AES Cipher

ABSTRACT. The development of hardware countermeasures against hardware attacks has attracted the attention of the scientific community over the past two decades. Side channel analysis (SCA) and fault injection analysis (FIA) attacks represent a significant risk to the security of cryptographic circuits. Combined countermeasures against these types of attack must be carefully designed if both protections are required in the same implementation without interfering with each other. In this paper, a combined proposal is presented to counteract SCA and FIA attacks. The SCA countermeasure is based on a low-entropy masking scheme, while the FIA countermeasure is based on a signature generator detection scheme. Both have been experimentally implemented in an AES cipher on an FPGA. Experimental FIA attacks, first-order CPA attacks, and Test Vector Leakage Assessment (TVLA) tests have been carried out to evaluate the improvement in the security levels. The experimental results show that, for an area increase of 28.77% and a frequency degradation of 4.05% with respect to the unprotected implementation, the protected AES scheme shows great security metrics against FIA (99.87% fault coverage) and SCA attacks.

14:45-16:15 Session 2C: Digital Design and Verification with Chisel
Location: Room 116
14:45
Teaching Agile Hardware Design with Chisel

ABSTRACT. Agile hardware design techniques take the best of software engineering methods and apply them to improve hardware design productivity. Agile approaches not only reduce the time to solution, but they can also produce solutions which are better tailored for their target problems. Chisel provides the perfect vehicle to teach these techniques as it allows for the creation of reusable hardware generators. In this work, we outline our experiences creating an agile hardware design course using Chisel, and the lessons learned from teaching it 4 times. All of the course materials are available as open source.

15:15
Hardware Generators with Chisel

ABSTRACT. Most digital hardware is currently described in the so-called hardware description languages, such as VHDL and (System)Verilog. These languages provide limited programming models for hardware construction despite receiving regular updates and extensions. Chisel defines itself as a hardware construction language, which means it shall permit more than the mere description of digital circuits. However, programmatic hardware generation is not new. Scripting languages like Perl generate VHDL or Verilog code from sources like Excel spreadsheets. Chisel, being embedded in the general-purpose language Scala, lends itself to writing hardware generators in that language.

In this paper, we consider the Chisel-Scala ecosystem an ideal starting point for programming hardware generators and explore various programming models for this purpose. We are confident that proven technologies from the software development world can be leveraged in the hardware design domain to improve hardware designers' productivity to build the next billion transistor chips.

15:45
DSL-based SNN Accelerator Design using Chisel

ABSTRACT. Spiking Neural Networks (SNNs) are a promising class of algorithms for hardware acceleration, even outperforming traditional neural networks in some cases. However, existing SNN accelerator approaches do not perform exhaustive explorations of possible network parameters, including neuron models and spike codings; instead, they often focus on a single network setting and a given fixed hardware architecture for its implementation. Chisel is a hardware construction language that allows the modeling of high-level abstractions from the Register-Transfer Level (RTL) and above. It promises a more transparent and predictable approach than compiler-based methodologies, like High-Level Synthesis (HLS). In this paper, we propose a novel multi-layer Domain-Specific Language (DSL) for SNN accelerator design based on Chisel, allowing for design space explorations that vary neuron models, spike codings, reset behaviors, and even accelerator topologies. Moreover, we propose an SNN accelerator generation framework using this DSL, which covers training to deployment. We explore and evaluate implementations and provide results regarding execution time, Field-Programmable Gate Array (FPGA) resource usage, power consumption, and accuracy.

16:15-17:00 Session 3: Coffee Break + Poster Session – Auditorium Hall
An Open-source Fully Parameterizable Fully Synthesizable Matrix Multiplication Library for modern AMD FPGAs - WiP

ABSTRACT. One common characteristic of High-Performance Computing (HPC) and Cyber-Physical Systems (CPS) is their need for heterogeneous energy-efficient solutions. In this work we present a library for FPGA-accelerated dense matrix multiplication which is flexible, open-source and has no dependencies on the actual hardware implementation tools. Our library is designed to support arbitrary array sizes and accuracy, making it a versatile and adaptable solution that meets the diverse computational requirements of applications all the way from CPS to HPC. The library is complemented by an innovative optimization flow that (i) minimizes the crucial accesses to external Dynamic Random-Access Memories (DRAMs) and/or state-of-the-art High Bandwidth Memories (HBMs), (ii) optimizes the on-chip memory accesses, (iii) reduces the kernel computations' latency and (iv) takes full advantage of the resources of both small and large modern FPGAs. Our approach fills a significant gap in the open-source domain by providing a comprehensive approach for accelerating FPGA-tailored matrix multiplication, simplifying this complex, and very often cumbersome, process which is inherent in a wide range of applications from scientific simulations to machine learning and data analytics. One significant benefit of such a synthesizable library is that it is a straightforward and adaptable solution which efficiently exposes the flexibility and performance of the FPGAs to both novice and expert developers, which is not the case with the black-box libraries provided by the FPGA manufacturers. Our approach has been evaluated on a number of state-of-the-art AMD FPGAs; the end results demonstrate that the presented implementations can achieve 9x, 34x and 3x gains, in terms of energy efficiency, when compared with embedded, high-end CPUs and GPUs, respectively. Moreover, our solution matches or slightly outperforms the most advanced similar FPGA-tailored approach while also being much more flexible, designer-friendly and library-independent.

Immersive Environments with Haptic Technology for the Control of an Industrial Robotic Arm - WIP

ABSTRACT. Immersive Environments with Haptic Technology for the Control of an Industrial Robotic Arm

Culsans: An Efficient Snoop-based Coherency Unit for the CVA6 Open Source RISC-V application processor - WIP

ABSTRACT. Culsans: An Efficient Snoop-based Coherency Unit for the CVA6 Open Source RISC-V application processor

17:00-18:30 Session 4A: Networking and Application of AI
Location: Room 109
17:00
ecoNIC: Saving Energy through SmartNIC-based Load Balancing of Mixed-Critical Ethernet Traffic

ABSTRACT. In next-generation automotive, industrial, data center, and other mixed-critical networks, Ethernet is expected to power the backbone interconnect among multi-core compute nodes. On attached Network Interface Cards (NICs), Receive Side Scaling (RSS) supports the CPU in balancing workloads across cores for reduced tail latencies. However, state-of-the-art solutions are primarily designed for performance and less for energy efficiency, which will play an equally important role. For this reason we present ecoNIC, an RSS-based hardware load balancer for SmartNICs, and an agile Dynamic Voltage and Frequency Scaling (DVFS) governor, for energy-saving network processing. ecoNIC efficiently pins flow priorities to CPU core clusters, reducing the workload of select cores in the process, and dynamically adjusts their clock speed to exploit freed-up capacities and save energy. Within a cluster, it proactively redirects packet bursts of priority-separated flow bundles among available cores, or offloads them to neighbor nodes, once local resources tend to become highly loaded. In this way, the per-core energy consumption is reduced at the expense of low-priority packet latencies, while high-priority service qualities are maintained. Experimental evaluations applying real-world network traces yield energy savings of up to 37.9% at an increase from 559 µs to 3.06 ms in low-priority end-to-end tail latency.

17:30
Exploiting Virtual Layers and Pruning for FPGA-based Adaptive Traffic Classification

ABSTRACT. Traffic classification is crucial to many network administration tasks, from resource management to QoS monitoring. While DNNs are the state-of-the-art method for classifying traffic, they are computationally intensive, posing challenges to their adoption within the networking infrastructure. One popular alternative is exploiting the FPGAs in the Smart Network Interface Cards (SmartNICs) to speed up the DNN inference processing. In this context, there are two main obstacles. First, modern networks experience high-volume and very volatile traffic flow, making the design of in-network accelerators difficult. Second, the use of FPGA-enabled SmartNICs involves reconfiguration when changing the classification task, which leads to significant time and energy overheads. In this work, we propose two different but complementary solutions to the aforementioned challenges: the use of pruning, which dynamically removes parts of the DNN to speed up its processing at the cost of controlled accuracy drops, therefore adapting the inference processing to the constantly changing traffic; and Hardware Virtual Layers (HWVL), which eliminate the need for FPGA reconfigurations for seamless and almost instantaneous task switching. Both approaches are combined in the Spyke Framework, improving throughput by up to 1.46x and reducing the energy per inference by up to 1.37x compared to a state-of-the-art FPGA accelerator.

18:00
CNN-LSTM implementation methodology on SoC FPGA for Human Action Recognition based on Video

ABSTRACT. The growing use of AI-driven video applications like surveillance or healthcare monitoring underscores the need for embedded solutions capable of accurately categorizing human actions in real-time videos. A methodology is proposed for implementing a customized CNN-LSTM architecture on AMD-Xilinx SoC FPGA devices for human action categorization from video data. In this approach, CNN operations are accelerated by the Vitis-AI DPU within the FPGA, offering flexibility to support a range of CNN architectures without requiring individual hardware description language development. This adaptability is crucial given the varying performance of CNN models across datasets. LSTM operations are executed on the SoC processors, overcoming limitations in the support provided by DPU IP cores for such networks, while maintaining flexibility to assess different configurations. Additionally, a pipeline strategy is proposed to enable parallel execution of both CNN and LSTM components, optimizing resource utilization and minimizing idle times. To demonstrate the validity of the proposed implementation methodology, experiments were conducted on the ZCU102 development board, equipped with a Zynq Ultrascale+ MP-SoC, and involved the use of the VGG16 CNN model along with the exploration of different LSTM configurations. The results demonstrate remarkable computational performance, achieving frame rates of up to 44.34 FPS for videos recorded at a resolution of 320x240 pixels, surpassing real-time requirements. Additionally, the proposed implementation maintains high accuracy levels, exemplified by the single bidirectional LSTM layer achieving a competitive accuracy of 73.33% on the UCF101 dataset.

18:15
PowerYOLO: Mixed Precision Model for Hardware Efficient Object Detection with Event Data

ABSTRACT. The performance of object detection systems in automotive solutions must be as high as possible, with minimal response time and, due to the often battery-powered operation, low energy consumption. When designing such solutions, we therefore face challenges typical for embedded vision systems: the problem of fitting algorithms of high memory and computational complexity into small low-power devices. In this paper we propose PowerYOLO - a mixed precision solution, which targets three essential elements of such applications. First, we propose a system based on a Dynamic Vision Sensor (DVS), which has low power requirements and operates well in conditions with variable illumination. It is these features that may make event cameras a preferential choice over frame cameras in some applications. Second, to ensure high accuracy and low memory and computational complexity, we propose to use 4-bit width Powers-of-Two (PoT) quantisation for convolution weights of the YOLO detector, with all other parameters quantised linearly. Finally, we take advantage of the PoT scheme and replace multiplication with bit-shifting to increase the efficiency of hardware acceleration of such a solution, with a special convolution-batch normalisation fusion scheme. All this leads to an almost 8x reduction in memory complexity for a solution achieving a high accuracy of mAP 0.301 on the GEN1 DVS dataset.
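
As an illustration of why Powers-of-Two weights allow multiplication to be replaced by bit-shifting, the following minimal Python sketch may help. It is not the PowerYOLO implementation: the 4-bit layout (1 sign bit plus a 3-bit exponent), the restriction to magnitudes of at most 1, and the function names `pot_quantize` and `pot_multiply` are all assumptions made for this example.

```python
import math


def pot_quantize(w: float, exp_bits: int = 3) -> tuple[int, int]:
    """Round w to sign * 2**(-e) with e in [0, 2**exp_bits - 1].

    Assumed encoding: 1 sign bit + exp_bits exponent bits (4 bits total
    by default). Returns (sign, e).
    """
    sign = -1 if w < 0 else 1
    max_e = (1 << exp_bits) - 1
    if w == 0:
        # Zero has no exact PoT code here; use the smallest magnitude.
        return sign, max_e
    e = round(-math.log2(abs(w)))   # nearest exponent in the log domain
    e = min(max(e, 0), max_e)       # clamp to the representable range
    return sign, e


def pot_multiply(x: int, sign: int, e: int) -> int:
    """Multiply integer activation x by sign * 2**(-e) with a right shift."""
    return sign * (x >> e)


sign, e = pot_quantize(0.3)         # nearest PoT weight is 0.25 = 2**-2
print(sign, e)                      # 1 2
print(pot_multiply(64, sign, e))    # 16, i.e. 64 * 0.25 without a multiplier
```

The shift replaces a full multiplier in hardware, which is the efficiency argument made in the abstract; the rounding error introduced by snapping 0.3 to 0.25 is the accuracy cost that the quantisation scheme has to keep under control.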

17:00-18:30 Session 4B: Architectures and Hardware for Security Applications & Computer-Aided-Design of circuits and systems
Location: Room 106
17:00
Two's Complement: Monitoring Control Flow using Both Power and Electromagnetic Side Channels

ABSTRACT. Embedded devices leak information about their inner activity through power and EM side channels. A defender who measures this leakage can thus use it to monitor the device and ensure its control-flow integrity. Previous works have investigated the use of power and EM side channels for control-flow monitoring, but they have only used a single side channel at a time. In this paper, we propose an approach that integrates both power and EM side channels to detect deviations from the device's normal behavior. Our model takes inspiration from multimodal machine learning used in image and speech recognition, and uses an intermediate integration design which passes multiple input modalities in parallel through a single self-attention transformer network. We evaluate our model on an off-the-shelf device at multiple noise levels, and show that it outperforms models that use only a single channel as input. In particular, we show how the multimodal approach can improve trace classification and anomaly detection accuracies by up to 18% and 11%, respectively, compared to power/EM-only approaches. Additionally, we show that our approach is superior to the early and late integration approaches currently used in multimodal side channel analysis work. We release our machine-learning architecture, including trained models based on real-world data, as an open-source repository. Our work highlights how advances in the wider field of machine learning can be used to improve the security of embedded systems.

17:30
Resistance of Radiation Tolerant TMR Shift Registers to Optical Fault Injections - WiP

ABSTRACT. Protection of information is essential for IoT devices such as wireless sensor nodes, embedded systems, control systems, etc. Such devices are often subject to lab analysis with the objective of revealing secret hidden information. One of the ways to reveal the cryptographic key is to perform optical Fault Injection attacks. In this work, we investigated the IHP radiation-tolerant shift registers originating from standard library flip-flops. To avoid faults, the registers are built of Triple Modular Redundant flip-flops. Redundancy is generally a means to increase the resistance of a design to faults and may be a promising approach against register manipulation in cryptographic devices. In our experiments, we were able to inject different transient faults into TMR registers using a single laser beam.
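
For readers unfamiliar with Triple Modular Redundancy, the sketch below shows the general principle of bitwise majority voting over three register copies. It is a generic illustration, not a model of the IHP registers: the 8-bit value and the fault positions are made up, and the abstract does not state by which mechanism the laser faults got past the redundancy.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority vote over three redundant copies."""
    return (a & b) | (a & c) | (b & c)


stored = 0b10110010                 # hypothetical register content
copies = [stored, stored, stored]

# A fault that flips a bit in only one copy is outvoted by the other two.
copies[1] ^= 0b00000100
print(majority(*copies) == stored)  # True

# If the same bit is also flipped in a second copy, the voter follows
# the faulty majority and the stored value is corrupted.
copies[2] ^= 0b00000100
print(majority(*copies) == stored)  # False
```

The voter masks any fault confined to a single copy, so a successful injection has to disturb at least two copies of the same bit, or the voter itself, within the same time window.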

17:45
An HLS algorithm for the direct synthesis of complex control flow graphs into finite state machines with implicit datapath

ABSTRACT. In this paper, we introduce an efficient algorithm for automating the direct transformation of a control flow graph (CFG) into a finite-state machine with implicit datapath (FSMD). In our opinion, this transformation has not received sufficient attention: although the passage from a CFG to an FSMD is mentioned in many textbooks on digital system design, and presented as trivial, to our knowledge it has never been explicitly formulated nor automated. Our experience shows, moreover, that this process is trickier than claimed. We believe our algorithm can become a key transformation for high-level synthesis (HLS) of control-dominated applications presenting a low level of instruction-level parallelism. Our paper presents the algorithm in detailed executable form. Experimental measurements carried out on synthetic benchmarks show its effectiveness.

18:00
Integrated Mapping and Scheduling Optimization with Genetic Algorithms

ABSTRACT. Integrated Mapping and Scheduling (IMS) problems can be found in many domains, such as electronic design automation (EDA) and modern manufacturing systems. Optimization algorithms that solve IMS problems can be used to minimize execution time, implementation cost, energy consumption, etc. Genetic Algorithms (GAs) are powerful evolutionary algorithms for tackling many such IMS optimization problems. By utilizing biological principles like selection, crossover, and mutation, GAs excel in generating high-quality solutions. Chromosome encoding and decoding, in addition to evolutionary operators, significantly influence a GA's efficiency. This paper introduces a relative-priority genetic algorithm (RPGA), a novel GA for IMS problems, such as those in EDA. RPGA employs a unique encoding and decoding mechanism tailored for IMS problems, especially those with OR nodes representing alternative operation paths. It encodes the relative priority of an operation in a chromosome, which can be divided into two parts: one for path selections and the other for operation scheduling and mappings. Efficient decoding of every chromosome into a solution is facilitated through the concept of a ready operation set. The study extensively compares RPGA with established meta-heuristics using a benchmark set. The experimental results demonstrate that RPGA achieves high solution quality and rapid convergence.

18:15
SplitMS: Split Modulo-Scheduling for Accelerating Loops onto CGRAs

ABSTRACT. Coarse-Grained Reconfigurable Array (CGRA) architectures are popular for accelerating loop kernels due to a good balance between energy efficiency and flexibility. Modulo scheduling (MS) is the preferred solution for efficiently mapping loop Data Flow Graphs (DFGs) onto CGRAs. Existing CGRA MS algorithms suffer from low resource utilization if the number of operation nodes is less than the number of Processing Elements (PEs) in the CGRA. To improve instruction level parallelism (ILP), the common approach is to unroll the loop before applying MS. However, finding valid MS solutions for larger DFGs becomes difficult for heterogeneous CGRAs with resource constraints. In this paper, we propose a novel Split Modulo-Scheduling (SMS) technique to improve the ILP by segmenting the target CGRA into clusters and mapping loop chunks. We present the algorithm to find the optimum split for the best performance and the hardware approach to support cluster execution of loop iterations. Experiments show that SMS for a 4×[2×2] CGRA cluster achieves an average speedup of 2.8× over MS for a 4×4 target CGRA with 8 Load-Store Units (LSUs). On average, SMS increases the PE utilization by 2.9× and the energy efficiency by 3× over the conventional MS approach. The success rate for SMS is 100% for a wide range of Polybench loop kernels, whereas the traditional unroll-then-MS approach finds valid mappings in only 20% of cases.

17:00-18:30 Session 4C: Sustainable Digital System Design
Location: Room 116
17:00
Streamlined Models of CMOS Image Sensors Carbon Impacts

ABSTRACT. With the escalating concern about global warming, the environmental impact of electronic devices must be scrutinized. Life Cycle Assessments (LCA) reveal that Integrated Circuits (ICs) are the primary contributors to greenhouse gas emissions in these devices. However, performing an inventory to determine the ICs' impact is a complex task due to missing data, and the existing studies on ICs have been neglecting CMOS Image Sensors (CIS). Despite the surge in CIS usage, particularly in smartphones, there is a lack of comprehensive models to assess their environmental impact. This paper proposes a multi-level set of models that leverage available information while considering the specificities of CIS. The most comprehensive model incorporates factors such as the total silicon area, geographical location (influencing the energy mix), and the technology node. To accommodate scenarios with incomplete data, subsequent models are designed to effectively utilize averaged parameters. The proposed models are applied to sensors manufactured by STMicroelectronics and Sony, and the results are compared with existing LCA results, such as those from Fairphone. Our approach provides a more comprehensive understanding of the environmental impact of CIS, contributing to the broader goal of reducing the carbon footprint of electronic devices. Our results suggest that the carbon impact of a Fairphone 4 image sensor is underestimated by a factor of 3.6.

17:30
HAHMF: Heuristic-Augmented Asymmetric Heterogeneous Splitting for Hardware Efficient Multipliers Framework

ABSTRACT. Achieving energy-efficient arithmetic operations is crucial for sustainable circuit design in modern digital systems. One computational bottleneck is the multiplication of operands with unequal bit-widths, which often necessitates zero-padding and leads to unnecessary hardware overhead and power dissipation. This work tackles this challenge by proposing an optimized asymmetric bit-width multiplication technique. The framework determines the optimal split for asymmetric multiplication by considering the product of critical path delay, silicon occupied area, and power incurred by the design (PDAP) as the fitness function. Compared to conventional 64 x 64 bit padded multiplication for unequal operand bit-size, the optimized 64 x 32 bit asymmetric multiplication achieves an impressive 79.73% gain in PDAP. Furthermore, the heterogeneous split approach outperforms the homogeneous 64 x 32 bit multiplication of equal 32-32 bit splits by 13.83% in PDAP, highlighting the advantages of heterogeneous splitting for asymmetric bit widths. The proposed framework is validated up to 128 bits, starting from 16-bit, by incorporating splits of both even and odd bit widths. Engineered with a particle swarm optimization (PSO) technique, the framework handles multiplication between operands of any bit-width, making it a unique and versatile solution for applications requiring efficient asymmetric bit-width multiplication, such as sustainable circuit design. The multiplier design framework and files are freely available for researchers and designers in the community.
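
The PDAP fitness function described above is the product of critical path delay, area, and power, so a lower value is better. A minimal Python sketch of how a percentage gain between two designs could be computed follows. The figures are invented for illustration and are not results from the paper, and the definition of "gain" as relative PDAP reduction is an assumption.

```python
def pdap(delay_ns: float, area_um2: float, power_mw: float) -> float:
    """Power-delay-area product used as a (lower-is-better) fitness value."""
    return delay_ns * area_um2 * power_mw


def pdap_gain_percent(baseline: float, candidate: float) -> float:
    """Relative PDAP reduction of candidate versus baseline, in percent.

    Assumed definition of 'gain'; the abstract does not spell it out.
    """
    return 100.0 * (baseline - candidate) / baseline


# Hypothetical synthesis figures for a padded and an asymmetric multiplier.
padded = pdap(delay_ns=2.0, area_um2=100.0, power_mw=5.0)      # 1000.0
asymmetric = pdap(delay_ns=1.0, area_um2=50.0, power_mw=4.0)   # 200.0
print(pdap_gain_percent(padded, asymmetric))                   # 80.0
```

Because the three factors are multiplied, an optimizer such as PSO can trade a small increase in one metric for a larger reduction in another, as long as the product decreases.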

18:00
Towards Sustainable Electronic Design Automation Flows: A Joint Approach Based on Complexity Metrics - WiP

ABSTRACT. This paper addresses sustainability criteria and needs in the area of Electronic Design Automation (EDA). We aim to optimize the operational stages of an EDA flow and present a series of investigations to reduce its carbon footprint. Our main purpose is to optimize the design flow, taking into account sustainability criteria, to reduce the environmental impact of EDA tools. First, metrics that correlate sustainability with design project complexity are provided and implemented as part of an EDA design solution, which is commercialized by INNOVA Advanced Technologies. Second, the INNOVA design solution provides job scheduling based on sustainability criteria. Typical case studies are provided in this paper.

18:15
opoSoM: A Modular Measurement Platform for Dynamic Power Consumption of SoCs - WiP

ABSTRACT. Software can have a significant impact on the electrical characteristics of the executing integrated circuit. The analysis of minor current consumption changes in a System-on-Chip reveals details about the executed instructions or the hardware's internal logic, potentially exposing sensitive information and thereby increasing the surface for side-channel attacks. Despite careful design, unwanted signal transitions (glitches) pose a further challenge that needs to be handled at the hardware and software level. To address these problems, an appropriate measurement platform is indispensable. In this paper, we propose the open-source opoSoM modular measurement platform to capture the dynamic power characteristics of System-on-Chips. It features external and on-chip measurement techniques to support a wide spectrum of System-on-Chips. Due to the configurable measurement range and synchronous sampling at up to 250 MS/s, the platform provides valuable measurement data for investigating countermeasures against side-channel attacks and optimizing hardware and software towards lower dynamic power consumption.