Hussam Amrouch (Technical University of Munich (TUM), Germany)
10:15 | An Energy-Efficient and Area-Efficient Depthwise Separable Convolution Accelerator with Minimal On-Chip Memory Access PRESENTER: Yi Chen ABSTRACT. In this paper, we present a hardware accelerator for DSC that enables 100% utilization of the processing element (PE) array for depthwise convolution (DWC) and up to 98% utilization for pointwise convolution (PWC), while also reducing latency. By partitioning the input feature map (ifmap) SRAM of the DWC into three banks, we minimize memory access and maximize data reuse. The input activations and weights only need to be loaded once from SRAM to PE for both DWC and PWC. Additionally, to support efficient operations across different layers, we present a layerwise matching method. The proposed DSC accelerator is implemented in 22nm FDSOI technology and validated using MobileNetV1 on the CIFAR10 dataset. The post-layout results demonstrate that the proposed accelerator can operate at 1GHz and achieve an energy efficiency of 5.07 (3.96) TOPS/W and an area efficiency of 519.2 (461.52) GOPS/mm2 for DWC (PWC) at 0.8V. After scaling the supply voltage down to 0.5V, the energy efficiency for the proposed accelerator increases to 13.64 TOPS/W for DWC and 10.64 TOPS/W for PWC, respectively. |
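For context, the arithmetic saving that motivates depthwise separable convolution (DSC) can be sketched with a simple operation count. This follows the usual textbook definition of DWC followed by PWC and says nothing about the accelerator's dataflow or PE utilization. The function names and the example layer shape are illustrative assumptions, not values from the paper.

```python
def standard_conv_macs(h, w, c_in, c_out, k):
    """MACs for a standard k x k convolution with an h x w x c_out output."""
    return h * w * c_in * c_out * k * k


def dsc_macs(h, w, c_in, c_out, k):
    """MACs for depthwise (k x k per channel) plus pointwise (1 x 1) convolution."""
    depthwise = h * w * c_in * k * k      # one k x k filter per input channel
    pointwise = h * w * c_in * c_out      # 1 x 1 convolution mixing the channels
    return depthwise + pointwise


# Hypothetical layer shape, chosen only to make the ratio concrete.
shape = dict(h=32, w=32, c_in=64, c_out=128, k=3)
ratio = dsc_macs(**shape) / standard_conv_macs(**shape)
print(f"DSC needs {ratio:.3f} of the standard MAC count")
```

The ratio simplifies to 1/c_out + 1/k², so for 3×3 kernels the MAC count falls to roughly one ninth once the output channel count is large.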
10:33 | Mapping of CNNs on multi-core RRAM-based CIM architectures ABSTRACT. RRAM-based multi-core systems improve the energy efficiency and performance of CNNs. However, the distributed parallel execution of convolutional layers causes critical data dependencies that limit the potential speedup. This paper presents synchronization techniques for parallel inference of convolutional layers on RRAM-based CIM architectures. We propose an architecture optimization that enables efficient data exchange and discuss the impact of different architecture setups on the performance. The corresponding compiler algorithms are optimized for high speedup and low memory consumption during CNN inference. We achieve more than 99% of the theoretical acceleration limit with a marginal data transmission overhead of less than 4% for state-of-the-art CNN benchmarks. |
10:51 | Accelerating Large Kernel Convolutions with Nested Winograd Transformation ABSTRACT. Recent literature has shown that convolutional neural networks (CNNs) with large kernels outperform vision transformers (ViTs) and CNNs with stacked small kernels in many computer vision tasks, such as object detection and image restoration. The Winograd transformation helps reduce the number of repetitive multiplications in convolution and is widely supported by many commercial AI processors. Researchers have proposed accelerating large kernel convolutions by linearly decomposing them into many small kernel convolutions and then sequentially accelerating each small kernel convolution with the Winograd algorithm. This work proposes a nested Winograd algorithm that iteratively decomposes a large kernel convolution into small kernel convolutions and proves it to be more effective than the linear decomposition Winograd transformation algorithm. Experiments show that compared to the linear decomposition Winograd algorithm, the proposed algorithm reduces the total number of multiplications by 1.4 to 10.5 times for computing 4×4 to 31×31 convolutions. |
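The multiplication saving that the Winograd transformation provides can be illustrated with the smallest standard case, F(2,3): two outputs of a 3-tap filter computed with 4 data multiplications instead of 6. This is a minimal sketch of the basic one-dimensional transform only. It is not the nested algorithm proposed in the talk, and the function names are illustrative.

```python
from fractions import Fraction


def direct_f23(d, g):
    """Two outputs of a 3-tap filter over four inputs: 6 multiplications."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]


def winograd_f23(d, g):
    """The same two outputs using 4 data multiplications."""
    half = Fraction(1, 2)
    # Filter transform: depends only on g, so it can be precomputed once.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) * half
    u2 = (g[0] - g[1] + g[2]) * half
    u3 = g[2]
    # Input transform followed by the four element-wise multiplications.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform: additions and subtractions only.
    return [m1 + m2 + m3, m2 - m3 - m4]


print(direct_f23([1, 2, 3, 4], [5, 6, 7]))
print([int(y) for y in winograd_f23([1, 2, 3, 4], [5, 6, 7])])
```

Exact fractions are used so the two results can be compared without rounding effects; a hardware implementation would use fixed-point or floating-point arithmetic instead.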
11:09 | ADaMaT: Towards an Adaptive Dataflow for Maximising Throughput in Neural Network Inference ABSTRACT. With the development of research in hardware for Convolutional Neural Network (CNN) algorithms, it becomes crucial to examine the different aspects of hardware design. CNNs are mainly used in computer vision applications, and translating these algorithms into hardware calls for adopting an appropriate dataflow to improve the utilisation of hardware resources, resulting in higher throughput. In particular, the inference task at each neuron position can be assigned to a compute unit in the hardware accelerator, and several such neuron positions can be completed in parallel. We observe that adopting a static dataflow for an architecture can result in the under-utilisation of resources because of the different dimensions of the data in the network. The motivation of this paper is built upon the need for an adaptive dataflow to improve the multiply-and-accumulate (MAC) utilisation in CNNs. We propose a method, ADaMaT, which adapts the dataflow at runtime by appropriately assigning tasks to the MAC units depending on the dimensions of the layers instead of a pre-determined assignment. The adaptive assignment tries to maximise the MAC utilisation and improve the throughput. We have performed a comparative analysis among different static dataflows and our proposed ADaMaT dataflow. |
11:27 | Sparsity Controllable Hyperdimensional Computing for Genome Sequence Matching Acceleration ABSTRACT. In this paper, we propose a Hyper-Dimensional genome analysis platform. Instead of working with original sequences, our method maps the genome sequences into high-dimensional space and performs sequence matching with simple and parallel similarity searches. At the algorithm level, we revisit the sequence searching with brain-like memorization that Hyper-Dimensional computing natively supports. Instead of working on the original data, we map all data points into high-dimensional space, enabling the main sequence searching operations to be processed in a hardware-friendly way. We accordingly design a density-aware FPGA implementation. Our solution searches the similarity of an encoded query and large-scale genome library through different chunks. We exploit the holographic representation of patterns to stop search operations on libraries with a lower chance of a match. This translates our computation from dense to highly sparse after just a few chunk-based searches. Our evaluation shows that our accelerator can provide 46× speedup and 188× energy efficiency improvement compared to a state-of-the-art GPU implementation. Results show that our accelerator achieves up to 3440.6 GCUPS using a single Xilinx Alveo U280 board. |
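The idea of mapping sequences into high-dimensional space and matching them by similarity search can be sketched with a generic hyperdimensional encoder. This is an illustrative sketch only: the dimensionality `DIM`, the bipolar base vectors, the trigram binding scheme, and all function names are assumptions. It does not reproduce the talk's encoder, its sparsity control, or the FPGA design.

```python
import random

DIM = 4096  # hypothetical hypervector dimensionality
rng = random.Random(0)
# One random bipolar hypervector per nucleotide (assumed base encoding).
BASE = {b: [rng.choice((-1, 1)) for _ in range(DIM)] for b in "ACGT"}


def rotate(v, n):
    """Cyclic shift, used here as the position-marking permutation."""
    n %= len(v)
    return v[-n:] + v[:-n] if n else list(v)


def encode(seq, n=3):
    """Bundle (sum) the bound (element-wise multiplied) n-gram hypervectors."""
    total = [0] * DIM
    for i in range(len(seq) - n + 1):
        gram = [1] * DIM
        for j, base in enumerate(seq[i:i + n]):
            shifted = rotate(BASE[base], n - 1 - j)
            gram = [a * b for a, b in zip(gram, shifted)]
        total = [t + x for t, x in zip(total, gram)]
    return total


def cosine(u, v):
    """Cosine similarity between two hypervectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)


library_chunk = encode("ACGTACGTAC")
print(cosine(library_chunk, encode("ACGTACGTAC")))  # matching query
print(cosine(library_chunk, encode("TTTTTTTTTT")))  # unrelated query
```

A matching query scores close to 1, while a query with no shared trigrams scores close to 0, which is what makes a simple threshold on similarity usable as a match test.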
Chun-Jen Tsai (National Yang Ming Chiao Tung University, Taiwan)
13:00 | Optimizing Constrained-Modulus Barrett Multiplier for Power and Flexibility ABSTRACT. Fully Homomorphic Encryption (FHE) promises data protection by computing on encrypted data, but demands resource-intensive computation. FHE hardware accelerators, which improve FHE scheme performance with densely packed computing units, could potentially damage the chip with excessive heat dissipation because of high power consumption. Therefore, it is necessary to reduce the power consumption of the accelerator and its most critical module, i.e., the modular multiplier. In this work, we extend the idea of allowing a specific form of modulus to achieve a low-power modular multiplier. The proposed work can reduce power consumption by 15% and area by 20%. We also propose an approximation for the number of moduli available with the discussed constraints on the modulus form. |
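As background for the modular multiplier discussed above, textbook Barrett reduction replaces the division in `(a * b) mod q` with a precomputed constant, two multiplications, and a shift. The sketch below shows only this generic form; the constrained modulus shape and the power optimizations from the talk are not modeled, and the function names are illustrative.

```python
def barrett_setup(q):
    """Precompute k = bit length of q and mu = floor(4**k / q) for q > 1."""
    k = q.bit_length()
    mu = (1 << (2 * k)) // q
    return k, mu


def barrett_mulmod(a, b, q, k, mu):
    """Return (a * b) mod q for 0 <= a, b < q without dividing by q."""
    x = a * b
    t = (x * mu) >> (2 * k)  # estimate of floor(x / q); never an overestimate
    r = x - t * q
    while r >= q:            # small number of correction subtractions
        r -= q
    return r


q = 7681  # example prime modulus, chosen only for illustration
k, mu = barrett_setup(q)
print(barrett_mulmod(1234, 5678, q, k, mu), (1234 * 5678) % q)
```

Restricting the modulus to a special form, as the abstract describes, is one way to simplify the multiplications by `mu` and `q` in hardware.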
13:18 | PRESENTER: Klajd Zyla ABSTRACT. Data centers have been struggling to provide the necessary processing capacity to handle the surging rate of network traffic that is generated in an increasingly connected and service-oriented world. As a result, SmartNICs play an even more important role than before as they can offload various network applications and hence free CPU resources for application-layer processing, increase performance and reduce processing time. However, they often do not support flows with different offload requirements and cannot dynamically allocate offloads in runtime. In order to address these limitations, we propose FlexPipe, a fast, flexible and scalable packet-processing architecture for high-performance SmartNICs. Our design enables low-latency and runtime-reconfigurable packet forwarding at high traffic rates with minimal area overhead. Furthermore, it provides load-aware packet steering toward multiple offload units of the same type for low-bandwidth offloads. We implement a prototype of FlexPipe in Verilog and validate it via cycle-accurate register-transfer level simulations. Our evaluation results show that FlexPipe can process packets of arbitrary sizes with different offload requirements at line rate and on average 1.9x faster than a SmartNIC with a predefined sequence of offloads and 1.8x faster than PANIC, a flexible state-of-the-art SmartNIC. |
13:36 | REMOTE: Re-thinking Task Mapping on Wireless 2.5D Systems-on-Package for Hotspot Removal ABSTRACT. 2.5D Systems-on-Package (SoPs) are composed of several chiplets placed on an interposer. They are becoming increasingly popular as they enable easy integration of electronic components in the same package and high fabrication yields. Nevertheless, they introduce a new bottleneck in inter-chiplet communication, which must be routed through the interposer. Such a constraint favors mapping related tasks on computing cores within the same chiplet, leading to thermal hotspots. In-package wireless technology holds promise to reconsider such a position because integrated wireless antennas provide low-latency and high-bandwidth communication paths, thus bypassing the interposer bottleneck. In this work, we propose a new task mapping heuristic that leverages in-package wireless technology to improve the thermal behavior of 2.5D SoPs executing complex applications. Combining system simulation and thermal modeling, our results show that we can distribute computation in wireless 2.5D SoPs to reduce peak temperatures by up to 24% through task mapping with a negligible performance impact. |
13:54 | ABSTRACT. This paper addresses the optimization of embedded platforms to meet the computing and real-time requirements of cyber-physical systems and IoT applications, including embedded intelligence. In this context, schedulers are vital in enhancing processor utilization in industrial contexts. Although existing research has focused primarily on the schedulability of periodic tasks, event-driven tasks better represent these new embedded intelligence scenarios in the real world. This work explores static and dynamic scheduling policies within a general scenario and a specific case study based on an actual industrial application. The proposed dynamic scheduler has been integrated into the FreeRTOS kernel and has been employed to conduct all of our experiments on industrial products within the smart home domain. Our results show that, while we can respect real-time requirements, our proposed dynamic scheduling can improve the performance of event-driven applications by reducing missed task deadlines by up to 60%. Moreover, we have also developed a lightweight version of our dynamic scheduler for industrial products that reduces average timing overhead for task selection and insertion by up to 34.7% and memory overhead for task creation and list scheduling by up to 74.7% compared to state-of-the-art static alternatives. |
14:12 | Look before you leap: An Access-based Prudent Page Migration for Hybrid Memories ABSTRACT. Hybrid memory composed of DRAM and PCM exploits the benefits of both types of memory. Random page placement in such memories may cause write-intensive pages to be placed in the PCM partition, which may adversely affect memory performance due to the higher write latency of PCM. Migrating write-intensive pages to DRAM helps improve memory service time. Existing techniques migrate pages whose write access count is greater than a predefined threshold. These techniques do not examine the access pattern once the choice to migrate the page has been made, which might lead to unnecessary migrations because the page may have been hot before the decision, but the number of accesses may have dropped after migration. To accurately identify hot pages, we propose an access-based prudent page migration method which uses an eDRAM buffer to migrate hot pages from PCM to DRAM. In this paper, we present a look-before-you-leap migration technique that, after a page is identified as hot, makes a deliberate decision about whether or not to migrate it. |
Mohsen Imani (University of California, Irvine, United States)
13:00 | Memory-based computing for energy-efficient AI: Grand challenges |
TRAPDOOR: Repurposing neural network backdoors to detect dataset bias in machine learning-based genomic analysis ABSTRACT. The use of Machine Learning (ML) to understand underlying patterns in gene mutations (genomics) has far-reaching results in diagnosis and treatment for life-threatening diseases like cancer. The success and sustainability of ML algorithms depend on the quality and diversity of training data, and under-representation of groups (gender, race, etc.) can lead to exacerbation of systemic discrimination issues. In this work, we propose TRAPDOOR, a methodology for the identification of biased datasets by repurposing otherwise malicious neural backdoors. Our methodology can leak potential bias information about the cloud's dataset, which is collected in a collaborative setting, without hampering the genuine performance. Using a real-world cancer genomics dataset, we analyze the feasibility of leaking bias for gender and race attributes. Our experimental results show that TRAPDOOR can detect the presence of dataset bias with 100% accuracy, and furthermore can also extract the extent of bias by recovering the percentage with a small error. |
Optimized Quantum Circuit Implementation of Payoff Function PRESENTER: Siyi Wang ABSTRACT. Large-scale quantum computers that can execute practical quantum algorithms have the potential to solve complex problems that are currently challenging for classical computers. This involves converting these problems into a form that can be processed by quantum circuits, a crucial process that requires minimizing quantum resources like qubit count, gate count, and circuit depth. Our work focuses on implementing and optimizing the function ${f_K(S) = \max(S-K,0)}$, which is a fundamental task in quantum finance known as option pricing, using quantum circuits. Taking into consideration the significant trade-offs between qubit count and circuit depth, we have developed quantum circuits for the optimized implementation of the $f_K(S)$. Our work incorporates various optimization techniques for the circuit, such as selecting the optimal adder, optimizing the $S-K$ operation, parallelization, and qubit reuse. Furthermore, we offer various versions of our quantum circuits for the $f_K(S)$, each featuring different adders and Toffoli decompositions, thereby providing flexibility for a wide range of use cases. |
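A classical reference model of the payoff function $f_K(S) = \max(S-K,0)$ helps make the role of the subtraction concrete. The sketch below assumes unsigned `n`-bit integer encodings of `S` and `K` and uses the sign bit of an `(n+1)`-bit two's-complement difference to select the output. Both the encoding and the sign-bit selection are illustrative assumptions rather than the circuit construction from the talk.

```python
def payoff_bits(s, k, n):
    """Compute max(s - k, 0) for n-bit unsigned s and k using a sign bit."""
    assert 0 <= s < (1 << n) and 0 <= k < (1 << n)
    width = n + 1
    diff = (s - k) % (1 << width)  # two's-complement difference on n+1 bits
    sign = diff >> n               # top bit is 1 exactly when s < k
    return 0 if sign else diff     # the sign bit acts as the selector


print([payoff_bits(s, 5, 4) for s in range(10)])
```

In a reversible setting, the selection on the sign bit would typically be realized with controlled operations, which is where the adder choice and Toffoli decompositions mentioned in the abstract affect qubit count and depth.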
Method for Data-Driven Pruning in Micropipeline Circuits ABSTRACT. Asynchronous micropipeline circuits are an effective alternative to their synchronous counterparts for reducing dynamic power consumption, because they offer proper signaling for disabling useless blocks. In this paper, a novel approach is suggested that uses such signaling to prune irrelevant data. The result is a decrease in switching activity, dynamic consumption, and an increase of the average throughput. The method relies on new control-path elements, which conditionally prune data-path elements. Thanks to these controllers, the registers do not sample new data when pruned and subsequent data propagation is avoided. The former can replace standard controllers in any micropipeline circuit without changing the architecture of the control-path. Furthermore, the outline of the methodology is given for a micropipeline circuit and the method is explained through the illustrative example of a digital filter. |
A RISC-V SoC with Hardware Trojans: Case Study on Trojan-ing the On-Chip Protocol Conversion ABSTRACT. Hardware Trojans (HTs) are a serious security threat to the highly-decentralized, multi-stage production flow of today’s Integrated Circuit (IC) industry. Considerable research efforts have gone into developing methodologies for detecting HTs. A significant issue in validating HT detection algorithms is the lack of open-source benchmarks with the complexity of modern-day System-on-Chips (SoCs). The currently available open-source benchmarks are more elementary and, therefore, do not reveal the actual robustness of the algorithms against false positives and false negatives. To address this issue, we present the design and integration of three kinds of HTs in a RISC-V-based SoC. We explain their functionality and taxonomy in detail. This is the first work that launches trojan attacks targeting the mismatch in the attributes of two widely-used on-chip communication protocols in an SoC. We performed extensive behavioral simulations to verify the functionality of each of these HTs. We estimated the detectability of these HTs by: (i) synthesizing the HT-infested SoC for FPGA and (ii) evaluating them against a Graph Neural Network-based pre-Silicon HT detection tool, automatic test pattern generation, reverse engineering, and formal verification. In a nutshell, this paper demonstrates the risk of HTs and an effective environment for strengthening research on HT detection. |
An Evaluation and Architecture Exploration Engine for CNN Accelerators through Extensive Dataflow Analysis ABSTRACT. Systolic array is one of the popular convolutional neural network accelerator architectures due to its high computation efficiency. Nevertheless, the huge design space and complicated interactions among different design parameters make it hard to find the best configuration for various applications. To overcome this issue, this paper presents an evaluation and design space exploration engine, NNeed, for systolic-array CNN accelerators through extensive dataflow analysis. It uses a highly configurable hardware template to describe accelerator operations in detail. The rapid evaluation provides PPA results, pipeline stage analysis, external memory access statistics, and so on. NNeed explores the 9-dimensional design space and supports multiple objective functions for design optimization. Experimental results show that NNeed can generate an accelerator configuration with up to 23% and 50% improvement in performance and energy as compared with a typical handcrafted design. |
Zero-Trust Communication between Chips PRESENTER: Kais Belwafi ABSTRACT. Outsourcing chip production is a common practice among semiconductor vendors to cope with the increasing demand for integrated circuits. This has resulted in several security issues in the chip supply chain, including hardware trojans, intellectual property theft, and overproduction. The concept of zero trust --never trust, always verify-- presents a promising solution for ensuring the authenticity of Integrated Circuits (ICs), particularly in critical systems where adversary attacks can cause significant losses or damage. The Security Protocol and Data Model (SPDM) is a reliable protocol that uses certificates to ensure the authenticity of ICs. Based on this protocol, the presented paper proposes a chip-to-chip zero-trust security architecture that aims to verify the authenticity of any connected peripheral before its use. The contributions include an overview of the proposed architecture, implementation and formal verification of the SPDM protocol, and analysis of the challenges encountered during the implementation and execution. |
Dynamic Digital Circuit Locking (DDCL): A Shield against Static Analysis Attacks PRESENTER: Ye Ziyang ABSTRACT. With the rise of the fabless business model, security threats, including Intellectual Property (IP) theft, overproduction, counterfeiting, and reverse engineering, have increased. This paper introduces Dynamic Digital Circuit Locking (DDCL) as a method to counteract these threats. At its core, DDCL utilizes dynamic logic gates for locking. These gates mimic the operation of standard logic gates through a dynamic process, thereby exploiting adversaries who depend on static digital circuit analysis. As a result, DDCL can resist all static analysis attacks more effectively than conventional techniques. DDCL surpasses earlier methods by its reliance on logic loops for proper operation, which makes loop breaking attacks less effective. However, the advanced security offered by DDCL also presents challenges, such as increased power consumption and circuit complexity. This paper further examines the structure, security aspects, and comparative performance of DDCL. It underscores its value in multi-vendor scenarios and its compatibility with existing IP cores, which require only minor changes to original designs, thereby illustrating the practical role of DDCL in enhancing hardware security. |
Integrated Dynamic Memory Manager for a RISC-V Processor ABSTRACT. In this paper, we present an open-source RISC-V processor with an integrated dynamic memory manager hardware module. Traditionally, the management of the main memory of a computing system is handled by a software library. However, the process involves searching and manipulating the linked lists of memory blocks, which can be expensive when the memory becomes fragmented. As a result, for embedded systems that have to be online for a long duration, a static data structure is often used to reduce the overhead of dynamic memory management at the cost of less software flexibility. Nevertheless, modern VLSI technology allows the efficient implementation of hardwired resource managers directly in the processor microarchitecture for better performance. As the experiments in this paper show, a hardware memory manager integrated within the processor core can be much more efficient than using a software library. Hardwired resource managers are particularly useful for IoT devices since the processors typically run at a lower clock rate. The proposed architecture is implemented and verified on a Xilinx FPGA development board and will be made open source. |
15:30 | 3.125GS/s, 4.9 ENOB, 109 fJ/Conversion Time-Domain ADC for Backplane Interconnect ABSTRACT. This paper presents a flash Time-Domain ADC with a T/H amplifier, Voltage Controlled Delay Line and Time to Digital Converter. The design operates at 3.125GS/s with 4.9 ENOB and a Walden figure of merit of 109fJ/Conversion. Automatic calibration means are provided as well. For measurement purposes, an integrated memory is provided. It consumes 16.2mW from a 1V supply. It was realized in the 45nm PDSOI from Global Foundries. |
15:48 | ABSTRACT. In Time Division Duplex (TDD) systems, the transmitter and receiver are not ON simultaneously. This is done to avoid saturation of the receiver’s LNA by the high power that can leak into the receiver due to limited isolation. Since either the PA or the LNA is working at a given instant of time, area and power can be saved by designing a circuit that can work as an LNA or a PA at different instants in time. Low power and reduced Si area lead to improved reliability and cost reduction. This work presents the design of a bi-directional amplifier (BDA) in the sub-5GHz range. The design contains a two-stage inductor-less LNA along with transistor-based switches and 50Ω drivers for off-chip testing. The design is implemented in the Skywater 130nm CMOS open-source PDK. The results show 29.44dB voltage gain in reception mode (without buffer) and less than 3.5dB noise figure in the 1-4.5GHz band with input matching of -10dB. The 1-dB compression point in the transmission mode is 19dBm with a power consumption of 58mW. The active chip area of the design is 204x166μm2. |
16:06 | Gain Enhancement of Antenna-on-Chip at 94 GHz with an Integrated Artificial Magnetic Conductor for 6G System-on-Chip ABSTRACT. The silicon-based complementary metal oxide semiconductor (CMOS) process has become one of the most popular processes to realize system-on-chip (SoC). However, as one of the essential components of a wireless SoC, antennas typically suffer from poor radiation because of the highly conductive silicon substrate. Such antennas are known as antenna-on-chip (AoC). To enhance the radiation performance of AoC, an artificial magnetic conductor (AMC) with double periodic strip structure layers is proposed in this paper that can not only provide in-phase reflection but also isolate the antenna from the lossy silicon substrate. The proposed AMC shows a gain enhancement of 5 dB. The AMC-backed AoC is well-matched within 76-123 GHz and provides a boresight gain of 2.5 dBi at 94 GHz. |
16:24 | ABSTRACT. A unity feedback length-extend multistage noise-shaping (MASH) delta-sigma modulator (DSM) is presented in this paper. The proposed length extension technique adds a feedback path, like HK-MASH, and fixes the feedback factor to unity. In this way, only one additional register is needed, which decreases the operating time and the hardware cost of the DSM, and achieves a maximum sequence length of $M-1$ (where $M=2^{n_0}$ and $n_0$ is the input word length). Using this structure, MASH DSMs are optimized with the maximum sequence length extending to $(M-1)^l$ (where $l$ is the order of the MASH DSM) and the minimum extending to $N(M-1)^{l-1}$ (where $N$ is the smallest prime number of $M-1$). This paper proves that the output sequence length is exponentially increased, regardless of the input value. Compared with the classical structure, the proposed MASH 1-1-1 structure shows a spur-free performance at the expense of limited hardware cost. |
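The sequence-length bounds quoted in the abstract can be made concrete with a small worked example. The input word length $n_0 = 4$ and the order $l = 3$ (a MASH 1-1-1) are chosen here purely for illustration, and $N$ is read as the smallest prime factor of $M-1$, which is an interpretation of the abstract's wording rather than something it states outright.

```latex
\begin{align*}
M &= 2^{n_0} = 2^{4} = 16, \qquad M - 1 = 15 = 3 \cdot 5, \qquad N = 3,\\
L_{\max} &= (M-1)^{l} = 15^{3} = 3375,\\
L_{\min} &= N\,(M-1)^{l-1} = 3 \cdot 15^{2} = 675.
\end{align*}
```

Under these assumptions, both bounds grow exponentially with the order $l$, in line with the abstract's claim that the output sequence length increases exponentially regardless of the input value.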
16:42 | PRESENTER: Muhammad Jawad Shakil ABSTRACT. A Flash ADC-assisted fast transient response DC-DC buck converter is proposed in this paper. The proposed inductor-based DC-DC converter operates in continuous conduction mode (CCM). The novelty in the architecture includes replacing the SAR ADC with a Flash ADC, which converts the transient overshoot/undershoot into the corresponding binary code at a faster rate than a SAR ADC. Depending on the set overshoot and undershoot criteria, the current pump injects current into the output node to charge/discharge it, minimizing the overshoot/undershoot and settling time. In the proposed design, the required inductance is achieved by using bond-wire inductance, which reduces the chip's active area and the number of off-chip components. ADS simulations are performed for the estimation of inductor values using the QFN-64 package. The proposed DC-DC converter occupies an active area of 0.91mm2 using the TSMC 130 nm bulk CMOS process. Post-layout simulation results show a peak efficiency of 84% at Vin of 3.3 V and Vout of 1.8 V with a load current IL of 500 mA. The measured overshoot/undershoot in simulation is 70/60 mV with 530/469 ns recovery time and 7 mV output ripple. This enables the designed converter to be used in aerospace, satellite, and military applications, where compactness is required. |