| 10:30 | A Spatial-Designed Computing-In-Memory Architecture Based on Monolithic 3D Integration for High-Performance Systems PRESENTER: Jiaming Li ABSTRACT. Computing-in-memory (CIM) technology effectively addresses the data-movement bottleneck of the traditional von Neumann architecture, especially for deep neural network (DNN) acceleration. However, as the performance and parallelism of CIM processing elements (PEs) improve, the substantial latency and power overhead caused by transmitting high-density intermediate results has become a new bottleneck in CIM architectures. In this paper, we propose a spatial-designed CIM architecture based on the emerging Monolithic 3D (M3D) technology, and a spatiality-aware DNN mapping method for high-performance CIM systems. The proposed architecture introduces a novel hierarchy by implementing staggered tiers, enabling PEs to be shared by multiple tiles, and uses the ultra-dense and lower-power Inter-Layer Vias (ILVs) in M3D as shared buses, enabling CIM PEs to exploit the ultra-high bandwidth of M3D for inter-tile and intra-tile data transfer. Our experimental results show that the proposed M3D-enabled CIM architecture, combined with the proposed mapping method, achieves a 6.52x latency improvement, a 40.84x interconnection energy-delay product (EDP) improvement, and a 7.62x system-level EDP improvement compared to a state-of-the-art CIM architecture. |
| 10:48 | Minimal Design of SiDB Gates: An Optimal Basis for Circuits Based on Silicon Dangling Bonds PRESENTER: Jan Drewniok ABSTRACT. Silicon Dangling Bonds (SiDBs) present a promising computational technology that goes beyond traditional CMOS. It enables the creation of circuitry using single atoms as elementary components. Since current computational technologies are approaching their physical limits, SiDBs have attracted significant interest from both academia and industry. More precisely, single SiDBs allow for realizing Boolean functionality. They form gates which are then utilized as fundamental building blocks to realize arbitrary circuit logic. However, although fabrication capabilities are advancing rapidly and initial design automation methodologies have been proposed, the current design of these gates is primarily based on manual methods. This paper presents an approach capable of designing SiDB gates using the minimum number of SiDBs possible for a given Boolean function, thus minimizing gate cost. In addition to the guaranteed minimality, this makes it possible to design SiDB gates that require significantly fewer SiDBs than the gate designs currently used in the state of the art. This breakthrough simplifies SiDB circuit designs and their corresponding manufacturing processes significantly, thereby accelerating the progress of this promising technology. |
| 11:06 | Post-Layout Optimization for Field-Coupled Nanotechnologies PRESENTER: Simon Hofmann ABSTRACT. While conventional computing technologies reach their limits, the demand for computation power keeps growing, fueling the interest in post-CMOS technologies. One promising contestant in this domain is Field-coupled Nanocomputing (FCN), which conducts computations based on the repulsion of physical fields at the nanoscale. However, to realize a dedicated functionality in this technology, design methods are needed that create corresponding FCN layouts. While several methods for FCN layout generation have been proposed in the past, the underlying complexity requires them to resort to heuristic approaches, leading to results of sub-par quality and offering room for improvement. In conventional CMOS design, post-layout optimization methods are available to exploit this potential for further improvement. Unfortunately, no such methods exist yet for FCN. In this work, we address this gap and introduce the first post-layout optimization approach for FCN. Experimental evaluations show the benefits of the approach: applied to layouts generated by two complementary state-of-the-art methods, the proposed post-layout optimization allows for a further area reduction of 50.79 % and 20.00 % on average, respectively, confirming the potential of post-layout optimization for FCN. |
| 11:24 | Memristor-based Network Switching Architecture for Energy Efficient Cognitive Computational Models PRESENTER: Saad Saleh ABSTRACT. The Internet makes use of high-performance network switches in order to route network traffic from end users to servers. Despite their line-rate performance, current switches consume large amounts of energy and lack the ability to support more expressive learning models, like neuromorphic functions. The major reason is the use of transistors in the underlying Ternary Content Addressable Memory (TCAM), which is volatile and supports digital computations only. These shortcomings can be bypassed by developing network memories building on novel components, like memristors, due to their nonvolatile, nanoscale and analog storage/processing characteristics. In this paper, we propose the use of a novel memristor-based pCAMCogniGron memory which provides both digital (deterministic) and analog (probabilistic) outputs for supporting cognitive computational models in network switches. The traditional digital operations can still be supported by a memristor-based energy-efficient TCAM, called TCAmMCogniGron. Building on pCAMCogniGron and TCAmMCogniGron, we propose a novel network switching architecture and analyze its energy efficiency over the experimental dataset of a Nb-doped SrTiO3 memristive device. The results show that the proposed network switching architecture consumes only 0.01 fJ/bit/cell for analog compute operations, which is 50 times less than the transistor-based TCAM. |
| 11:42 | LUT-based RRAM Model for Neural Accelerator Circuit Simulation PRESENTER: Max Uhlmann ABSTRACT. Neural hardware accelerators have been proven to be energy-efficient when used to solve tasks which can be mapped into an artificial neural network (ANN) structure. Resistive random-access memories (RRAMs) are currently under investigation, together with several other memristive devices, as promising technologies to build such accelerators in combination with complementary metal-oxide semiconductor (CMOS) technologies in integrated circuits (ICs). While many research groups are actively developing sophisticated physics-based representations to better understand the underlying phenomena characterizing these devices, not much work has been dedicated to exploiting the trade-off between simulation time and accuracy in the definition of computationally undemanding models suitable for use at many abstraction layers. Indeed, the design of complex mixed-signal systems such as neural hardware accelerators requires frequent interaction between the application level and the circuit level, which can be enabled only with the support of accurate and fast-simulating device models. In this work, we propose a solution to fill the aforementioned gap with a lookup table (LUT)-based Verilog-A model of IHP's 1-transistor-1-RRAM (1T1R) cell. In addition, the implementation challenges of conveying the communication between the abstract ANN simulation and the circuit analysis are tackled with a design flow for resistive neural hardware accelerators that features a custom Python wrapper. As a demonstration of the proposed design flow and 1T1R model, an ANN for the MNIST handwritten digit recognition task is assessed with the last layer verified in circuit simulation. The obtained recognition confidence intervals show a considerable discrepancy between the purely application-level PyTorch simulation and the proposed design flow, which spans the abstraction layers down to the circuit analysis. |
| 12:00 | Resilience and Precision Assessment of Natural Language Processing Algorithms in Analog In-Memory Computing: A Hardware-Aware Study PRESENTER: Amirhossein Parvaresh ABSTRACT. Natural Language Processing (NLP) serves as a cornerstone technology, facilitating complex human-computer interactions, enabling information retrieval, conducting sentiment analysis, and enhancing language comprehension. With the ever-growing use of NLP, the conventional von Neumann computing paradigm is rapidly approaching its inherent limitations. In response, Analog In-Memory Computing (AIMC) emerges as a compelling alternative, albeit accompanied by inherent non-idealities when deploying neural networks on such platforms. In this paper, we evaluate the precision and resilience of various NLP algorithms when executed within the AIMC framework, both with and without the application of hardware-aware training. Our analysis reveals noteworthy insights: Gated Recurrent Unit (GRU) neural networks exhibit enhanced resilience to noise, yielding an average test error of 3.97% following hardware-aware training, as compared to their full precision counterparts. Conversely, Long Short-Term Memory (LSTM) networks demonstrate a slightly higher average test error of 5.67%, indicating a relatively lower tolerance to non-idealities. In contrast, Convolutional Neural Networks (CNNs) manifest a heightened vulnerability, exhibiting an average relative test error of 13.34%. Furthermore, we systematically investigate the sensitivity profiles of the selected neural networks in the presence of specific non-idealities, providing valuable insights into their robustness and susceptibility within the AIMC environment. |
| 12:18 | VLCP: A High-Performance FPGA-based CNN Accelerator with Vector-level Cluster Pruning PRESENTER: Bi Wu ABSTRACT. Convolutional neural networks (CNNs) are widely used in computer vision, natural language processing, and other application scenarios. But deploying CNNs at the edge is challenging due to their large number of parameters. Pruning is a solution that can effectively reduce the number of parameters and off-chip memory accesses. However, high-sparsity unstructured pruning is not hardware-friendly, while structured pruning has low compression efficiency. As a result, vector-level pruning, with a coarser granularity, is a promising alternative that balances pruning performance and hardware-friendliness. In this paper, a hardware-oriented vector-level pruning strategy is proposed based on the CNN vector distribution properties. By expanding the dynamic range of vector groups, more important weights can be preserved without sacrificing accuracy. When applied to the VGG-16 and ResNet-18 models on the ImageNet dataset, the proposed strategy achieves 10.93X and 10.17X compression ratios in convolutional layers with a 66% reduction in computation and an acceptable drop in top-1 accuracy. Furthermore, the proposed pruning scheme achieves a remarkable performance of 188 FPS on the VCU118 evaluation board, demonstrating its compatibility with hardware. Compared to the state of the art, the proposed strategy achieves a 69% performance improvement and up to 2.8X higher LUT efficiency. |