NANOARCH 2023: 18TH ACM INTERNATIONAL SYMPOSIUM ON NANOSCALE ARCHITECTURES
PROGRAM FOR MONDAY, DECEMBER 18TH

09:00-10:15 Session 1: Keynote Speaker: Subhasish Mitra
Location: Dülfer Hall
09:00
The Future of Hardware Technologies for Computing

ABSTRACT. The computation demands of 21st-century abundant-data workloads, such as AI/machine learning, far exceed the capabilities of today’s computing systems. For example, a Dream AI Chip would ideally co-locate all memory and compute on a single chip, quickly accessible at low energy. Such Dream Chips aren’t realizable today. Computing systems instead use large off-chip memory and spend enormous time and energy shuttling data back and forth. This memory wall gets worse with growing problem sizes, especially as conventional transistor miniaturization gets increasingly difficult. The next leap in computing requires transformative NanoSystems by exploiting the unique characteristics of nanotechnologies and abundant-data workloads. We create new chip architectures through ultra-dense 3D integration of logic and memory – the N3XT 3D approach. Multiple N3XT 3D chips are integrated through a continuum of chip stacking/interposer/wafer-level integration — the N3XT 3D MOSAIC. To scale with growing problem sizes, new Illusion systems orchestrate workload execution on N3XT 3D MOSAIC creating an illusion of a Dream Chip with near-Dream energy and throughput. Several hardware prototypes, built in commercial and research fabrication facilities, demonstrate the effectiveness of our approach. We target 1,000X system-level energy-delay-product benefits, especially for abundant-data workloads. We also address new ways of ensuring robust system operation despite growing challenges of design bugs, manufacturing defects, reliability failures, and security attacks.

10:15-10:30 Coffee Break
10:30-12:36 Session 2: Advanced Computing Architectures and Systems
Location: Dülfer Hall
10:30
A Spatial-Designed Computing-In-Memory Architecture Based on Monolithic 3D Integration for High-Performance Systems
PRESENTER: Jiaming Li

ABSTRACT. Computing-in-memory (CIM) technology effectively addresses the data-movement bottleneck of the traditional von Neumann architecture, especially for deep neural network (DNN) acceleration. However, with the improving performance and parallelism of CIM processing elements (PEs), the substantial latency and power overhead caused by high-density transmission of intermediate results has become a new bottleneck in CIM architectures. In this paper, we propose a spatial-designed CIM architecture based on the emerging Monolithic 3D (M3D) technology, and a spatiality-aware DNN mapping method for high-performance CIM systems. The proposed architecture introduces a novel hierarchy by implementing staggered tiers, enabling PEs to be shared by multiple tiles, and uses the ultra-dense and lower-power Inter-Layer Vias (ILVs) in M3D as shared buses, enabling CIM PEs to exploit the ultra-high bandwidth of M3D for inter-tile and intra-tile data transfer. Our experimental results show that the proposed M3D-enabled CIM architecture, combined with the proposed mapping method, achieves a 6.52x latency improvement, a 40.84x interconnection energy-delay product (EDP) improvement, and a 7.62x system-level EDP improvement compared to a state-of-the-art CIM architecture.

10:48
Minimal Design of SiDB Gates: An Optimal Basis for Circuits Based on Silicon Dangling Bonds
PRESENTER: Jan Drewniok

ABSTRACT. Silicon Dangling Bonds (SiDBs) present a promising computational technology that goes beyond traditional CMOS. It enables the creation of circuitry using single atoms as elementary components. Since current computational technologies are approaching their physical limits, SiDBs have attracted significant interest from both academia and industry. More precisely, single SiDBs allow for realizing Boolean functionality. They form gates which are then utilized as fundamental building blocks to realize arbitrary circuit logic. However, although fabrication capabilities are advancing rapidly and initial design automation methodologies have been proposed, the current design of these gates is primarily based on manual methods. This paper presents an approach capable of designing SiDB gates using the minimum number of SiDBs possible for a given Boolean function, thus minimizing gate cost. In addition to the guaranteed minimality, this makes it possible to design SiDB gates that require significantly fewer SiDBs than the gate designs currently used in the state of the art. This breakthrough simplifies SiDB circuit designs and their corresponding manufacturing processes significantly, thereby accelerating the progress of this promising technology.

11:06
Post-Layout Optimization for Field-Coupled Nanotechnologies
PRESENTER: Simon Hofmann

ABSTRACT. While conventional computing technologies reach their limits, the demand for computation power keeps growing, fueling the interest in post-CMOS technologies. One promising contestant in this domain is Field-coupled Nanocomputing (FCN), which conducts computations based on the repulsion of physical fields at the nanoscale. However, to realize a dedicated functionality in this technology, design methods are needed that create corresponding FCN layouts. While several methods for FCN layout generation have been proposed in the past, the underlying complexity requires them to resort to heuristic approaches, leading to results of sub-par quality and offering room for improvement. In conventional CMOS design, post-layout optimization methods are available to exploit this potential for further improvement. Unfortunately, no such methods exist yet for FCN. In this work, we address this gap and introduce the first post-layout optimization approach for FCN. Experimental evaluations show the benefits of the approach: applied to layouts generated by two complementary state-of-the-art methods, the proposed post-layout optimization allows for a further area reduction of 50.79% and 20.00% on average, respectively, confirming the potential of post-layout optimization for FCN.

11:24
Memristor-based Network Switching Architecture for Energy Efficient Cognitive Computational Models
PRESENTER: Saad Saleh

ABSTRACT. The Internet makes use of high-performance network switches in order to route network traffic from end users to servers. Despite line-rate performance, current switches consume large amounts of energy and lack the ability to support more expressive learning models, like neuromorphic functions. The major reason is the use of transistors in the underlying Ternary Content Addressable Memory (TCAM), which is volatile and supports digital computations only. These shortcomings can be bypassed by developing network memories building on novel components, like memristors, due to their nonvolatile, nanoscale, and analog storage/processing characteristics. In this paper, we propose the use of a novel memristor-based pCAMCogniGron memory which provides both digital (deterministic) and analog (probabilistic) outputs for supporting cognitive computational models in network switches. The traditional digital operations can still be supported by a memristor-based energy-efficient TCAM, called TCAmMCogniGron. Building on pCAMCogniGron and TCAmMCogniGron, we propose a novel network switching architecture and analyze its energy efficiency over the experimental dataset of a Nb-doped SrTiO3 memristive device. The results show that the proposed network switching architecture consumes only 0.01 fJ/bit/cell for analog compute operations, which is 50 times less than the transistor-based TCAM.

11:42
LUT-based RRAM Model for Neural Accelerator Circuit Simulation
PRESENTER: Max Uhlmann

ABSTRACT. Neural hardware accelerators have been proven to be energy-efficient when used to solve tasks which can be mapped into an artificial neural network (ANN) structure. Resistive random-access memories (RRAMs) are currently under investigation, together with several other memristive devices, as promising technologies to build such accelerators in combination with complementary metal-oxide semiconductor (CMOS) technologies in integrated circuits (ICs). While many research groups are actively developing sophisticated physics-based representations to better understand the underlying phenomena characterizing these devices, not much work has been dedicated to exploiting the trade-off between simulation time and accuracy in the definition of computationally inexpensive models suitable for use at many abstraction layers. Indeed, the design of complex mixed-signal systems such as neural hardware accelerators requires frequent interaction between the application level and the circuit level, which can be enabled only with the support of accurate and fast-simulating device models. In this work, we propose a solution to fill the aforementioned gap with a lookup table (LUT)-based Verilog-A model of IHP's 1-transistor-1-RRAM (1T1R) cell. In addition, the implementation challenges of conveying the communication between the abstract ANN simulation and the circuit analysis are tackled with a design flow for resistive neural hardware accelerators that features a custom Python wrapper. As a demonstration of the proposed design flow and 1T1R model, an ANN for the MNIST handwritten digit recognition task is assessed with the last layer verified in circuit simulation. The obtained recognition confidence intervals show a considerable discrepancy between the purely application-level PyTorch simulation and the proposed design flow, which spans the abstraction layers down to the circuit analysis.

12:00
Resilience and Precision Assessment of Natural Language Processing Algorithms in Analog In-Memory Computing: A Hardware-Aware Study

ABSTRACT. Natural Language Processing (NLP) serves as a cornerstone technology, facilitating complex human-computer interactions, enabling information retrieval, conducting sentiment analysis, and enhancing language comprehension. With the ever-growing use of NLP, the conventional von Neumann computing paradigm is rapidly approaching its inherent limitations. In response, Analog In-Memory Computing (AIMC) emerges as a compelling alternative, albeit accompanied by inherent non-idealities when deploying neural networks on such platforms. In this paper, we evaluate the precision and resilience of various NLP algorithms when executed within the AIMC framework, both with and without the application of hardware-aware training. Our analysis reveals noteworthy insights: Gated Recurrent Unit (GRU) neural networks exhibit enhanced resilience to noise, yielding an average test error of 3.97% following hardware-aware training, as compared to their full-precision counterparts. Conversely, Long Short-Term Memory (LSTM) networks demonstrate a slightly higher average test error of 5.67%, indicating a relatively lower tolerance to non-idealities. In contrast, Convolutional Neural Networks (CNNs) manifest a heightened vulnerability, exhibiting an average relative test error of 13.34%. Furthermore, we systematically investigate the sensitivity profiles of the selected neural networks in the presence of specific non-idealities, providing valuable insights into their robustness and susceptibility within the AIMC environment.

12:18
VLCP: A High-Performance FPGA-based CNN Accelerator with Vector-level Cluster Pruning
PRESENTER: Bi Wu

ABSTRACT. Convolutional neural networks (CNNs) are widely used in computer vision, natural language processing, and other application scenarios. But deploying CNNs at the edge is challenging due to their large number of parameters. Pruning is a solution that can effectively reduce the number of parameters and off-chip memory accesses. However, high sparsity unstructured pruning is not hardware-friendly, while structured pruning has low compression efficiency. As a result, vector-level pruning, with a coarser granularity, is a promising alternative that balances pruning performance and hardware-friendliness. In this paper, a hardware-oriented vector-level pruning strategy is proposed based on the CNN vector distribution properties. By expanding the dynamic range of vector groups, more important weights can be preserved without sacrificing accuracy. When applied to the VGG-16 and ResNet-18 models on the ImageNet dataset, the proposed strategy achieved 10.93X and 10.17X compression ratios in convolutional layers with a 66% reduction in computation and an acceptable drop in top-1 accuracy. Furthermore, the proposed pruning scheme achieves a remarkable performance of 188 FPS on the VCU118 evaluation board, demonstrating its compatibility with hardware. Compared to the state-of-the-art, the proposed strategy reaches 69% performance improvement and up to 2.8X higher LUT efficiency.

12:36-13:30 Lunch (incl. Group picture)
13:30-14:15 Session 3: Invited Speaker: Hussam Amrouch
Location: Dülfer Hall
13:30
In-Memory Computing using Ferroelectric Transistors: Lessons Learnt and Future Trends

ABSTRACT. In the burgeoning realm of artificial intelligence (AI), the pursuit of In-Memory Computing (IMC) is paramount. This relentless pursuit, aimed at catalyzing ultra-fast and energy-efficient AI computations, is emblematic of the cutting-edge innovations at the nexus of Ferroelectric FET (FeFET) technology. In this talk, we will showcase the latest advancements in FeFETs, spanning from traditional IMC-based hardware accelerators to monolithic 3D integration using advanced back-end-of-line (BEOL) thin-film transistors. We will elucidate the inherent challenges posed by ferroelectric stochasticity along with temperature effects, and demonstrate innovative strategies, such as using thermoelectric devices for advanced on-chip cooling, to mitigate their adverse impacts, paving the way for reliable computing using FeFET-based IMC.

14:15-14:45 Coffee Break
16:15-16:30 Coffee Break