A Self-Timing Voltage-Mode Sense Amplifier for STT-MRAM Sensing Yield Improvement
ABSTRACT. STT-MRAM (Spin-Transfer Torque Magnetic Random Access Memory) is a promising candidate for many IoT and wearable devices thanks to its fast write speed, negligible leakage current, and high endurance. Nevertheless, the voltage-mode read scheme of STT-MRAM still faces challenges. First, the bit line (BL) voltage (VBL) drops from the pre-charged read voltage (VBL_RD) to 0V quite quickly, which results in a small effective sensing window (TSMW), i.e., the period during which the sense amplifier's (SA) input voltage is larger than its offset voltage (VOS). Second, the time at which the read-signal margin (VRSM) reaches its maximum differs from cell to cell due to resistance variation. Consequently, the valid TSMW is quite small, and the limited TSMW leads to a decrease of VRSM and an increase of the BL development time. When a conventional voltage-mode sense amplifier (VSA) is activated with a common timing, the signal to be amplified degrades at the VSA's differential inputs, leading to sensing failures at a low VBL_RD. This paper proposes a self-timed sensing unit that dynamically tracks the change of the BL voltage: a pair of SAs in the unit are reconfigured into opposite offset states to monitor each other's sensing results, and the sensing operation stops immediately once the sensing result is correct. Simulation results show that the proposed architecture extends TSMW significantly and improves the sensing yield effectively.
A Novel Memristor-Reusable Mapping Methodology of In-memory Logic Implementation for High Area-Efficiency
ABSTRACT. The non-volatile resistive memory (memristor) is a promising device candidate for both memory and logic. True in-memory logic computation can be realized by the stateful Memristor-Aided loGIC (MAGIC) design within a memristor crossbar. The existing mapping methods targeting MAGIC gates, however, result in a large crossbar array size with highly imbalanced row and column numbers. To increase the area efficiency of logic-in-memory synthesis, a new memristor-reusable mapping methodology is proposed in this paper. With the proposed mapping algorithm, the memristors storing intermediate results are conditionally reused for multiple logic gates to guarantee full utilization of the crossbar while maintaining a low operation-latency overhead. A layout-friendly mapping strategy for logic gates is also developed to yield an almost-square layout. Based on the experimental results on ISCAS-85 benchmarks, the crossbar layout area is reduced by up to 95.62% on average with the proposed mapping method compared to previously published mapping methods. Furthermore, the proposed method also reduces the product of layout area and operation latency by up to 94.03% on average compared with previous mapping algorithms.
Process Variation-Resilient STT-MTJ based TRNG using Linear Correcting Codes
ABSTRACT. With the increasing use of artificial intelligence (AI) in attacks, the requirement for high-quality security systems becomes urgent. The true random number generator (TRNG) is the core block of many cryptographic systems, whose security is determined by the randomness source of the TRNG. In this paper, the stochastic switching behavior of the spin transfer torque magnetic tunnel junction (STT-MTJ) device is exploited for the generation of random numbers. Stochastic switching of the STT-MTJ provides an excellent physical randomness source. However, due to technology limitations and correlations between different process steps, process variation has a significant impact on the randomness of MTJ-based TRNGs. Therefore, a post-processing-based control mechanism is also necessary to guarantee reliable randomness. A linear corrector is integrated into the STT-MTJ-based TRNG, resulting in improved entropy of the random output. The design is implemented in a 40nm CMOS technology with a compact model of the MTJ. Using the output random bitstreams with and without process variations, the efficiency of the linear corrector is demonstrated by passing the National Institute of Standards and Technology (NIST) statistical test suite.
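As a concrete illustration of this kind of post-processing, the sketch below applies the simplest linear corrector, an XOR over non-overlapping bit pairs, to a biased bitstream; an input bias of e shrinks to roughly 2e² at the output. The bias value, seed, and software bit model are illustrative assumptions, not the paper's MTJ entropy source or its specific corrector.

```python
import random

def biased_bits(n, p_one=0.6, seed=42):
    """Toy entropy source: an MTJ whose switching probability has
    drifted from 0.5 due to process variation (assumed value)."""
    rng = random.Random(seed)
    return [1 if rng.random() < p_one else 0 for _ in range(n)]

def xor_corrector(bits):
    """Rate-1/2 linear corrector: XOR non-overlapping bit pairs."""
    return [bits[i] ^ bits[i + 1] for i in range(0, len(bits) - 1, 2)]

def bias(bits):
    """Absolute deviation of the ones-fraction from 0.5."""
    return abs(sum(bits) / len(bits) - 0.5)

raw = biased_bits(100_000)
corrected = xor_corrector(raw)
# With p_one = 0.6 the raw bias is ~0.1; after the XOR it drops to
# roughly 2 * 0.1**2 = 0.02, at the cost of halving the throughput.
```

Stronger correctors derived from good linear codes trade a lower output rate for a faster reduction of bias, which is why the corrector choice is coupled to the severity of the process variation.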
Spintronic Memories: From Memory to Computing-in-Memory
ABSTRACT. Spintronic memory has been considered one of the most promising nonvolatile memory candidates to address the leakage power consumption in the post-Moore's era. To date, the spintronic magnetic random access memory (MRAM) family has evolved through four technology generations, from toggle-MRAM (product in 2006), to STT-MRAM (product in 2012), to SOT-MRAM (intensive R&D today), and to VCMA-MRAM (intensive R&D today). In addition, another spintronic memory, the racetrack memory (RM) proposed in 2008, has also evolved through two generations, from domain wall (DW) based RM to skyrmion-based RM. On the other hand, from the architectural perspective, data transfer bandwidth and the related power consumption have become the most critical bottleneck of the von Neumann computing architecture, owing to the separation of the processor and the memory units and the performance mismatch between the two. Unifying computing and memory in the same place has opened up a promising research direction of computing-in-memory (CIM). Spintronic memory could be a promising technology to implement the CIM paradigm, owing to its intrinsic processing capability. This field has attracted considerable interest, and a number of attempts have been made within both MRAM and RM. In this paper, we perform a mini review on the R&D evolution of spintronic memories: from memory to computing-in-memory. In particular, we introduce our recent work on advanced spintronic memories as well as CIM paradigms implemented within spintronic memories.
REAL: Logic and Arithmetic Operations Embedded in RRAM for General-Purpose Computing
ABSTRACT. As the need for data-intensive applications grows, it is challenging to scale the von Neumann architecture to meet this need, due to its unavoidable data movement between processors and memories. To address this challenge, computation-in-memory architectures implemented with either conventional or emerging memory technologies have been proposed. However, such architectures typically support a narrow range of operations; for instance, some of them only support logical operations such as AND and OR. Hence, they are hard to apply to a wide spectrum of applications (e.g., some need arithmetic operations). Furthermore, some operations need additional CMOS logic or significant modifications of standard memory array structures. This paper proposes a comprehensive set of logical and arithmetic operations that are primitive operations of today's computer instruction sets and can be embedded into RRAM; they include seven types of bitwise logical operations and three arithmetic operations. In addition, a new multiplication algorithm is proposed to embed multiplication in RRAM. As our evaluation shows, our approach reduces the delay by up to 46% using the same or fewer memristor devices compared to the state-of-the-art.
Dynamic Adaptation of Approximate Bit-width for CNNs based on Quantitative Error Resilience
ABSTRACT. As an emerging paradigm for energy-efficient design, approximate computing can reduce power consumption through simplification of logic circuits. Although approximate computing introduces calculation errors, their impact on the final results can be negligible in error-resilient applications such as Convolutional Neural Networks (CNNs). Therefore, approximate computing has been applied to CNNs to reduce their high demand for computing resources and energy. Going beyond traditional methods such as reducing data precision, this paper investigates the effect of approximate computing on the accuracy and power consumption of CNNs. To optimize approximate computing as applied to CNNs, we propose a method for quantifying the error resilience of each neuron through theoretical analysis and observe that error resilience varies widely across neurons. On the basis of this quantitative error resilience, a dynamic adaptation of the approximate bit-width and a corresponding configurable adder are proposed to fully exploit the error resilience of CNNs. Experimental results show that the proposed method further reduces power consumption while maintaining high accuracy. By adopting the optimal approximate bit-width for each layer found by our proposed algorithm, the dynamic adaptation of the approximate bit-width reduces power consumption by more than 30% with less than 1% accuracy loss on LeNet-5.
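A minimal sketch of the adaptation idea, assuming a truncation-style configurable adder and a hypothetical linear mapping from a neuron's resilience score to a truncation width (the paper's actual adder circuit and quantification method may differ):

```python
def approx_add(a, b, k):
    """Configurable approximate adder: drop the k least-significant
    bits of both operands before adding (truncation model)."""
    mask = ~((1 << k) - 1)
    return (a & mask) + (b & mask)

def bitwidth_for(resilience, k_max=8):
    """Hypothetical policy: more resilient neurons tolerate wider
    truncation, so they are assigned a larger approximate bit-width k."""
    return round(max(0.0, min(1.0, resilience)) * k_max)

# k = 0 reproduces the exact sum; larger k saves adder energy at the
# cost of a bounded error smaller than 2**(k+1).
```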
Detecting and Bypassing Trivial Computations in Convolutional Neural Networks
ABSTRACT. Convolutional neural networks (CNNs) have recently been able to exceed human accuracy in various application domains such as image recognition, medical diagnosis, and financial analysis.
However, the high computational complexity of CNNs incurs high energy consumption on current hardware implementations.
Existing solutions such as pruning and quantization typically require retraining or fine-tuning to regain accuracy, which can be cost-prohibitive and time-consuming.
This paper proposes a retraining-free approach to reducing the computation workload of CNNs during inference by detecting and bypassing the trivial computations.
We define trivial computations as computations whose results can be determined without actually performing them. Examples include multiplication by 0 or ±1, addition with 0, and addition of opposite numbers.
Correspondingly, we develop bypass circuits to detect trivial computations. Once such a computation is detected, the circuit delivers the pre-determined result without performing the actual computation. Experimental results on MNIST and EMNIST show that CNNs with bypass circuits achieve 30.66-33.52% energy savings without any accuracy loss. This technique can be used together with existing techniques such as pruning and quantization because it is fully complementary to them.
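The detection logic can be sketched in software as follows; the software dot-product loop is an illustrative stand-in for the paper's hardware bypass circuits:

```python
def trivial_product(w, x):
    """Return (True, result) if w*x is trivial (an operand is 0 or ±1),
    so the multiplier can be bypassed; otherwise (False, None)."""
    if w == 0 or x == 0:
        return True, 0
    if w == 1:
        return True, x
    if w == -1:
        return True, -x
    if x == 1:
        return True, w
    if x == -1:
        return True, -w
    return False, None

def dot_with_bypass(ws, xs):
    """Dot product that counts how many real multiplies were needed."""
    acc, multiplies = 0, 0
    for w, x in zip(ws, xs):
        trivial, p = trivial_product(w, x)
        if not trivial:
            p = w * x
            multiplies += 1
        if p != 0:            # addition-with-0 bypass
            acc += p
    return acc, multiplies
```

For example, `dot_with_bypass([0, 1, -1, 2, 3], [5, 7, 4, 0, 2])` produces the exact result while performing only one real multiplication out of five.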
An Energy-Efficient Architecture for Accelerating Inference of Memory-Augmented Neural Networks
ABSTRACT. Although recurrent neural networks (RNNs) have shown excellent performance in sequence-related applications such as speech recognition and image captioning, they fall short in cognitive-reasoning domains such as question answering and algorithm learning due to their limited memory capacity. To address this issue, memory-augmented neural networks (MANNs) have been proposed, which achieve excellent reasoning ability in cognitive applications by coupling neural networks (mostly RNNs) to an external memory that can be written and read. However, MANNs require numerous operations and memory accesses to interact with the external memory, which hinders their deployment on low-power devices. In this work, we propose an algorithm-hardware co-designed full-stack approach to accelerate the inference of MANNs. First, we propose an operator scheduling mechanism to optimize the calculation process of MANNs for high computation parallelism and inference efficiency. Second, we propose a tri-mode softmax computing scheme to reduce calculation overheads for MANNs with different accuracy and latency requirements. Finally, a reconfigurable architecture is designed to efficiently implement each operator in MANNs for high inference speed and energy efficiency. Tested on the bAbI dataset, the proposed optimizations and architecture achieve a 1.28X improvement in energy efficiency for MANNs compared with a GPU implementation.
Implementing Binarized Neural Networks with Magnetoresistive RAM without Error Correction
ABSTRACT. One of the most exciting applications of Spin Torque Magnetoresistive Random Access Memory (ST-MRAM) is the in-memory implementation of deep neural networks, which could improve the energy efficiency of Artificial Intelligence by orders of magnitude compared with its implementation on computers and graphics cards. A particularly stimulating vision is using ST-MRAM for implementing Binarized Neural Networks (BNNs), a class of deep neural networks introduced in 2016, which can achieve state-of-the-art performance with a highly reduced memory footprint compared with conventional artificial intelligence approaches. The challenge of ST-MRAM, however, is that it is prone to write errors and usually requires the use of error correction. In this work, we show through simulations of networks on image recognition tasks (MNIST, CIFAR-10 and ImageNet) that BNNs can tolerate these bit errors to an outstanding level. If a standard BNN is used, a bit error rate of up to 0.1% can easily be tolerated with little impact on recognition performance. The requirements for ST-MRAM are therefore much less stringent for BNNs than for more traditional applications. Consequently, we show that for BNNs, ST-MRAMs can be programmed with weak (low-energy) programming conditions, without error-correcting codes. We show that this result allows the use of low-energy and low-area ST-MRAM cells, and that the energy savings at the system level can reach a factor of two.
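The tolerance argument can be reproduced in miniature: flipping a small fraction of ±1 weights rarely changes the sign of a wide binarized sum. The sketch below is a toy model with assumed sizes and error rates, not the paper's simulation setup:

```python
import random

def flip_weights(weights, ber, seed=0):
    """Flip each stored ±1 weight with probability `ber`,
    modelling ST-MRAM write errors (toy model)."""
    rng = random.Random(seed)
    return [-w if rng.random() < ber else w for w in weights]

def binarized_neuron(weights, inputs):
    """BNN neuron: sign of the ±1 dot product."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s >= 0 else -1

# A 1000-input neuron with a 0.1% bit error rate: a handful of flips
# shifts the pre-activation sum by a few units out of ~1000, so the
# sign (the only thing a BNN keeps) is almost never affected.
weights = [1] * 1000
inputs = [1] * 1000
noisy = flip_weights(weights, ber=0.001)
```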
An Energy-Efficient In-Memory BNN Architecture With Time-Domain Analog and Digital Mixed-Signal Processing
ABSTRACT. Neural networks (NNs) have been widely used in various applications. However, the high computational complexity and energy consumption of NNs impede their deployment on embedded and mobile platforms, where computational resources and energy are limited. Most existing accelerators are based on the traditional von Neumann (VN) architecture, which separates computing from storage, and the massive data movement between the processor and off-chip memory causes great power consumption. In-memory computing (IMC) architectures integrate computing and storage to eliminate explicit memory accesses, reducing energy-hungry data transmission. On the other hand, a binary neural network (BNN) restricts weights and activations to either -1 or +1, converting a large number of multiplications into simple bit-wise XNOR logic operations, which significantly reduces the computational complexity and the amount of memory access. We propose an energy-efficient in-memory BNN architecture that performs the matrix-vector multiplication (M×V) of FC layers. The bit-wise XNOR operation is realized on the bit lines of the SRAM, making data fetching part of the computation. The accumulation and binarization are performed in the time domain with analog and digital mixed-signal processing, which can be blended with the SRAM to minimize the data-transmission overhead. At the TT corner, 0.5 V, 25 °C and 50 MHz, the energy efficiency is 96.249 TOPS/W, surpassing a conventional digital implementation with similar performance by 2.09×.
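The bit-line XNOR described above corresponds to the standard BNN identity dot(w, x) = 2·popcount(XNOR(w, x)) − n when ±1 values are packed as single bits (1 ↔ +1, 0 ↔ −1). A software sketch of that identity, independent of the SRAM and time-domain circuitry:

```python
def xnor_popcount_dot(w_bits, x_bits, n):
    """±1 dot product of two n-bit packed vectors: each matching
    bit contributes +1, each mismatching bit contributes -1."""
    mask = (1 << n) - 1
    matches = bin(~(w_bits ^ x_bits) & mask).count("1")  # XNOR + popcount
    return 2 * matches - n

# Example: w = (+1,-1,+1,-1) packs to 0b1010 and x = (+1,-1,-1,+1) to
# 0b1001; two matches and two mismatches give a dot product of 0.
```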
Ring-Shaped Content Addressable Memory Based On Spin Orbit Torque Driven Chiral Domain Wall Motions
ABSTRACT. Content addressable memory (CAM) integrated with non-volatile racetrack memory (RM) can perform the lookup function at very high speed. However, the bi-directional motion of domain walls (DWs) in stripe-shaped RM-based CAM causes a data-overflow issue, resulting in larger physical size and lower performance. In this paper, a ring-shaped CAM based on spin orbit torque (SOT) driven chiral DW motion has been designed and simulated. Thanks to the circuit simplification and the high efficiency of chiral DW motion, its area, energy consumption and operating speed are greatly improved.
ResNet Can Be Pruned 60x: Introducing Network Purification and Unused Path Removal (P-RM) after Weight Pruning
ABSTRACT. State-of-the-art DNN structures involve high computation and great demand for memory storage, which pose an intensive challenge to DNN framework resources. To mitigate these challenges, weight pruning techniques have been studied. However, a high-accuracy solution for extreme structured pruning that combines different types of structured sparsity remains elusive due to the extremely reduced number of weights in the pruned networks. In this paper, we propose a DNN framework that combines two different types of structured weight pruning (filter and column pruning) by incorporating the alternating direction method of multipliers (ADMM) algorithm for better pruning performance. We are the first to identify the non-optimality of the ADMM process and the unused weights in a structured-pruned model, and we further design an optimization framework containing the newly proposed Network Purification and Unused Path Removal algorithms, which are dedicated to post-processing a structured-pruned model after the ADMM steps. Highlights include 232x compression on LeNet-5, 60x compression on ResNet-18 on CIFAR-10, and over 5x compression on AlexNet. We share our models at the anonymous link http://bit.ly/2VJ5ktv
A Logic Simplification Approach for Very Large Scale Crosstalk Circuit Designs
ABSTRACT. Crosstalk computing, involving engineered interference between nanoscale metal lines, offers a fresh perspective on scaling through co-existence with CMOS. Through capacitive manipulations and an innovative circuit style, not only primitive gates but also custom logic cells such as an adder or subtractor can be implemented with large gains. Our simulations show over 5x density and 2x power benefits over custom CMOS designs at 16nm [1]. This paper introduces the Crosstalk circuit style and a key method for large-scale circuit synthesis utilizing the existing EDA tool flow. We propose to manipulate the CMOS synthesis flow by adding two extra steps: conversion of the gate-level netlist to a Crosstalk-friendly netlist through logic simplification and Crosstalk gate mapping, and the inclusion of custom cell libraries for automated placement and layout. Our logic simplification approach first converts the Cadence-generated structured netlist to Boolean expressions and then uses the synthesis tool (SIS) to obtain majority functions, which are further used to simplify functions for Crosstalk-friendly implementations. We compare our logic simplification approach to CMOS and majority-logic-based approaches. Crosstalk circuits share some similarities with the majority synthesis typically applied to Quantum Cellular Automata technology. However, our investigation shows that most benefits are achieved by closely following Crosstalk's core circuit styles. In the best case, our approach shows a 36% density improvement over majority synthesis for MCNC benchmarks.
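For reference, the three-input majority primitive targeted by the SIS step above, with AND and OR recovered by pinning one input to a constant (a textbook identity, not code from the paper's flow):

```python
def maj(a, b, c):
    """Three-input majority: 1 iff at least two inputs are 1."""
    return (a & b) | (b & c) | (a & c)

def and2(a, b):
    return maj(a, b, 0)  # pinning an input to 0 degenerates MAJ to AND

def or2(a, b):
    return maj(a, b, 1)  # pinning an input to 1 degenerates MAJ to OR
```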
Effect of Lattice Defects on the Transport Properties of Graphene Nanoribbon
ABSTRACT. Graphene nanoribbons are among the most promising graphene structures for electronic applications. Here, we present our calculated results on the impact of lattice defects on the transport properties of these structures. Preliminary results indicate that the maximum conductance is reduced significantly and that conductance quantization is lost even for a small number of defects.
ABSTRACT. In this article we present RRAM single cells based on MIS devices utilizing a thin LPCVD silicon nitride layer as the resistive switching material. The thin SiN layer was modified by plasma in order to improve the switching characteristics and the overall performance of the memory cell. Extensive material and electronic-device characterization is presented.
High-Speed and Reliable Sensing Scheme with Three Voltages for STT-MRAM
ABSTRACT. This paper proposes a novel voltage-sensing scheme to improve reliability and reduce read delay for spin transfer torque magnetic random access memory (STT-MRAM). The scheme utilizes two reference voltages to form a voltage difference with the read voltage. The states of the magnetic tunnel junction (MTJ) cells are then accurately read by a three-input voltage sense amplifier. Simulation results using a 28nm Complementary Metal Oxide Semiconductor (CMOS) technology show that the read access time can be reduced to 0.7ns with a 100mV voltage difference at a 1V supply voltage. The read access time is reduced by 83.5% and 53.3% compared with the time-based sensing (TBS) scheme and the strong positive feedback (SPF) scheme, respectively. In addition, high reliability can be achieved because the sensing margin increases as the bit lines discharge over time.
A compact model of stochastic switching in STT magnetic RAM for memory and computing
ABSTRACT. Spin-transfer torque random access memory (STT-RAM) is gaining momentum as a promising technology for high-density and embedded nonvolatile memory. To enable the design of STT-RAM circuits for memory and computing, accurate compact models capable of predicting the stochastic behavior are needed. Here, we present a detailed model accounting for the anomalous thermal regime of switching, which deviates from the thermal model below 200 ns. The anomalous switching is explained by the non-linear barrier lowering of the perpendicular magnetic anisotropy (PMA). The model is validated against write error rate (WER) data and applied to the design of random number generator (RNG) and stochastic computing primitives with STT-RAM.
Comprehensive Pulse Shape Induced Failure Analysis in Voltage-Controlled MRAM
ABSTRACT. Voltage-controlled (VC) magnetic random access memory (MRAM) is considered an alternative magnetic free-layer switching mechanism, owing to its reduced bit-cell size and write energy compared to spin-transfer-torque (STT)-MRAM. Unlike current-driven mechanisms, a voltage pulse is applied to modulate the magnetic anisotropy of the magnetic tunnel junction (MTJ). This paper investigates reliability-aware VC-MTJ switching in the 1T-1M bit-cell structure, mainly concerning write failures induced by pulse-shape uncertainties and amplitude fluctuations. We propose a write-after-read scheme combined with a reversed bias to alleviate unsuccessful switching of the MTJ, which achieves a 52%-68.7% failure-rate reduction in parallel/antiparallel state changes of the MTJ free layer. The design trade-offs are additional control units, 1.71x write latency, and 3.27x energy consumption, evaluated in a 28nm CMOS process.
Low-Power, High-Speed and High-Density Magnetic Non-Volatile SRAM Design with Voltage-Gated Spin-Orbit Torque
ABSTRACT. This paper proposes two different magnetic non-volatile SRAM (MNV-SRAM) cell circuits for low-power and fast backup operation with a compact cell area. They employ perpendicular magnetic tunnel junctions (p-MTJs) as non-volatile backup storage elements and exploit spin-orbit torque (SOT) with the assistance of the voltage-controlled magnetic anisotropy (VCMA) effect, referred to as voltage-gated SOT (VGSOT), to perform the backup operation. Owing to the VCMA effect, the critical SOT write current for a 1-ns backup operation can be significantly reduced, resulting in high speed and low power consumption. Moreover, such a small write current can be driven by the cross-coupled inverters in the SRAM cell instead of a dedicated write driver, thereby leading to low cell-area overhead. Using a commercial 40 nm CMOS design kit and a physics-based VGSOT-MTJ model, we have demonstrated their functionality and evaluated their performance. Compared to previous MNV-SRAM cell circuits, the proposed cell circuits achieve lower backup energy dissipation, smaller backup delay and less cell-area overhead. Additionally, the proposed MNV-SRAM cell circuits can achieve field-free backup operation, making them suitable for practical applications.
Thermally Stable and Fast Perpendicular Shape Anisotropy Magnetic Tunnel Junction
ABSTRACT. Spin transfer torque magnetic random access memory (STT-MRAM) has shown great potential as a future universal memory. However, the core of STT-MRAM, the conventional perpendicular magnetic anisotropy (PMA) magnetic tunnel junction (MTJ), faces challenges in keeping a high thermal stability factor (ΔE), which is essential for reliable data storage. Although the perpendicular shape anisotropy (PSA) MTJ solves the ΔE problem, it still has the drawbacks of slow STT switching and a high risk of breakdown. In this paper, we propose a spin orbit torque (SOT)-assisted STT switching mechanism for the PSA MTJ. A SPICE model of the PSA MTJ is developed; it shows good agreement with experimental measurements and is a very useful tool for circuit design and simulation. The model shows that ΔE of the PSA MTJ can be up to 70. Thanks to the SOT-assisted STT switching mechanism, the switching time can be greatly reduced. Finally, simulations of a non-volatile master flip-flop (NVMFF) circuit are performed to validate the device model.