ICRC Technical Session 4: Exploiting Emerging Devices for Complex Logic
Located in Amici (4th floor)
ICRC Technical Session 5: Emerging Devices for Scalable Computing
09:55 | A Hybrid RRAM-CMOS Platform for Prototyping Digital and Analog Projects ABSTRACT. RRAM offers outstanding opportunities to integrate logic and memory tightly and enables low-power computing, in particular for artificial intelligence models and neuromorphic computing. Unfortunately, the behavior of RRAM is highly complex and partly stochastic: typical device models rely on lookup tables, simple analytic expressions, or compute-intensive physics-based simulations, yet none provides a truly accurate prediction of device behavior. It is therefore essential to prototype computing concepts involving RRAM experimentally. We present a fabricated and tested prototyping platform associating an array of 8,192 hafnium-oxide-based RRAM devices with a collection of CMOS periphery circuits. Our platform is multi-paradigm, permitting the prototyping of a wide range of both digital and analog projects. |
10:05 | Evaluation of Ferroelectric Devices for Neuromorphic Applications ABSTRACT. Hafnia-based ferroelectric devices are promising for future embedded electronics, but their performance is largely untested when integrated into neuromorphic computing architectures. In this work, we investigate how these devices perform in a variety of classification and control tasks. Within our framework, we show that the read voltage applied to these devices can affect performance, but overall, systems using these devices as synapses achieve performance comparable to or better than neuromorphic systems with more continuous, higher-precision synaptic weights. |
10:25 | Accurate and Efficient Reservoir Computing with a Multifunctional Proton-Copper ECRAM PRESENTER: Caroline Smith ABSTRACT. The prevalent adoption and scaling of artificial neural networks (ANNs) has led to the development of alternative hardware architectures, such as analog in-memory computing (AIMC), which have energy-efficiency advantages over traditional von Neumann architectures [2]. AIMC systems based on resistive memory (memristor) arrays enable parallel computation, providing high computational capacity and energy efficiency [1], [2]. Memristor-based AIMC architectures are suitable for implementing recurrent neural networks (RNNs), which are used for machine-learning tasks involving time series and sequences. However, memristor-based RNN implementations face challenges such as convergence issues, vanishing gradients, and sensitivity to memristor conductance variability and system noise [2]. These challenges are circumvented by a class of RNNs called reservoir computing (RC) systems. A general RC system (Fig. 1(a)) includes an input layer, a reservoir, and a readout layer [2], [3]. The RC system receives a temporal signal at the input layer, which passes the signal (via Win) to the reservoir, a recursive network configured to exhibit dynamic behavior [2]. The reservoir dynamics map the signal onto a higher-dimensional space, making its features linearly separable and improving classification and prediction capabilities [3]. Unlike in a traditional RNN, the reservoir's weights (WR) are random and fixed, making the reservoir more suitable for hardware implementation. The reservoir output is combined using output weights (Wout), and only Wout is trained, making RC training more efficient and less complex than other computing methods [2]. Once training is completed, devices with static behavior can be used to represent Wout. Typical outputs include classification or prediction of the input signal. In memristor-based AIMC RC hardware, dynamic (i.e., volatile) devices serve as the reservoir. The reservoir's weights correspond to the conductance of these devices; a high (low) conductance state translates to a large (small) weight [2]. The outputs of the dynamic devices in the reservoir network are then received by the readout layer of the RC system, which is made up of an array of devices exhibiting static behavior. During inference, the devices in the readout layer retain specific conductance values that represent the trained weights, allowing both the trained weights and the summation of all reservoir states to be evaluated, giving the final RC output [2]. RC systems typically use two-terminal resistive memory to represent the weights. In that case, two different device behaviors are required for the reservoir and the readout layer: a fast decay for the former and a highly stable state for the latter [2]. While this configuration has achieved high energy efficiency and accuracy, the use of two device types reduces system flexibility and requires a complex hardware-integration scheme. Recently, an electrochemical random-access memory (ECRAM) device, illustrated in Fig. 2(a), has been developed that exhibits both the dynamic (Fig. 2(b)) and static (Fig. 2(c)-(d)) behavior required for performing tasks such as prediction and classification in a fully analog RC system.
Typical ECRAM cells modulate channel conductance using a single ion species, such as protons or oxygen vacancies. Our device instead utilizes two ion species: protons, which hold a transient state with short retention, and copper ions, which hold a state that is stable over long periods of time. In this paper, the accuracy of a dual-ion ECRAM-based AIMC RC system is evaluated in simulations based on physical data for both the copper- and proton-controlled states, which represent the readout layer and the reservoir, respectively. Experimentally characterized nonidealities, including drift, programming error, noise, and dynamic decay over time, have been measured and simulated using a diffusion-drift model. The accuracy assessment uses our custom code to model the dynamic layer and CrossSim for the output layer [4]. Temporal inputs, such as Mackey-Glass signals and electrocardiogram (ECG) data, are used to benchmark the system's accuracy and efficiency, including classification and prediction of Mackey-Glass waveforms and arrhythmia detection in ECG. REFERENCES [1] T. P. Xiao et al., Applied Physics Reviews, vol. 7, 031301, 2020. [2] Y. Zhong et al., Nature Electronics, vol. 5, no. 10, pp. 672–681, 2022. [3] G. Tanaka et al., Neural Networks, vol. 115, pp. 100–123, 2019. [4] T. P. Xiao et al., "CrossSim Inference Manual (v2.0)," 2022. |
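For readers unfamiliar with the RC structure this abstract describes, here is a minimal NumPy sketch that mirrors it: Win and WR are random and stay fixed, a leaky nonlinear state update stands in for the volatile reservoir devices, and only Wout is trained, here by ridge regression on a one-step-ahead prediction task. All sizes and constants are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res, T = 1, 100, 1000   # 1-D input, 100 reservoir nodes (illustrative)

# Input and reservoir weights are random and FIXED (never trained).
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W_r = rng.normal(0, 1, (n_res, n_res))
W_r *= 0.9 / max(abs(np.linalg.eigvals(W_r)))   # spectral radius < 1 for stable dynamics

# Drive the reservoir with a toy temporal signal; leaky integration plays
# the role of the volatile (fast-decaying) device conductance.
u = np.sin(0.1 * np.arange(T))[:, None]
x, leak = np.zeros(n_res), 0.3
states = np.zeros((T, n_res))
for t in range(T):
    x = (1 - leak) * x + leak * np.tanh(W_in @ u[t] + W_r @ x)
    states[t] = x

# Only the readout Wout is trained (ridge regression), here to predict
# the input one step ahead.
target, S = np.roll(u, -1)[:-1], states[:-1]
W_out = np.linalg.solve(S.T @ S + 1e-6 * np.eye(n_res), S.T @ target)
print("train MSE:", float(np.mean((S @ W_out - target) ** 2)))
```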
10:35 | Scalable Spintronic Synapses for Analog In-Memory Computing Based on Exchange-Coupled Nanostructures ABSTRACT. The pursuit of high-performance, energy-efficient computing for data-intensive algorithms such as deep neural networks (DNNs) opens up exciting opportunities for emerging non-volatile memories (NVMs). In particular, implementing such non-volatile memory units in crossbar arrays as weight-matrix storage provides a highly parallel and efficient means of processing matrix-vector multiplications, supplying the synaptic functionality of the neuromorphic computing paradigm. While numerous memristive and phase-change device systems have been investigated for synaptic crossbar arrays, it remains challenging to provide a robust and efficient device technology for multi-bit (analog) synapses. In this work, a multi-level spintronic device based on a magnetic tunnel junction (MTJ) is proposed and studied. By exchange-coupling a standard MTJ free layer with a granular magnetic nanostructure, multiple near-continuous resistive states can be induced thanks to the distribution of the energy barrier among individual magnetic grains. Our simulation analysis demonstrates superior scalability with small variability compared to other approaches to multi-level devices. System-level simulation shows that enabling 2-bit-per-cell MRAM crossbars yields up to a 3.4x improvement in hardware efficiency while maintaining inference accuracy. |
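As a rough illustration of the 2-bit-per-cell crossbar idea, the sketch below quantizes a toy weight matrix onto four conductance levels and performs the matrix-vector product with a small device-to-device variability; the level spacing and noise figure are invented for illustration, not measurements from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Quantize trained weights to 2 bits per cell: four conductance levels.
W = rng.normal(0, 1, (4, 8))                    # toy trained weights
levels = np.linspace(W.min(), W.max(), 4)       # 2-bit codebook (illustrative)
W_q = levels[np.abs(W[..., None] - levels).argmin(-1)]   # nearest level

# Device-to-device variability: each cell deviates slightly from its
# nominal level (sigma is an assumed, illustrative figure).
G = W_q + rng.normal(0, 0.02, W_q.shape)

# A crossbar evaluates the matrix-vector product in one step:
# currents sum along the array wires (Kirchhoff's current law).
x = rng.normal(0, 1, 8)
print("ideal :", W @ x)
print("analog:", G @ x)
```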
Located in Amici (4th floor)
ICRC Technical Session 6: Temporal & Spiking Neuromorphic Systems
Located in Above Ash (16th floor)
Keynote
Located in Amici (4th floor)
ICRC Technical Session 7: New Frontiers in Quantum Computing
14:55 | Superconnectors: A Latency-Insensitive Approach to SFQ Design ABSTRACT. Superconducting logic offers the potential for incredibly high-speed, low-power computation, owing to its gate-level clocking and lack of resistive losses. This results in large, statically scheduled pipelines with clock speeds up to 50 GHz, offering orders-of-magnitude better throughput than modern digital systems. To fully utilize these pipelines, data must be very carefully orchestrated both outside the system and within the pipeline itself. However, memory systems, data-dependent operations, and IO introduce timing uncertainty that can significantly degrade throughput and utilization. In the digital domain a rich set of latency-insensitive design (LID) principles exists for this problem, but the tight combinational feedback inherent to their operation introduces a new set of challenges when integrated with single flux quanta (SFQ) based designs. We investigate these challenges by examining two classical methods for LID, the ready/valid protocol and LID-1ss. We show how a naive, direct implementation of these protocols forfeits much of the benefit of superconducting logic. We then explore how LID-1ss can be optimized for SFQ, resulting in better throughput and a simpler design. However, this optimized version still significantly reduces the maximum potential throughput, motivating us to propose Superconnectors: a novel SFQ-specific architecture for LID. Superconnectors leverage passive transmission-line buffers, asynchronous race-logic control signals, and batched transactions for a hardware design that has minimal impact on the underlying logic and throughput. We demonstrate Superconnectors with a merge operation on an array of multipliers, introducing 44% less pipeline stall than the optimized LID-1ss approach with no impact on the achievable clock speed of the underlying module. |
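To make the baseline concrete, here is a toy cycle-level Python simulation of the classical ready/valid protocol: a consumer that stalls on a fraction of cycles back-pressures the whole pipeline through a ready chain that, in hardware, is a backward combinational path, precisely the feedback that is awkward in gate-clocked SFQ. Depth, stall rate, and cycle count are assumptions for illustration.

```python
import random

random.seed(0)
DEPTH, CYCLES = 8, 10_000

pipe = [None] * DEPTH      # one register per stage; None is a bubble
accepted = 0

for cycle in range(CYCLES):
    # Consumer stalls on 30% of cycles (assumed rate).
    sink_ready = random.random() > 0.3

    # Drain the last stage when the consumer is ready.
    if sink_ready and pipe[-1] is not None:
        accepted += 1
        pipe[-1] = None

    # Back-pressure: a stage advances only into an empty slot. In RTL this
    # "ready" chain is a combinational path running backward through the
    # whole pipeline -- the feedback that is hard to realize in SFQ.
    for i in reversed(range(1, DEPTH)):
        if pipe[i] is None:
            pipe[i], pipe[i - 1] = pipe[i - 1], None

    # Producer always offers new data.
    if pipe[0] is None:
        pipe[0] = cycle

print(f"sustained throughput: {accepted / CYCLES:.2f} items/cycle")
```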
15:15 | Neural Network Enhanced Robustness for Noisy Quantum Applications ABSTRACT. Due to the limitations of current NISQ systems, error-mitigation strategies are under development to alleviate the negative effects of error-inducing noise on quantum applications. This work proposes machine learning (ML) as an error-mitigation strategy, using ML to identify accurate solutions to a quantum application in the presence of noise. Methods of encoding the probabilistic solution space of a basis-encoded quantum algorithm are investigated to identify the characteristics that make good ML training inputs. A multilayer perceptron artificial neural network (MLP ANN) was trained on the results of 8-state and 16-state basis-encoded quantum applications, both in the presence of noise and in noise-free simulation. We demonstrate, using simulated quantum hardware and probabilistic noise models, that a sufficiently trained model can identify accurate solutions to a quantum application with over 90% precision and 80% recall on select data. The model makes confident predictions even with enough noise that the solutions cannot be determined by direct observation; when it cannot, it identifies the inconclusive experiments as candidates for other error-mitigation techniques. |
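A toy version of this pipeline is easy to sketch with scikit-learn: generate synthetic measurement histograms for an 8-state basis-encoded output under an assumed depolarizing-style noise model, and train an MLP to recover the correct basis state. The noise model, shot counts, and network size are all invented for illustration; the paper's encodings and noise models are more involved.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
N_STATES, SHOTS, N_RUNS = 8, 128, 2000

X, y = [], []
for _ in range(N_RUNS):
    solution = rng.integers(N_STATES)              # the "correct" basis state
    p = np.full(N_STATES, 0.8 / (N_STATES - 1))    # assumed noise floor
    p[solution] = 0.2                              # weak signal above the floor
    X.append(rng.multinomial(SHOTS, p) / SHOTS)    # measured histogram = ML input
    y.append(solution)

X_tr, X_te, y_tr, y_te = train_test_split(np.array(X), np.array(y), random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)
print("precision:", precision_score(y_te, pred, average="macro"))
print("recall   :", recall_score(y_te, pred, average="macro"))
```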
15:35 | The Dilemma of Random Parameter Initialization and Barren Plateaus in Variational Quantum Algorithms PRESENTER: Muhammad Kashif ABSTRACT. This paper presents an easy-to-implement approach to mitigating the challenges posed by barren plateaus (BPs) in randomly initialized parameterized quantum circuits (PQCs) within variational quantum algorithms (VQAs). Recent research offers a plethora of specialized strategies to overcome BPs; however, our analysis reveals that these challenging and resource-heavy techniques may not be required. Instead, a careful selection of the distribution range used to initialize the parameters of PQCs can effectively address the issue without complex modifications. We systematically investigate how different ranges of randomly generated parameters influence the occurrence of BPs in VQAs, providing a straightforward yet effective strategy that significantly mitigates BPs and ultimately improves the efficiency and feasibility of VQAs. This method simplifies the implementation process and considerably reduces the computational overhead associated with more complex initialization schemes. Our comprehensive empirical validation demonstrates the viability of the approach, highlighting its potential to make VQAs more accessible and practical for a broader range of quantum computing applications. Our work also provides a clear path forward for quantum algorithm developers seeking to mitigate BPs and unlock the full potential of VQAs. |
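The effect is easy to reproduce on the product-of-cosines toy cost commonly used to illustrate barren plateaus: with parameters drawn from the full range, the gradient variance decays exponentially in the parameter count, while a narrow initialization range keeps it roughly constant. The toy cost and the ranges below are illustrative, not the paper's circuits.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_var(n_params, half_range, samples=5000):
    """Variance of dC/dtheta_0 for the toy cost C = prod(cos(theta_i)),
    with parameters drawn uniformly from [-half_range, +half_range]."""
    theta = rng.uniform(-half_range, half_range, (samples, n_params))
    grad = -np.sin(theta[:, 0]) * np.prod(np.cos(theta[:, 1:]), axis=1)
    return grad.var()

for n in (4, 16, 64):
    wide = grad_var(n, np.pi)          # "random" init over the full range
    narrow = grad_var(n, np.pi / 16)   # narrow range (assumed choice)
    print(f"n={n:3d}  var(full range)={wide:.2e}  var(narrow)={narrow:.2e}")
```

On this toy, the full-range variance falls off as roughly 2^(-n) (the barren plateau), while the narrow-range variance stays near 1e-2 regardless of n, which is the intuition behind restricting the initialization range.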
15:55 | COMPASS: Compiler Pass Selection for Improving Fidelity of NISQ Applications PRESENTER: Siddharth Dangwal ABSTRACT. Noisy qubit devices limit the fidelity of programs executed on near-term or Noisy Intermediate-Scale Quantum (NISQ) systems. The fidelity of NISQ applications can be improved by applying various optimizations during program compilation (or transpilation). These optimizations, or passes, are designed to minimize circuit depth (or program duration), steer more computations onto devices with the lowest error rates, and reduce the communication overheads involved in performing two-qubit operations between non-adjacent qubits. Additionally, standalone optimizations have been proposed to reduce the impact of crosstalk, measurement, idling, and correlated errors. However, our experiments on real IBM quantum hardware show that using all optimizations simultaneously often yields sub-optimal performance; the highest improvement in application fidelity is obtained when only a subset of passes is used. Unfortunately, identifying the optimal pass combination is non-trivial, as it depends on application- and device-specific properties. In this paper, we propose COMPASS, an automated software framework for optimal Compiler Pass Selection for quantum programs. COMPASS uses dummy circuits that resemble a given program but are composed only of Clifford gates and can thus be efficiently simulated classically to obtain their correct output. The optimal pass set for the dummy circuit is identified by evaluating the efficacy of different pass combinations, and this set is then used to compile the given program. Our experiments on real IBMQ machines show that COMPASS improves application fidelity by 4.3x on average and by up to 248.8x compared to the baseline. However, the complexity of this search scales exponentially in the number of compiler passes. To overcome this drawback, we propose Efficient COMPASS (E-COMPASS), which leverages a divide-and-conquer approach, splitting the passes into sub-groups and exhaustively searching within each sub-group. Our evaluations show that E-COMPASS improves fidelity by 3.0x on average and by up to 257.1x compared to the baseline, while reducing COMPASS's overheads by 200x on average and up to 327x. |
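Schematically, the two searches can be sketched as below. The pass names are invented, and dummy_circuit_score is a hypothetical stand-in for what COMPASS actually does (compile a Clifford proxy of the program with a pass subset, run it, and compare against its classically computed ideal output); only the search structure is the point here.

```python
import random
from itertools import chain, combinations

PASSES = ["depth_opt", "noise_aware_map", "routing_opt",
          "crosstalk_mit", "meas_mit", "idle_mit"]      # illustrative names

def dummy_circuit_score(pass_set):
    # Hypothetical stand-in for: fidelity of a classically simulable
    # Clifford "dummy" circuit compiled with these passes and run on hardware.
    return random.Random(hash(frozenset(pass_set))).random()

def powerset(items):
    return chain.from_iterable(combinations(items, k) for k in range(len(items) + 1))

# COMPASS: exhaustive search over all 2^N pass subsets (exponential in N).
def compass(passes=PASSES):
    return max(powerset(passes), key=dummy_circuit_score)

# E-COMPASS: divide and conquer -- split the passes into sub-groups, search
# each sub-group exhaustively, and concatenate the winning subsets.
def e_compass(passes=PASSES, group_size=3):
    chosen = []
    for i in range(0, len(passes), group_size):
        best = max(powerset(passes[i:i + group_size]), key=dummy_circuit_score)
        chosen.extend(best)
    return tuple(chosen)

print("COMPASS  :", compass())
print("E-COMPASS:", e_compass())
```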
Located in Amici (4th floor)
ICRC Technical Session 8: New Designs for Hardware Accelerators
16:25 | PRACO: A Photonic Residue-Number Architecture Support for Collective Operations PRESENTER: Jiaxin Peng ABSTRACT. Collective operations, which involve multiple or all processes/threads within a parallel program, are critical in multi-core chips and parallel systems, particularly for data-intensive applications. However, many existing electrical network systems struggle to support these operations efficiently, leading to significant overhead and reduced performance. The synergy between photonic technologies and the Residue Number System offers an efficient alternative for computing, thanks to high parallelism and the opportunity for compute-in-network. Here, we propose PRACO, a Photonic Residue-Number Architecture for Collective Operations. It establishes a chip-level photonic network to accelerate communication, computation, and synchronization among the cores in the system. Additionally, its compute-in-network capability allows computations to be performed during data transmission, which minimizes data movement and further reduces latency and power consumption. Moreover, the network enables high-speed synchronization barriers, allowing faster coordination between cores and improved overall system efficiency. Compared to an electrical network, our simulation results show that PRACO achieves up to an order-of-magnitude speedup for collective operations, along with a significant decrease in energy consumption. |
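For readers unfamiliar with the Residue Number System, the self-contained sketch below shows why it suits in-network computation: each residue channel is independent, so additions and multiplications are carry-free and can proceed in parallel, for instance one photonic channel per modulus while data is in flight. The moduli are an illustrative choice.

```python
from math import prod

MODULI = (3, 5, 7)        # pairwise-coprime moduli (illustrative)
M = prod(MODULI)          # dynamic range: integers mod 105

def to_rns(x):
    return tuple(x % m for m in MODULI)

def from_rns(residues):
    # Chinese Remainder Theorem reconstruction.
    x = 0
    for r, m in zip(residues, MODULI):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)   # modular inverse of Mi mod m
    return x % M

# Residue channels never interact: add/multiply component-wise, no carries.
a, b = 17, 23
s = tuple((x + y) % m for x, y, m in zip(to_rns(a), to_rns(b), MODULI))
p = tuple((x * y) % m for x, y, m in zip(to_rns(a), to_rns(b), MODULI))
print(from_rns(s), (a + b) % M)   # 40 40
print(from_rns(p), (a * b) % M)   # 76 76
```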
16:45 | Accelerating PageRank Algorithmic Tasks with a New Programmable Hardware Architecture ABSTRACT. Addressing the growing demands of artificial intelligence (AI) and data analytics requires new computing approaches. In this paper, we propose a reconfigurable hardware accelerator designed specifically for AI and data-intensive applications. Our architecture features a messaging-based intelligent computing scheme that allows dynamic programming at runtime using a minimal instruction set. To assess the hardware's effectiveness, we conducted a simulation-based case study in a TSMC 28nm technology node, analyzing a protein network using the computationally demanding PageRank algorithm. The results demonstrate that our hardware can analyze a 5,000-node protein network in just 213.6 milliseconds over 100 iterations. These outcomes signify the potential of our design to achieve cutting-edge performance in next-generation AI applications. |
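For reference, the computation being accelerated is the classic PageRank power iteration, sketched below on a random directed graph (1,000 nodes here to keep the toy light; the abstract's case study uses a 5,000-node protein network and 100 iterations). The edge density and damping factor are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random sparse directed graph standing in for the protein network.
N, D = 1000, 0.85                                  # nodes, damping factor
A = (rng.random((N, N)) < 0.002).astype(float)     # adjacency (assumed density)
out_deg = A.sum(axis=1, keepdims=True)
# Row-stochastic transition matrix; dangling nodes jump uniformly.
P = np.divide(A, out_deg, out=np.full_like(A, 1.0 / N), where=out_deg > 0)

# Power iteration: r <- d * P^T r + (1 - d)/N, run for 100 iterations
# (the iteration count used in the abstract's case study).
r = np.full(N, 1.0 / N)
for _ in range(100):
    r = D * (P.T @ r) + (1 - D) / N

print("top-5 ranked nodes:", np.argsort(r)[-5:][::-1])
```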
17:05 | SPICEPilot: Navigating SPICE Code Generation and Simulation with AI Guidance ABSTRACT. Large Language Models (LLMs) have shown great potential in automating code generation; however, their ability to generate accurate circuit-level SPICE code remains limited by a lack of hardware-specific knowledge. To address this gap, we present SPICEPilot, a novel Python-based dataset generated using PySpice, along with its accompanying framework. This marks a significant step forward in automating SPICE code generation across a variety of circuit configurations. Our framework automates the creation of SPICE simulation scripts, introduces standardized benchmarking metrics to evaluate LLMs' ability to generate circuits, and outlines a roadmap for integrating LLMs into the hardware design process. By bridging the gap between advanced software tools and hardware design, SPICEPilot accelerates innovation in circuit solutions. |
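To give a flavor of PySpice-generated SPICE code of the kind such a dataset presumably contains, here is a minimal voltage-divider script: printing the circuit emits the SPICE netlist, and running the analysis requires an ngspice backend behind PySpice. This is a generic PySpice example, not code from the SPICEPilot dataset.

```python
from PySpice.Spice.Netlist import Circuit
from PySpice.Unit import u_V, u_kOhm

circuit = Circuit('Voltage Divider')
circuit.V('input', 'in', circuit.gnd, 10@u_V)   # 10 V source
circuit.R(1, 'in', 'out', 9@u_kOhm)             # R1 = 9 kOhm
circuit.R(2, 'out', circuit.gnd, 1@u_kOhm)      # R2 = 1 kOhm

print(circuit)   # emits the SPICE netlist for this circuit

# DC operating point (needs an installed ngspice simulator).
simulator = circuit.simulator(temperature=25, nominal_temperature=25)
analysis = simulator.operating_point()
print(float(analysis['out']))   # expect ~1 V across R2
```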
17:25 | Mixture of Experts with Mixture of Precisions for Tuning Quality of Service ABSTRACT. The increasing demand for deploying large Mixture-of-Experts (MoE) models in resource-constrained environments necessitates efficient approaches to their high memory and computational requirements. Moreover, given that tasks come with different user-defined constraints and that available resources change over time in multi-tenant environments, it is necessary to design an approach that provides a flexible configuration space. This paper presents an adaptive serving approach for the efficient deployment of MoE models, capitalizing on partial quantization of the experts. By dynamically determining the number of quantized experts and their distribution across CPU and GPU, our approach explores the Pareto frontier and offers a fine-grained range of configurations for tuning throughput and model quality. Our evaluation on an NVIDIA A100 GPU, using a Mixtral 8x7B MoE model on three language modelling benchmarks, demonstrates that token-generation throughput can be adjusted from 0.63 to 13.00 tokens per second. This enhancement comes with a marginal perplexity increase, from 3.81 to 4.00, from 13.59 to 14.17, and from 7.24 to 7.40 for the WikiText2, PTB, and C4 datasets respectively, under maximum quantization. These results highlight the practical applicability of our approach in dynamic, accuracy-sensitive applications where both memory usage and output quality are important. |
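The configuration space can be pictured with a small sketch: enumerate (number of quantized experts, number of those placed on CPU) plans, score each, and keep the Pareto frontier over throughput and perplexity. The cost model below is an invented toy; in the real system the measurement would be actual benchmark runs of the served model.

```python
N_EXPERTS = 8   # Mixtral 8x7B routes over 8 experts per layer

# Toy cost model (invented numbers): quantizing more experts raises
# throughput and slightly raises perplexity; placing quantized experts
# on CPU frees GPU memory but costs throughput.
def measure(plan):
    quantized, on_cpu = plan
    throughput = 0.6 + 1.5 * quantized - 0.8 * on_cpu   # tokens/s (toy)
    perplexity = 3.8 + 0.025 * quantized                # quality (toy)
    return throughput, perplexity

# Configuration space: (quantized expert count, how many of them on CPU).
plans = [(q, c) for q in range(N_EXPERTS + 1) for c in range(q + 1)]

# Pareto frontier: keep plans not dominated in (throughput up, perplexity down).
scored = [(p, *measure(p)) for p in plans]
frontier = [(p, t, x) for p, t, x in scored
            if not any(t2 >= t and x2 <= x and (t2, x2) != (t, x)
                       for _, t2, x2 in scored)]

for (q, c), tput, ppl in sorted(frontier):
    print(f"quantized={q} on_cpu={c}: {tput:5.2f} tok/s, ppl={ppl:.3f}")
```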
Tutorial: Cryogenic Computing