ICRC 2024: IEEE INTERNATIONAL CONFERENCE ON REBOOTING COMPUTING 2024
PROGRAM FOR TUESDAY, DECEMBER 17TH

08:00-08:30 Breakfast

Breakfast (Sponsored by Vaire Computing: vaire.co)

08:30-09:40 Session 12

ICRC Technical Session 4: Exploiting Emerging Devices for Complex Logic 

08:30
Development of power clock MEMS for applications in reversible quasi-adiabatic logic

ABSTRACT. Computing using quasi-adiabatic Reversible Logic (RL) [1] offers significant energy savings, making it highly promising for modern CMOS technology. In conventional computing, much of the energy is lost as heat due to the irreversible nature of most logic operations. The key advantage of the RL paradigm lies in its potential for energy recycling: instead of dissipating energy as heat, the system can recover and reuse it in subsequent cycles. If effectively implemented, this energy recycling could revolutionize computing efficiency. However, achieving efficient energy recycling (ER) in RL computing remains a complex challenge. In addition to logic that enables energy recovery, power supplies or power clocks are required that can actually supply energy and then later take back that energy and reuse it. This is not trivial, but one potential solution involves using MEMS resonators [2] to supply and then recycle energy. In essence, the RL CMOS circuit acts as the RC component of a parallel RLC resonator, where parameters such as resonant frequency and energy storage are governed by the MEMS element. The crucial feature of this system (MEMS resonator + RL logic) is its ability to transfer energy between its "inductive" and "capacitive" components, allowing the energy of bits used for computation (stored in CMOS capacitors) to be recycled. In operation, the resonator and RL logic system are driven by a clock oscillator, and once start-up transient processes are complete, the system requires additional energy only to replace losses due to heating and information erasure in any irreversible parts of the logic. AlN contour-mode resonators (Fig. 1a) are fabricated in 1 µm thick AlN films sputtered onto silicon substrates. Electrical characterization results are shown in Fig. 1b.
Using the parameters extracted from these MEMS resonators, we conducted simulations in Keysight Advanced Design System (ADS) to investigate their use in powering capacitive loads representing RL. As shown in Fig. 1c, the oscillator supply current decreases by 80% for capacitive loads up to 40 fF, which is equivalent to the capacitance of around 50 CMOS FETs fabricated in 90 nm technology. Our findings demonstrate that a significant amount of the energy used in RL logic can be recycled, highlighting the feasibility of implementing complementary RL logic with on-chip resonators.
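
As a sanity check on the energy-recycling picture, the resonant frequency and per-cycle replenishment of an LC tank follow from two textbook formulas; the component values below are illustrative assumptions, not the paper's measurements:

```python
import math

def resonant_frequency(L, C):
    """Resonant frequency of a parallel LC tank (Hz)."""
    return 1.0 / (2.0 * math.pi * math.sqrt(L * C))

def energy_replenished_per_cycle(E_stored, Q):
    """Energy the clock oscillator must supply each cycle to cover
    resistive losses: a resonator with quality factor Q dissipates
    2*pi*E/Q of its stored energy E per cycle."""
    return 2.0 * math.pi * E_stored / Q

# Illustrative numbers (assumptions, not from the paper):
C_load = 40e-15          # 40 fF capacitive load, ~50 FETs at 90 nm
L_eff  = 1e-3            # assumed effective motional inductance of the MEMS
V_dd   = 1.0
E_bits = 0.5 * C_load * V_dd ** 2   # energy stored in the logic capacitance

f0 = resonant_frequency(L_eff, C_load)
E_cycle = energy_replenished_per_cycle(E_bits, Q=1000)
print(f"f0 = {f0 / 1e6:.1f} MHz, recycled fraction = {1 - E_cycle / E_bits:.3f}")
```

With a quality factor of 1000, only about 0.6% of the stored bit energy must be resupplied each cycle; the rest is recycled by the tank.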

08:40
Thermodynamic Algorithms for Quadratic Programming

ABSTRACT. Thermodynamic computing has emerged as a promising paradigm for accelerating computation by harnessing the thermalization properties of physical systems. This work introduces a novel approach to solving quadratic programming problems using thermodynamic hardware. By incorporating a thermodynamic subroutine for solving linear systems into the interior-point method, we present a hybrid digital-analog algorithm that outperforms traditional digital algorithms in terms of speed. Notably, we achieve a polynomial asymptotic speedup compared to conventional digital approaches. Additionally, we simulate the algorithm for a support vector machine and predict substantial practical speedups with only minimal degradation in solution quality. Finally, we detail how our method can be applied to portfolio optimization and the simulation of nonlinear resistive networks.
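
The structure of such a hybrid solver is easy to sketch: an interior-point method repeatedly solves a linear system, and that inner solve is exactly what the thermodynamic hardware would replace. In the classical stand-in below, the solve is plain Gaussian elimination, and the KKT system shown is the equality-constrained case that the inequality-constrained interior-point loop would solve repeatedly:

```python
def solve_linear(M, v):
    """Classical stand-in for the thermodynamic linear-system subroutine:
    Gaussian elimination with partial pivoting."""
    n = len(v)
    A = [row[:] + [v[i]] for i, row in enumerate(M)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(A[r][k]))
        A[k], A[p] = A[p], A[k]
        for r in range(k + 1, n):
            f = A[r][k] / A[k][k]
            for c in range(k, n + 1):
                A[r][c] -= f * A[k][c]
    x = [0.0] * n
    for k in reversed(range(n)):
        x[k] = (A[k][n] - sum(A[k][c] * x[c] for c in range(k + 1, n))) / A[k][k]
    return x

def eq_qp(Q, c, A, b):
    """Solve min 0.5 x^T Q x + c^T x  s.t.  A x = b via one KKT solve:
        [Q  A^T] [x ]   [-c]
        [A   0 ] [mu] = [ b]"""
    n, m = len(c), len(b)
    K = [[0.0] * (n + m) for _ in range(n + m)]
    for i in range(n):
        for j in range(n):
            K[i][j] = Q[i][j]
        for j in range(m):
            K[i][n + j] = A[j][i]
            K[n + j][i] = A[j][i]
    rhs = [-ci for ci in c] + list(b)
    return solve_linear(K, rhs)[:n]

# min x1^2 + x2^2  s.t.  x1 + x2 = 1  ->  x = (0.5, 0.5)
print(eq_qp([[2, 0], [0, 2]], [0, 0], [[1, 1]], [1]))
```

The speedup claimed in the abstract comes from performing the `solve_linear` step on analog hardware; the surrounding algorithm is unchanged.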

09:00
Hyperdimensional Computing Provides Computational Paradigms for Oscillatory Systems

ABSTRACT. The increasing difficulty in continued development of digital electronic logic has led to a renewed interest in alternative approaches. Oscillatory computing is one such approach that leverages alternative physical systems and computation strategies, but it lacks high-level paradigms for system design and programming. We address this gap by describing a model based on hyperdimensional computing that serves as an "instruction set" to integrate oscillatory networks into algorithms for real-valued computing. The expressiveness and compositionality of these instructions allow oscillatory systems to implement both common tasks and novel functions, providing a clear computational role for many emerging hardware devices. We detail the computational primitives of this system, prove how they can be executed via oscillatory systems, quantify the performance of these operations, and apply them to execute multiple tasks including compression, factorization, and classification.
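
The primitives of such a hyperdimensional "instruction set" are compact enough to state directly. The sketch below shows the three standard operations (binding, bundling, similarity) on random bipolar hypervectors; this is the generic HDC formulation, not the oscillatory-hardware mapping developed in the paper:

```python
import random

D = 10000  # high dimension: random vectors are quasi-orthogonal

def hv(rng):
    """Random bipolar hypervector."""
    return [rng.choice((-1, 1)) for _ in range(D)]

def bind(a, b):   # elementwise multiply: invertible "key-value" pairing
    return [x * y for x, y in zip(a, b)]

def bundle(*vs):  # elementwise majority: superposition of several items
    return [1 if sum(col) > 0 else -1 for col in zip(*vs)]

def sim(a, b):    # normalized dot product in [-1, 1]
    return sum(x * y for x, y in zip(a, b)) / D

rng = random.Random(0)
key, val = hv(rng), hv(rng)
pair = bind(key, val)
print(sim(bind(pair, key), val))   # 1.0: binding with the key recovers val
print(sim(pair, val))              # ~0.0: the pair resembles neither input
```

Because binding is its own inverse and random vectors are nearly orthogonal, bound structures can be queried without interference, which is the property the paper maps onto oscillator phase dynamics.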

09:20
Exploration of the viability of TiN/TiOX ReRAM in Computational Random-Access Memory (CRAM)

ABSTRACT. In-memory computing is a promising solution for solving the von Neumann bottleneck. In particular, computational random-access memory (CRAM) is a promising form of in-memory computing where cascading logic operations can be performed directly within the memory array. However, a recent experiment utilizing magnetoresistive devices as the memory element in CRAM only gave the correct answer in 63% of trials. One way to improve the accuracy is to build CRAM cells using resistive devices with larger ON/OFF ratios. In this study, we explore the performance of CRAM using resistive random-access memory (ReRAM) cells. Using experimental data obtained from various TiN/TiOX-based ReRAM devices in Monte Carlo simulations, we determine that the performance of the full adder operation using ReRAM-based CRAM is still subject to the same inaccuracies as CRAM that utilizes magnetoresistive devices. However, our analysis reveals that by reducing the write voltages and removing the effects of complementary resistive switching in the ReRAM devices, 100% accuracy over 10,000 trials can be achieved.
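
The role of the ON/OFF ratio in logic accuracy can be illustrated with a toy Monte Carlo margin model. The circuit below is an invented stand-in, not the paper's CRAM cell: two input devices in parallel, a fixed read threshold, and lognormal device-to-device variability:

```python
import math, random

def sample_R(nominal, sigma, rng):
    """Lognormal device-to-device spread around a nominal resistance."""
    return nominal * math.exp(rng.gauss(0.0, sigma))

def gate_accuracy(r_on, r_off, sigma, trials, rng):
    """Toy CRAM read-margin test (an illustrative model, not the paper's
    circuit): two input devices in parallel; the output is read as
    'at least one input ON' when the parallel resistance falls below a
    threshold placed between the two nominal boundary cases."""
    r_low = r_on * r_off / (r_on + r_off)   # nominal (on, off) case
    r_high = r_off / 2.0                    # nominal (off, off) case
    thresh = math.sqrt(r_low * r_high)
    correct = 0
    for _ in range(trials):
        for a, b in ((r_on, r_on), (r_on, r_off), (r_off, r_on), (r_off, r_off)):
            ra = sample_R(a, sigma, rng)
            rb = sample_R(b, sigma, rng)
            r_par = ra * rb / (ra + rb)
            want_low = (a == r_on) or (b == r_on)
            if (r_par < thresh) == want_low:
                correct += 1
    return correct / (4.0 * trials)

rng = random.Random(1)
acc10 = gate_accuracy(1e3, 1e4, 0.5, 2500, rng)    # 10x ON/OFF ratio
acc1k = gate_accuracy(1e3, 1e6, 0.5, 2500, rng)    # 1000x ON/OFF ratio
print(f"10x: {acc10:.3f}   1000x: {acc1k:.3f}")
```

Even in this crude model, widening the ON/OFF ratio moves the resistance distributions apart and eliminates the read errors that variability causes at a 10x ratio.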

09:40-09:55 Coffee Break

Located in Amici (4th floor)

09:55-10:55 Session 13

ICRC Technical Session 5: Emerging Devices for Scalable Computing

09:55
A Hybrid RRAM-CMOS Platform for Prototyping Digital and Analog Projects

ABSTRACT. RRAM provides fantastic opportunities to integrate logic and memory tightly and allows low-power computing, in particular for Artificial Intelligence models and neuromorphic computing. Unfortunately, the behavior of RRAM is highly complex and partly stochastic: typical device models rely on lookup tables, simple analytic models, or compute-intensive physics-based models, yet they never provide a truly accurate prediction of device behavior. It is therefore essential to prototype computing concepts involving RRAM experimentally. We present a fabricated and tested prototyping platform, associating an array of 8,192 hafnium-oxide-based RRAM cells and a collection of CMOS periphery circuits. Our platform is multi-paradigm, which permits prototyping a wide range of both digital and analog projects.

10:05
Evaluation of Ferroelectric Devices for Neuromorphic Applications

ABSTRACT. Hafnia-based ferroelectric devices are promising for future embedded electronics, but their performance is largely untested when integrated into neuromorphic computing architectures. In this work, we investigate how these devices perform in a variety of classification and control tasks. Within our framework, we show that different read voltages on these devices can have an impact on performance, but overall, these devices as synapses can achieve comparable or better performance within a neuromorphic system than neuromorphic systems with more continuous, higher precision synaptic weights.

10:25
Accurate and Efficient Reservoir Computing with a Multifunctional Proton-Copper ECRAM
PRESENTER: Caroline Smith

ABSTRACT. The prevalent adoption and scaling of artificial neural networks (ANNs) have led to the development of alternative hardware architectures, such as analog in-memory computing (AIMC), which have energy efficiency advantages over traditional von Neumann architectures [2]. AIMC systems based on resistive memory (or memristor) arrays enable parallel computations, providing high computational capacity and energy efficiency [1], [2]. Memristor-based AIMC architectures are suitable for implementing recurrent neural networks (RNNs), which are used for machine learning tasks related to time series and sequences. However, memristor-based RNN implementations face challenges, such as convergence issues, vanishing gradients, and sensitivity to memristor conductance variability and system noise [2]. These challenges are circumvented by a class of RNN called reservoir computing (RC) systems. A general RC system (Fig. 1(a)) includes an input layer, a reservoir, and a readout layer [2], [3]. The RC system is configured to receive a temporal signal at the input layer. The input layer passes the signal (Win) to the reservoir, which includes a recursive network configured to exhibit dynamic behavior [2]. Reservoir dynamics enable the mapping of the signal onto a higher-dimensional space, making its features linearly separable and improving classification and prediction capabilities [3]. Unlike a traditional RNN, the reservoir’s weights (WR) are random and fixed, making it more suitable for hardware implementation. The reservoir output is found using output weights (Wout), with only the Wout being trained during the training process, thereby making RC system training more efficient and less complex than other computing methods [2]. Once this training is completed, devices with static behavior can be used to represent the Wout. Typical outputs include classification or prediction of the input signal.

In memristor-based AIMC RC hardware implementations, dynamic (i.e., volatile) devices serve as the reservoir. The reservoir’s weights correspond to the conductance of these devices; a high (low) conductance state translates to a large (small) weight [2]. The resulting output of these dynamic devices in the reservoir network can then be received by the readout layer of the RC system, which is made up of an array of devices exhibiting static behavior. During inference, the devices in the readout layer retain specific conductance values that represent the trained weights, allowing both the trained weights and the summation of all reservoir states to be evaluated, giving the final RC output [2]. RC systems typically use two-terminal resistive memory to represent the weights. In this case, two different device behaviors are required for the reservoir and the readout layer. Specifically, a fast decay is required for the former, while the latter requires a highly stable state [2]. While this configuration has been successful in achieving high energy efficiency and accuracy, the use of two types of devices reduces system flexibility and requires a complex hardware integration scheme. Recently, an electrochemical random-access memory (ECRAM) device, illustrated in Fig. 2(a), has been developed which exhibits both the dynamic (Fig. 2(b)) and static behavior (Fig. 2(c)-(d)) required for performing tasks such as prediction and classification in a fully analog RC system. Typical ECRAM cells modulate channel conductance using a single ion species, such as protons or oxygen vacancies. Our device instead utilizes two ion species: protons, which hold a transient state with short retention, and copper ions, which hold a state that is stable over long periods of time. In this paper, the accuracy of a dual-ion ECRAM-based AIMC RC system is evaluated in simulations based on physical data representing both the copper- and proton-controlled states, representing the readout layer and the reservoir, respectively.

Experimentally characterized nonidealities, including drift and programming error, noise, and dynamic decay over time, have been measured and simulated using a diffusion-drift model. The accuracy assessment utilizes our custom code to model the dynamic layer accuracy and CrossSim for the output layer [4]. Temporal inputs, such as Mackey-Glass signals and electrocardiogram (ECG) data, are used to benchmark the system’s accuracy and efficiency. This includes classification and prediction of Mackey-Glass waveforms and arrhythmia detection in ECG.
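
A reservoir computer of the kind described, with fixed random reservoir weights and only a linear readout trained, can be sketched without any special hardware. The sketch below is a generic echo state network on a one-step-ahead sine prediction task, with arbitrary sizes and scalings standing in for the ECRAM reservoir and readout:

```python
import math, random

def solve(M, v):
    """Small dense linear solve (Gaussian elimination, partial pivoting)."""
    n = len(v)
    A = [row[:] + [v[i]] for i, row in enumerate(M)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(A[r][k]))
        A[k], A[p] = A[p], A[k]
        for r in range(k + 1, n):
            f = A[r][k] / A[k][k]
            for c in range(k, n + 1):
                A[r][c] -= f * A[k][c]
    x = [0.0] * n
    for k in reversed(range(n)):
        x[k] = (A[k][n] - sum(A[k][c] * x[c] for c in range(k + 1, n))) / A[k][k]
    return x

rng = random.Random(0)
N, T, washout, lam = 20, 400, 50, 1e-6
W_in = [rng.uniform(-1, 1) for _ in range(N)]                       # fixed
W = [[rng.uniform(-0.3, 0.3) for _ in range(N)] for _ in range(N)]  # fixed, random

u = [math.sin(0.2 * t) for t in range(T + 1)]
x = [0.0] * N
states, targets = [], []
for t in range(T):
    x = [math.tanh(W_in[i] * u[t] + sum(W[i][j] * x[j] for j in range(N)))
         for i in range(N)]
    if t >= washout:                 # discard the initial transient
        states.append(x[:])
        targets.append(u[t + 1])     # task: one-step-ahead prediction

# Train ONLY the readout: ridge regression (S^T S + lam I) w = S^T y
G = [[sum(s[i] * s[j] for s in states) + (lam if i == j else 0.0)
      for j in range(N)] for i in range(N)]
rhs = [sum(s[i] * y for s, y in zip(states, targets)) for i in range(N)]
w_out = solve(G, rhs)

mse = sum((sum(wi * si for wi, si in zip(w_out, s)) - y) ** 2
          for s, y in zip(states, targets)) / len(states)
print(f"readout training MSE: {mse:.2e}")
```

In a hardware RC system, the reservoir update would be performed by the volatile (proton) devices and `w_out` would be stored in the stable (copper) states; only that last vector is ever trained.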

REFERENCES
[1] T. P. Xiao et al., Applied Physics Reviews, vol. 7, 031301, 2020.
[2] Y. Zhong et al., Nature Electronics, vol. 5, no. 10, pp. 672–681, 2022.
[3] G. Tanaka et al., Neural Networks, vol. 115, pp. 100–123, 2019.
[4] T. P. Xiao et al., "CrossSim Inference Manual (v2.0)", 2022.

10:35
Scalable Spintronic Synapses for Analog In-Memory Computing Based on Exchange-Coupled Nanostructures

ABSTRACT. The pursuit of high-performance and energy-efficient computing for data-intensive algorithms such as deep neural networks (DNN) opens up exciting opportunities for emerging non-volatile memories (NVM). Particularly, implementing such non-volatile memory units in crossbar arrays as weight matrix storage can provide highly parallel and efficient means of processing matrix-vector multiplications, providing synaptic functionality for the neuromorphic computing paradigm. While numerous memristive and phase-change device systems have been investigated for synaptic crossbar arrays, it remains challenging to provide robust and efficient device technology for multi-bit (analog) synapses. In this work, a multi-level spintronic device based on a magnetic tunnel junction (MTJ) device is proposed and studied. By integrating a standard MTJ free layer exchange coupled with a granular magnetic nanostructure, multiple near-continuous resistive states can be induced thanks to the distribution of the energy barrier among individual magnetic grains. Our simulation analysis demonstrated superior scalability with small variability compared to other means of multi-level devices. System-level simulation demonstrates that enabling 2-bit-per-cell MRAM crossbars leads to up to a 3.4x improvement in hardware efficiency while maintaining the inference accuracy.

10:55-11:05 Coffee Break

Located in Amici (4th floor)

11:05-12:45 Session 14

ICRC Technical Session 6: Temporal & Spiking Neuromorphic Systems

11:05
Mixed Delay/Nondelay Embeddings Based Neuromorphic Computing with Patterned Nanomagnet Arrays

ABSTRACT. Patterned nanomagnet arrays (PNAs) have been shown to exhibit a strong geometrically frustrated dipole interaction. Some PNAs have also shown emergent domain wall dynamics. Previous works have demonstrated methods to physically probe these magnetization dynamics of PNAs to realize neuromorphic reservoir systems that exhibit chaotic dynamical behavior and high-dimensional nonlinearity. These PNA reservoir systems from prior works leverage echo state properties and linear/nonlinear memory capacities of component reservoir nodes to map and preserve the dynamical information of the input time-series data into nondelay spatial embeddings. Such mapping enables these PNA reservoir systems to imitate and predict/forecast the input time series data. However, these prior PNA reservoir systems are based solely on the nondelay spatial embeddings obtained at component reservoir nodes. As a result, they require a massive number of component reservoir nodes, or a very large spatial embedding (i.e., a high-dimensional spatial embedding) per reservoir node, or both, to achieve acceptable imitation and prediction accuracy. This requirement reduces the practical feasibility of such PNA reservoir systems. To address this shortcoming, we present a mixed delay/nondelay embeddings-based PNA reservoir system. Our system comprises a single PNA reservoir node with the ability to obtain mixed delay/nondelay embeddings of the dynamical information of the time-series data applied at the input of the PNA reservoir node. Our analysis shows that when these mixed delay/nondelay embeddings are used to train a perceptron at the output layer, our reservoir system outperforms existing PNA-based reservoir systems for the imitation of NARMA 2, NARMA 5, NARMA 7, and NARMA 10 time series signals, and for the short-term and long-term prediction of the Mackey-Glass time series signal.
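
The NARMA imitation benchmarks referenced above have short standard definitions. The sketch below generates the NARMA-2 series under one commonly used form (the exact variant used in the paper is not specified here):

```python
import random

def narma2(u):
    """NARMA-2 benchmark series (one commonly used definition):
    y[t+1] = 0.4 y[t] + 0.4 y[t] y[t-1] + 0.6 u[t]^3 + 0.1
    Reservoir systems are scored on how well they imitate y given u."""
    y = [0.0, 0.0]
    for t in range(1, len(u) - 1):
        y.append(0.4 * y[t] + 0.4 * y[t] * y[t - 1] + 0.6 * u[t] ** 3 + 0.1)
    return y

rng = random.Random(42)
u = [rng.uniform(0.0, 0.5) for _ in range(200)]   # standard input range
y = narma2(u)
print(min(y), max(y))   # the series stays bounded, roughly in [0, 0.4]
```

The higher-order NARMA-N variants add longer sums over past outputs, which is what makes them a memory-capacity stress test for a reservoir.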

11:25
A Configurable CPG Controller using Connectome based SNN on FPGA for Robot Locomotion

ABSTRACT. Movement control in autonomous robots requires low-power, real-time models, especially for bio-mimetic locomotion in challenging terrains. Human intervention is often impractical in such environments, making specialized neural networks like Central Pattern Generators critical for offloading computational resources. This paper presents a controller for a bipedal robot using a modified spiking neuron model, adapted for efficient deployment on field-programmable gate arrays. Key modifications were made to ensure lightweight, real-time performance. Additionally, we leverage a unique open-source neuromorphic software platform for the network design and deployment, making the technology accessible to developers aiming to implement autonomous robot locomotion.

11:45
Encoding Numbers with Graded Spikes

ABSTRACT. Neuromorphic computing is a promising avenue to improve processors past the limits imposed by the decline of Moore’s Law and Dennard Scaling. However, some neuromorphic computing workflows are heavily bottlenecked by passing data in-between a neuromorphic processor and a CPU because intermediate values need to be processed with arithmetic and Boolean operations. An active research field seeks to eliminate this bottleneck by executing arithmetic and Boolean operations on neuromorphic processors. Specifically, researchers encode numbers in spikes, then use hand-crafted spiking neural networks (SNNs) to perform the necessary operations on the encoded numbers. The performance of an encoding is often evaluated using addition as a proxy for the other arithmetic and boolean operations. However, how to effectively utilize graded spikes, which enable neurons to send and receive an integer value with each spike, is still an open question. Answering this question is challenging because we lack neuron models that support graded spikes. We propose the Overflow neuron model, which does support graded spikes, to mitigate this challenge. We then use the Overflow neuron to create three novel unsigned integer encodings and adder SNNs that take advantage of graded spikes. We compare how our graded spike-based encodings scale against existing encodings, demonstrating that using graded spikes can decrease the number of neurons, synapses, and spikes in SNNs. Finally, we estimate the hardware costs of using graded spikes and show that the reduction in neurons, synapses, and spikes from using graded spike-based encodings translates into a reduction in power consumption and latency. In summary, we utilize graded spikes to perform addition on neuromorphic processors and show that our graded spike-based methods outperform non-graded spike-based methods.
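
The encodings and the Overflow neuron are the paper's contributions; the sketch below is a generic digit-serial reading of the idea, with invented parameter choices (base 16, little-endian digit order): an integer is split into base-g digits, each carried by one graded spike, and addition proceeds digit by digit with the accumulated overflow carried forward.

```python
def encode(x, n, g):
    """Unsigned integer -> n little-endian base-g digits,
    each transmitted as one graded spike with payload in [0, g-1]."""
    out = []
    for _ in range(n):
        x, d = divmod(x, g)
        out.append(d)
    return out

def decode(digits, g):
    v = 0
    for d in reversed(digits):
        v = v * g + d
    return v

def overflow_add(a_spikes, b_spikes, g):
    """Digit-serial adder in the spirit of an overflow-style neuron: the
    neuron accumulates the two incoming graded payloads plus the carry;
    when the potential reaches g it emits the remainder and carries 1."""
    out, carry = [], 0
    for da, db in zip(a_spikes, b_spikes):
        carry, rem = divmod(da + db + carry, g)
        out.append(rem)
    out.append(carry)   # final carry emitted as one extra spike
    return out

g, n = 16, 4
a, b = 48813, 9999
s = overflow_add(encode(a, n, g), encode(b, n, g), g)
print(decode(s, g) == a + b)   # True
```

Compared with one-spike-per-unary-count schemes, each graded spike here carries log2(g) bits, which is the source of the neuron, synapse, and spike savings the abstract reports.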

12:05
Harnessing Dendritic Computation for Neuromorphic Architectures (Invited)

ABSTRACT. Neuromorphic computing is an emerging paradigm inspired by the brain, offering a promising approach to enhance the computational efficiency of next-generation computing architectures. In nature, brains execute complex operations with significantly lower energy consumption compared to traditional computing methods. Current neuromorphic systems primarily focus on scalability, specifically by increasing the number of computational units (neurons) and the weighted connections between these units (synapses). However, achieving brain-like cognition and efficiency in future computing hardware requires greater functional complexity along with scalability.

 

In this presentation, I will discuss our research aimed at integrating dendrites for "compute-on-wire" capabilities within neuromorphic architectures. This integration seeks to enhance both the computational complexity—such as the number of programmable parameters and nonlinear dynamics—and the computational efficiency (energy per computation) of neural networks. I will showcase neuromorphic dendrite elements that can be applied across various applications, including neuroscience-inspired direction-selective circuits, a neural network featuring active dendrites that utilizes shunting inhibition, as well as demonstrate the advantages of incorporating dendrites into deep neural networks. To conclude, I will talk about co-design tools we have developed to leverage analog devices and circuits in these systems to design next-generation neuromorphic architectures that incorporate dendritic computation.

 

SNL is managed and operated by NTESS under DOE NNSA contract DE-NA0003525.

12:25
The RISP Neuroprocessor - Open Source Support for Embedded Neuromorphic Computing

ABSTRACT. Neuromorphic computing offers exciting possibilities for embedded systems and edge computing, due to its combination of computational ability and low size, weight, and power. However, open source solutions for embedded neuromorphic computing are lacking. In this paper, we present open source support for the RISP neuroprocessor, which features simple integrate-and-fire neurons and synapses with discrete delays. There are two software repositories to support RISP -- one that provides simulation and network manipulation, and one that implements RISP networks on FPGAs. We detail each of these, discuss capacity and performance, and present examples. Highlights include the large networks supported by commodity FPGAs, with tens of thousands of neurons and synapses. The UART communication is a clear bottleneck; however, there are multiple straightforward avenues for improving communication.
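
The neuron model named above, integrate-and-fire with discrete synaptic delays, is simple enough to simulate in a few lines. The sketch below is a generic reading of that model with invented semantics for reset and input handling, not RISP's exact behavior as defined in its repositories:

```python
from collections import defaultdict

class Net:
    """Minimal discrete-time integrate-and-fire network with integer
    synaptic delays (a generic sketch, not RISP's exact semantics)."""
    def __init__(self, n, threshold=1.0):
        self.n, self.thr = n, threshold
        self.syn = []                        # (src, dst, weight, delay)
        self.potential = [0.0] * n
        self.pending = defaultdict(list)     # arrival_time -> [(dst, w)]

    def connect(self, src, dst, w, delay):
        self.syn.append((src, dst, w, delay))

    def step(self, t, inputs=()):
        for i in inputs:                          # external spikes
            self.potential[i] += self.thr
        for dst, w in self.pending.pop(t, []):    # delayed arrivals
            self.potential[dst] += w
        fired = [i for i, v in enumerate(self.potential) if v >= self.thr]
        for i in fired:
            self.potential[i] = 0.0               # reset on fire
            for src, dst, w, d in self.syn:
                if src == i:
                    self.pending[t + d].append((dst, w))
        return fired

# Two-neuron loop with delays 2 and 3 -> a self-sustaining 5-step oscillator
net = Net(2)
net.connect(0, 1, 1.0, 2)
net.connect(1, 0, 1.0, 3)
spikes = [net.step(t, inputs=(0,) if t == 0 else ()) for t in range(11)]
print(spikes)
```

A single seed spike at t=0 circulates forever through the delay loop, the kind of small temporal structure such processors make cheap to build.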

12:45-13:45 Lunch Break

Above Ash (16th floor)

13:45-14:45 Session 15

Keynote 

13:45
Diffusive and drift memristors for neuromorphic and analog computing

ABSTRACT. Memristors can be broadly categorized into diffusive memristors and drift memristors, based on their reset switching mechanisms. Diffusive memristors reset via the diffusion of mobile species under zero electrical bias, exhibiting dynamics that closely mimic biological ion behavior. This unique characteristic enables efficient neuromorphic computing. In contrast, drift memristors reset through the drift of mobile ions under an electric field, offering highly stable analog resistance levels ideal for constructing neural networks for analog computing. This presentation will highlight recent advancements in memristor devices, arrays, and their application demonstrations, showcasing their potential in emerging computing paradigms.

14:45-14:55 Coffee Break

Located in Amici (4th floor)

14:55-16:15 Session 16

ICRC Technical Session 7: New Frontiers in Quantum Computing

14:55
Superconnectors: A Latency Insensitive Approach to SFQ Design

ABSTRACT. Superconducting logic offers the potential for incredibly high-speed, low-power computation, due to its gate-level clocking and a lack of resistive losses. This results in large, statically scheduled pipelines with clock speeds up to 50 GHz, offering orders of magnitude better throughput than modern digital systems. To fully utilize these pipelines, data must be very carefully orchestrated both outside of the system and within the pipeline itself. However, memory systems, data-dependent operations, and IO introduce timing uncertainty which can cause a significant degradation in throughput and utilization. In the digital domain a rich set of latency insensitive design (LID) principles exists for this problem, but the tight combinational feedback inherent to their operation introduces a new set of challenges when integrated with single flux quanta (SFQ) based designs.

We investigate these challenges by examining two classical methods for LID, the ready/valid protocol and LID-1ss. We show how a naive, direct implementation of these protocols removes much of the benefit of superconducting logic. We then explore how LID-1ss can be optimized for SFQ, resulting in better throughput and a simpler design. However, this optimized version still significantly reduces the maximum potential throughput, motivating us to propose Superconnectors: a novel SFQ-specific architecture for LID. Superconnectors leverage passive transmission line buffers, asynchronous race logic control signals, and batched transactions for a hardware design that has minimal impact on the underlying logic and throughput. We then demonstrate Superconnectors with a merge operation on an array of multipliers, introducing 44% less pipeline stall than the optimized LID-1ss approach with no impact on the achievable clock speed of the underlying module.

15:15
Neural Network Enhanced Robustness for Noisy Quantum Applications

ABSTRACT. Due to the limitations of current NISQ systems, error mitigation strategies are under development to alleviate the negative effects of error-inducing noise on quantum applications. This work proposes the use of machine learning (ML) as an error mitigation strategy, using ML to identify the accurate solutions to a quantum application in the presence of noise. Methods of encoding the probabilistic solution space of a basis-encoded quantum algorithm are researched to identify the characteristics which represent good ML training inputs. A multilayer perceptron artificial neural network (MLP ANN) was trained on the results of 8-state and 16-state basis-encoded quantum applications both in the presence of noise and in noise-free simulation. It is demonstrated using simulated quantum hardware and probabilistic noise models that a sufficiently trained model may identify accurate solutions to a quantum application with over 90% precision and 80% recall on select data. The model makes confident predictions even with enough noise that the solutions cannot be determined by direct observation, and when it cannot, it can identify the inconclusive experiments as candidates for other error mitigation techniques.

15:35
The Dilemma of Random Parameter Initialization and Barren Plateaus in Variational Quantum Algorithms
PRESENTER: Muhammad Kashif

ABSTRACT. This paper presents an easy-to-implement approach to mitigate the challenges posed by barren plateaus (BPs) in randomly initialized parameterized quantum circuits (PQCs) within variational quantum algorithms (VQAs). Recent state-of-the-art research is flooded with a plethora of specialized strategies to overcome BPs; however, our rigorous analysis reveals that these challenging and resource-heavy techniques to tackle BPs may not be required. Instead, a careful selection of the distribution range used to initialize the parameters of PQCs can effectively address this issue without complex modifications. We systematically investigate how different ranges of randomly generated parameters influence the occurrence of BPs in VQAs, providing a straightforward yet effective strategy to significantly mitigate BPs and eventually improve the efficiency and feasibility of VQAs. This method simplifies the implementation process and considerably reduces the computational overhead associated with more complex initialization schemes. Our comprehensive empirical validation demonstrates the viability of this approach, highlighting its potential to make VQAs more accessible and practical for a broader range of quantum computing applications. Additionally, our work provides a clear path forward for quantum algorithm developers seeking to mitigate BPs and unlock the full potential of VQAs.
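
The effect of the initialization range can be illustrated on a classical toy landscape. The sketch below uses the product-of-cosines cost C(theta) = prod_i cos(theta_i), a common stand-in for barren-plateau intuition rather than an actual PQC simulation, and estimates the variance of dC/d(theta_1) for wide versus narrow uniform initialization:

```python
import math, random

def grad_variance(n_params, half_range, samples, rng):
    """Monte Carlo variance of dC/d(theta_1) for the toy cost
    C = prod_i cos(theta_i), with each theta_i drawn uniformly
    from [-half_range, half_range]."""
    grads = []
    for _ in range(samples):
        th = [rng.uniform(-half_range, half_range) for _ in range(n_params)]
        g = -math.sin(th[0])
        for t in th[1:]:
            g *= math.cos(t)
        grads.append(g)
    mean = sum(grads) / samples
    return sum((g - mean) ** 2 for g in grads) / samples

rng = random.Random(0)
v_wide = grad_variance(20, math.pi, 20000, rng)    # full range: plateau
v_narrow = grad_variance(20, 0.1, 20000, rng)      # narrow range: healthy
print(f"wide: {v_wide:.2e}  narrow: {v_narrow:.2e}")
```

For the full range the variance decays like (1/2)^n and is already negligible at 20 parameters, while the narrow range keeps gradients at a trainable scale, which is the mechanism the paper exploits.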

15:55
COMPASS: Compiler Pass Selection for Improving Fidelity of NISQ Applications

ABSTRACT. Noisy qubit devices limit the fidelity of programs executed on near-term or Noisy Intermediate Scale Quantum (NISQ) systems. The fidelity of NISQ applications can be improved by using various optimizations during program compilation (or transpilation). These optimizations or passes are designed to minimize circuit depth (or program duration), steer more computations on devices with the lowest error rates, and reduce the communication overheads involved in performing two-qubit operations between non-adjacent qubits. Additionally, standalone optimizations have been proposed to reduce the impact of crosstalk, measurement, idling, and correlated errors. However, our experiments using real IBM quantum hardware show that using all optimizations simultaneously often leads to sub-optimal performance, and the highest improvement in application fidelity is obtained when only a subset of passes is used. Unfortunately, identifying the optimal pass combination is non-trivial as it depends on application- and device-specific properties. In this paper, we propose COMPASS, an automated software framework for optimal Compiler Pass Selection for quantum programs. COMPASS uses dummy circuits that resemble a given program but are composed of only Clifford gates and thus can be efficiently simulated classically to obtain their correct output. The optimal pass set for the dummy circuit is identified by evaluating the efficacy of different pass combinations, and this set is then used to compile the given program. Our experiments using real IBMQ machines show that COMPASS improves application fidelity by 4.3x on average and by up to 248.8x compared to the baseline. However, the complexity of this search scales exponentially with the number of compiler steps. To overcome this drawback, we propose Efficient COMPASS (E-COMPASS), which leverages a divide-and-conquer approach to split the passes into sub-groups and exhaustively search within each sub-group. Our evaluations show that E-COMPASS improves fidelity by 3.0x on average and by up to 257.1x compared to the baseline, while reducing COMPASS overheads by 200x and up to 327x.
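
The pass-subset search that COMPASS and E-COMPASS perform can be mocked in miniature. Everything below is illustrative: the score function is an invented stand-in for running the Clifford dummy circuit, and the two-group split mirrors the divide-and-conquer idea:

```python
from itertools import chain, combinations

def subsets(items):
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

def best_subset(passes, score):
    """Exhaustive COMPASS-style search: try every pass combination on a
    dummy circuit and keep the best (preferring fewer passes on ties)."""
    return max(subsets(passes), key=lambda s: (score(frozenset(s)), -len(s)))

def best_subset_grouped(groups, score):
    """E-COMPASS-style divide and conquer: search each sub-group
    exhaustively and union the per-group winners."""
    chosen = []
    for g in groups:
        chosen += list(best_subset(g, lambda s: score(frozenset(chosen) | s)))
    return tuple(chosen)

# Mock fidelity score (an invented stand-in): passes a and b help, and
# c hurts when combined with b, an interaction the search must discover.
def score(s):
    v = 0.5 + 0.2 * ('a' in s) + 0.2 * ('b' in s) + 0.05 * ('c' in s)
    if 'b' in s and 'c' in s:
        v -= 0.3
    return v

print(best_subset('abcd', score))                # ('a', 'b')
print(best_subset_grouped(['ab', 'cd'], score))  # ('a', 'b')
```

The exhaustive search evaluates 2^k combinations; splitting k passes into groups of size k/g reduces this to g * 2^(k/g) evaluations, which is the overhead reduction E-COMPASS reports.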

16:15-16:25 Coffee Break

Located in Amici (4th floor)

16:25-17:45 Session 17

ICRC Technical Session 8: New Designs for Hardware Accelerators

16:25
PRACO: A Photonic Residue-Number Architecture Support for Collective Operations
PRESENTER: Jiaxin Peng

ABSTRACT. Collective operations that involve multiple or all processes/threads within a parallel program are critical in multi-core chips and parallel systems, particularly for data-intensive applications. However, many existing electrical network systems struggle to efficiently support these operations, leading to significant overhead and reduced performance. The synergy between photonic technologies and the Residue Number System offers an efficient alternative for computing, owing to its high parallelism and the opportunity for in-network computing.

Here, we propose PRACO, a Photonic Residue-Number Architecture for Collective Operations. It establishes a photonic network at the chip level to accelerate communication, computation, and synchronization among the cores in the system. Additionally, its compute-in-network capability allows for performing computations during data transmission, which minimizes data movement and further reduces latency and power consumption. Moreover, the network enables high-speed synchronization barriers, allowing faster coordination between cores and improved overall system efficiency. Compared to an electrical network, our simulation results show that PRACO achieves up to an order of magnitude speedup for collective operations, along with a significant decrease in energy consumption.
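
The carry-free parallelism that makes the Residue Number System attractive for in-network computation is easy to demonstrate. The moduli below are arbitrary illustrative choices (Python 3.8+ for the three-argument pow):

```python
from math import prod

MODULI = (7, 11, 13, 15)   # pairwise coprime; dynamic range = 15015

def to_rns(x):
    """A number is carried as its residues; each channel (e.g. one
    wavelength in a photonic network) handles one small modulus."""
    return tuple(x % m for m in MODULI)

def rns_add(a, b):
    """Digit-parallel, carry-free addition: each residue is updated
    independently, which is what makes in-network reduction cheap."""
    return tuple((x + y) % m for x, y, m in zip(a, b, MODULI))

def from_rns(r):
    """Chinese Remainder Theorem reconstruction."""
    M = prod(MODULI)
    x = 0
    for ri, mi in zip(r, MODULI):
        Mi = M // mi
        x += ri * Mi * pow(Mi, -1, mi)   # modular inverse of Mi mod mi
    return x % M

a, b = 4321, 8765
print(from_rns(rns_add(to_rns(a), to_rns(b))) == (a + b) % 15015)  # True
```

Because no carries cross residue channels, an addition (or an all-reduce of additions) decomposes into independent small-modulus operations that can proceed in parallel during transmission.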

16:45
Accelerating PageRank Algorithmic Tasks with a new Programmable Hardware Architecture

ABSTRACT. Addressing the growing demands of artificial intelligence (AI) and data analytics requires new computing approaches. In this paper, we propose a reconfigurable hardware accelerator designed specifically for AI and data-intensive applications. Our architecture features a messaging-based intelligent computing scheme that allows for dynamic programming at runtime using a minimal instruction set. To assess our hardware's effectiveness, we conducted a case study in the TSMC 28 nm technology node. The simulation-based study involved analyzing a protein network using the computationally demanding PageRank algorithm. The results demonstrate that our hardware can analyze a 5,000-node protein network in just 213.6 milliseconds over 100 iterations. These outcomes signify the potential of our design to achieve cutting-edge performance in next-generation AI applications.
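
For reference, the PageRank computation being accelerated is the textbook power iteration; the sketch below is the plain software formulation, not the paper's hardware mapping:

```python
def pagerank(adj, d=0.85, iters=100):
    """Power-iteration PageRank over an adjacency list.
    adj[u] lists the nodes u links to; d is the damping factor."""
    n = len(adj)
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - d) / n] * n
        for u, outs in enumerate(adj):
            if outs:
                share = d * rank[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:                      # dangling node: spread uniformly
                for v in range(n):
                    new[v] += d * rank[u] / n
        rank = new
    return rank

# Tiny 4-node graph: node 3 is pointed to by every other node
adj = [[3], [0, 3], [0, 3], []]
r = pagerank(adj)
print(max(range(4), key=r.__getitem__))  # 3
```

Each iteration is a sparse matrix-vector product plus a rescaling, which is why a message-passing hardware fabric maps onto it naturally.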

17:05
SPICEPilot: Navigating SPICE Code Generation and Simulation with AI Guidance

ABSTRACT. Large Language Models (LLMs) have shown great potential in automating code generation; however, their ability to generate accurate circuit-level SPICE code remains limited due to a lack of hardware-specific knowledge. To address this gap, we present SPICEPilot—a novel Python-based dataset generated using PySpice, along with its accompanying framework. This marks a significant step forward in automating SPICE code generation across a variety of circuit configurations. Our framework automates the creation of SPICE simulation scripts, introduces standardized benchmarking metrics to evaluate LLMs' ability to generate circuits, and outlines a roadmap for integrating LLMs into the hardware design process. By bridging the gap between advanced software tools and hardware design, SPICEPilot accelerates innovation in circuit solutions.
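To make the task concrete, the kind of script generation the framework automates can be sketched as a plain Python function that emits a SPICE deck as text. This is an illustrative sketch, not SPICEPilot's or PySpice's actual API; the circuit and component values are arbitrary.

```python
# Illustrative sketch: programmatically emitting a SPICE netlist
# for an RC low-pass filter with an AC analysis directive.
def rc_lowpass_netlist(r_ohm, c_farad, vin=1.0):
    """Return a SPICE deck (as a string) for an RC low-pass filter."""
    return "\n".join([
        "* RC low-pass filter (auto-generated)",
        f"VIN in 0 AC {vin}",          # AC source between node 'in' and ground
        f"R1 in out {r_ohm}",          # series resistor
        f"C1 out 0 {c_farad}",         # shunt capacitor
        ".ac dec 10 1 1e6",            # sweep 1 Hz .. 1 MHz, 10 pts/decade
        ".end",
    ])

deck = rc_lowpass_netlist(1e3, 1e-6)
```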

17:25
Mixture of Experts with Mixture of Precisions for Tuning Quality of Service

ABSTRACT. The increasing demand for deploying large Mixture-of-Experts (MoE) models in resource-constrained environments necessitates efficient approaches to address the challenges of their high memory and computational requirements. Moreover, given that tasks come with different user-defined constraints and the available resources change over time in multi-tenant environments, it is necessary to design an approach that provides a flexible configuration space. This paper presents an adaptive serving approach for the efficient deployment of MoE models, capitalizing on partial quantization of the experts. By dynamically determining the number of quantized experts and their distribution across CPU and GPU, our approach explores the Pareto frontier and offers a fine-grained range of configurations for tuning throughput and model quality. Our evaluation on an NVIDIA A100 GPU using a Mixtral 8x7B MoE model on three language modelling benchmarks demonstrates that the throughput of token generation can be adjusted from 0.63 to 13.00 tokens per second. This enhancement comes with a marginal perplexity increase of 3.81 to 4.00, 13.59 to 14.17, and 7.24 to 7.40 for the WikiText2, PTB, and C4 datasets, respectively, under maximum quantization. These results highlight the practical applicability of our approach in dynamic and accuracy-sensitive applications where both memory usage and output quality are important.
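The configuration-selection idea can be sketched as follows: given a throughput model over the number of quantized experts, pick the fewest experts to quantize that still meet a throughput target, preserving quality where possible. The linear interpolation below is a made-up stand-in for the paper's measured Pareto frontier; only the 0.63 and 13.00 tokens/s endpoints come from the abstract.

```python
# Hedged sketch of throughput-driven expert-quantization selection.
# The linear throughput model is illustrative, not the paper's data.
N_EXPERTS = 8                              # Mixtral 8x7B has 8 experts per layer
TPS_ALL_FP16, TPS_ALL_INT8 = 0.63, 13.00   # endpoints reported in the abstract

def est_throughput(n_quantized):
    """Interpolate tokens/s between the all-fp16 and all-quantized endpoints."""
    frac = n_quantized / N_EXPERTS
    return TPS_ALL_FP16 + frac * (TPS_ALL_INT8 - TPS_ALL_FP16)

def pick_config(target_tps):
    """Smallest number of quantized experts whose estimated
    throughput meets the target; None if unreachable."""
    for n in range(N_EXPERTS + 1):
        if est_throughput(n) >= target_tps:
            return n
    return None
```

A serving layer could re-run `pick_config` whenever the user's latency constraint or the available GPU memory changes, sliding along the throughput/quality trade-off at runtime.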

17:45-18:00 Coffee Break
18:00-18:15 Session 18

Conference Wrap-Up & Awards

18:15-18:45 Session 19

Tutorial: Cryogenic Computing 

18:15
Making CMOS Cool Again – Cryogenic CMOS for High Performance Computing

ABSTRACT. The drive for faster, more energy-efficient computing systems has led to extensive exploration of emerging technologies that go beyond traditional CMOS scaling. Cryogenic computing, which leverages ultra-low temperature environments (typically below 77K) to enhance the performance of electronic devices, has recently gained attention as a promising post-CMOS candidate for high-performance computing (HPC). By operating at cryogenic temperatures, key computing technologies—such as superconducting circuits, quantum systems, and CMOS variants optimized for low-temperature operation—can provide significant improvements in speed, energy efficiency, and reliability. This tutorial will provide a comprehensive overview of cryogenic computing technologies, focusing on their potential as an alternative to conventional CMOS in HPC applications. Participants will gain insight into the advantages of cryogenic operation, the design challenges associated with low-temperature environments, and the current state of research and development in this field.