16:30-18:00 Session 2A: Accelerated Machine Learning (1)
*Eciton: Very Low-Power LSTM Neural Network Accelerator for Predictive Maintenance at the Edge

ABSTRACT. This paper presents Eciton, a very low-power LSTM neural network accelerator for low-power edge sensor nodes, demonstrating real-time processing on predictive maintenance applications with a power consumption of 17 mW under load. Eciton reduces memory and chip resource requirements via 8-bit quantization and hard sigmoid activation, allowing the accelerator as well as the LSTM model parameters to fit in a low-cost, low-power Lattice iCE40 UP5K FPGA. Eciton demonstrates real-time processing with minimal loss of accuracy on two predictive maintenance scenarios with differing characteristics, while achieving superior power efficiency compared to existing work. We also show that the addition of this accelerator actually reduces the power budget of the sensor node by reducing power-hungry wireless transmission. The resulting power budget of the sensor node is small enough to be powered by a power harvester, potentially allowing it to run indefinitely without a battery or periodic maintenance.
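
The two resource-saving techniques named in the abstract, 8-bit quantization and hard sigmoid activation, can be illustrated with a minimal sketch. The slope and offset of the hard sigmoid (x/4 + 1/2) and the quantization scale are common textbook choices assumed here for illustration; the abstract does not state the values Eciton uses.

```python
def quantize_int8(x, scale):
    """Map a real value to a signed 8-bit integer, saturating at the range limits.

    `scale` (real units per integer step) is an assumed, hypothetical parameter.
    """
    q = round(x / scale)
    return max(-128, min(127, q))


def dequantize_int8(q, scale):
    """Recover the approximate real value represented by an 8-bit integer."""
    return q * scale


def hard_sigmoid(x):
    """Piecewise-linear sigmoid approximation: clip(x / 4 + 1 / 2, 0, 1).

    It needs only a shift, an add, and a clamp, with no exponential.
    """
    return max(0.0, min(1.0, 0.25 * x + 0.5))
```

The appeal of this kind of activation on a small FPGA is that it avoids lookup tables or exponentials, at the cost of some approximation error near the knees of the curve.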

FixyFPGA: Efficient FPGA Accelerator for Deep Neural Networks with High Element-Wise Sparsity and without External Memory Access

ABSTRACT. Convolutional neural networks (CNNs) have shown great success in real-time computer vision tasks. CNNs involve a large number of iterative computations and demand highly efficient computing platforms. In recent years, researchers have explored a diverse range of software and hardware optimizations to accelerate CNN inference. The high power consumption of GPUs and the lack of reconfigurability of ASICs have made FPGAs a favorable platform for efficiently accelerating these inference tasks. Various FPGA-based CNN accelerators have been proposed with low-precision weights, pruned networks, sparsity exploitation, and hybrid strategies. However, most of these proposed works require off-chip memory to store the parameters and DSP blocks to perform the computation. Constant communication with off-chip memory limits the performance of the accelerator in two ways: 1) the overall throughput of the design may be limited by memory bandwidth, and 2) off-chip memory access adds a large amount of energy consumption. In this work, we propose a novel approach to accelerate CNN inference without using off-chip memory. The weights of the trained CNN are fixed and used as constant coefficients of the multiplier units. Convolution is performed by streaming the input activations to these constant-coefficient multiplier arrays. With the fully quantized 4-bit sparse MobileNet-V1 and SqueezeNet, we analyzed the performance of the proposed scheme on both image classification and object detection tasks.

*An FPGA-based MobileNet Accelerator Considering Network Structure Characteristics

ABSTRACT. Convolutional neural networks (CNNs) have been widely deployed in computer vision tasks. However, the computation- and resource-intensive nature of CNNs hinders their application on embedded systems. MobileNet, as a representative of compact models, can reduce the number of parameters and the amount of computation. A high-performance inference accelerator on FPGA for MobileNet is proposed in this paper. With respect to the three types of convolution operations, multiple parallel strategies are exploited and the corresponding hardware structures such as input buffer and configurable adder tree are designed. With respect to the bottleneck block, a dedicated architecture is proposed to reduce data transmission time. In addition, a hardware padding scheme to improve the efficiency of padding is proposed. The accelerator implemented on Virtex-7 FPGA reaches 70.8% Top-1 accuracy under 8-bit quantization. The accelerator achieves 302.3 FPS and 181.8 GOPS, which corresponds to 22.7x, 3.9x and 1.4x speedup compared to the implementations on Snapdragon 821 CPU, i7-6700HQ CPU and GTX 960M GPU, respectively.

A Customizable Domain-Specific Memory-Centric FPGA Overlay for Machine Learning Applications

ABSTRACT. Field Programmable Gate Arrays (FPGAs) are poised to support the migration of Machine Learning (ML) into highly networked and distributed Internet-of-Things (IoT) edge devices. The malleability of the compute fabric allows the creation of architectures that match the spectrum of low-precision ML network topologies needed by IoT edge applications. However, poor design productivity and lack of portability remain major obstacles that will hinder the deployment of FPGAs at the IoT edge. To tackle these challenges, this paper presents a software-programmable domain-customizable overlay that brings near-software levels of designer productivity with performance levels that compete with custom designs. The overlay adopts a standard Instruction Set Architecture (ISA) which can be compiled to form any of the three classes of machine learning algorithms: Multilayer Perceptrons (MLPs), Convolutional Neural Networks (CNNs), and Recurrent Neural Networks (RNNs). This allows the overlay to be retargeted in the field without the need to resynthesize the hardware. Performance results show that the domain-specific customizable overlay achieves 1.3x to 8.0x speedup over custom designs. Notably, the overlay is programmable, allowing rapid design space exploration without the need to resynthesize when changing ML algorithms on the FPGA.

DeepFire: Acceleration of Convolutional Spiking Neural Network on Modern Field Programmable Gate Arrays

ABSTRACT. Deep convolutional neural networks (CNNs) have been adopted for various image classification tasks because of their best-in-class accuracy, achievable through hyper-parameter tuning and network architecture search. In contrast, spiking neural networks (SNNs) with their ‘integrate and fire’ (I&F) neurons replace the hardware-intensive multiply-accumulate (MAC) operations in CNNs with add-accumulate operations, which not only makes them easy to implement on FPGAs but also opens up opportunities for energy-efficient hardware acceleration. In this paper, we propose DeepFire, a high-performance RTL IP for accelerating convolutional SNN inference. The IP exploits various resources available on modern FPGAs, and it outperforms existing SNN implementations by more than 10x in terms of both frames per second (FPS) and performance per watt (FPS/Watt). Our design achieves 46.7 kFPS and 27.6 kFPS on the MNIST and CIFAR-10/SVHN datasets with 99.0% and 81.8%/93.1% accuracies respectively on a Xilinx VU9P FPGA running at 425 MHz. The clock frequency can be further increased to as high as 513 MHz for smaller network models.
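
To make the add-accumulate point concrete, here is a minimal sketch of a single integrate-and-fire neuron. It is a generic textbook model, not DeepFire's RTL; the integer weights, the threshold, and the reset-by-subtraction rule are assumptions chosen for illustration.

```python
def integrate_and_fire(input_spikes, weights, threshold):
    """Simulate one I&F neuron over a sequence of time steps.

    input_spikes: list of time steps, each a list of 0/1 values (one per synapse).
    weights:      one integer weight per synapse (hypothetical values).
    Returns the neuron's output spike train as a list of 0/1 values.
    """
    potential = 0
    output = []
    for spikes in input_spikes:
        # Add-accumulate: a weight is added only if its input spiked.
        # Because inputs are binary, no multiplier is needed.
        for spike, weight in zip(spikes, weights):
            if spike:
                potential += weight
        if potential >= threshold:
            output.append(1)
            potential -= threshold  # reset by subtraction (an assumed policy)
        else:
            output.append(0)
    return output
```

For example, `integrate_and_fire([[1, 0], [0, 1], [1, 1], [0, 0]], [2, 3], 4)` accumulates 2, then 5 (fires, leaving 1), then 6 (fires, leaving 2), then stays at 2.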

MP-OPU: A Mixed Precision FPGA-based Overlay Processor for Convolutional Neural Networks

ABSTRACT. Low precision quantization in convolutional neural network (CNN) inference has been proved effective for reducing computation complexity and bandwidth requirement. Mixed precision CNNs manage to benefit from low precision while maintaining accuracy. In this paper, we propose a Mixed Precision FPGA-based Overlay Processor (MP-OPU) to fully leverage the advantages of mixed precision for both conventional and lightweight CNNs. The micro-architecture of MP-OPU considers sharing of computation core with mixed precision weights and activations to improve computation efficiency. In addition, run-time scheduling of external memory access and data arrangement are optimized to further leverage the advantages of mixed precision data representation. Our experimental results show that MP-OPU reaches 4.92 TOPS peak throughput when implemented on the Xilinx VC709 FPGA board (with all DSPs configured to support 2-bit multipliers). Moreover, MP-OPU achieves 12.9× latency reduction and 2.2× better throughput/DSP for conventional CNNs while 7.6× latency reduction and 2.9× better throughput/DSP for lightweight CNNs, all on average compared with existing FPGA accelerators/processors, respectively. To the best of our knowledge, this is the first in-depth study on mixed precision FPGA-based overlay processor for both conventional and lightweight CNNs.

16:30-18:00 Session 2B: Security & Low Power
CHOICE – A Tunable PUF-Design for FPGAs

ABSTRACT. FPGA-based Physical Unclonable Functions (PUFs) have emerged as a viable alternative to permanent key storage by turning inaccuracies during the manufacturing process of a chip into a unique, FPGA-intrinsic secret. However, many fixed PUF designs may suffer from unsatisfactory statistical properties in terms of uniqueness, uniformity, and robustness. Moreover, a PUF signature may change over time due to aging or changing operating conditions, rendering a PUF insecure in the worst case. As a remedy, we propose CHOICE, a novel class of FPGA-based PUF designs with tunable uniqueness and reliability characteristics. By the use of addressable shift registers available on an FPGA, we show that a wide configuration space for adjusting a device-specific PUF response is obtained without any sacrifice of randomness. In particular, we demonstrate the concept of address-tunable propagation delays, whereby we are able to increase or decrease the probability of obtaining '1's in the PUF response. Experimental evaluations on a group of six 28 nm Xilinx Artix-7 FPGAs show that CHOICE PUFs provide a large range of configurations to allow fine-tuning to an average uniqueness between 49% and 51%, while simultaneously achieving bit error rates below 1.5%, thus outperforming state-of-the-art PUF designs. Moreover, with only a single FPGA slice per PUF bit, CHOICE is one of the smallest PUF designs currently available on FPGAs.
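
The uniqueness and bit error rate figures quoted above are conventionally computed from Hamming distances between PUF responses. The sketch below uses the usual definitions (average inter-device distance, and average distance between a reference response and its re-measurements); whether the paper uses exactly these formulas is an assumption, and the function names are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of bit positions in which two equal-length responses differ."""
    return sum(x != y for x, y in zip(a, b))


def uniqueness(responses):
    """Average pairwise inter-device Hamming distance, in percent.

    responses: one bit list per device, all of the same length.
    The ideal value is 50%.
    """
    pairs = list(combinations(responses, 2))
    bits = len(responses[0])
    return 100.0 * sum(hamming(a, b) for a, b in pairs) / (len(pairs) * bits)


def bit_error_rate(reference, remeasurements):
    """Average fraction of bits (in percent) that flip relative to a reference
    response when the same device is measured again."""
    bits = len(reference)
    total = sum(hamming(reference, r) for r in remeasurements)
    return 100.0 * total / (len(remeasurements) * bits)
```

A uniqueness close to 50% means that responses from different devices look statistically independent, while a low bit error rate means a single device reproduces its own response reliably.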

Power-Aware Computing Systems on FPGAs: A Survey

ABSTRACT. One of the major concerns of embedded systems is power-aware computing in terms of battery-operated devices. The power dissipation of such systems is usually considered as a hardware problem but it can be solved effectively through hardware and software implementations of power saving techniques. Such algorithms have proven to be a promising approach to adjust dynamically the power consumption of embedded systems among feasible regions. One of the popular power saving techniques is Dynamic Voltage and Frequency Scaling. In order to reduce the power consumption, it is highly important to have an accurate and fast power monitoring service and satisfactory controllability of the embedded platform. In this work, we survey energy-aware computing platforms in the execution of power-saving algorithms based on different application domains. It intends to summarize recently published research papers related to power-aware computing architectures using FPGAs. Besides, we identify the trends and highlight important future directions for power management techniques and power monitoring services.

HLS-Based HW/SW Co-Design of the Post-Quantum Classic McEliece Cryptosystem

ABSTRACT. While quantum computers are rapidly becoming more powerful, the current cryptographic infrastructure is imminently threatened. In a preventive manner, the U.S. National Institute of Standards and Technology (NIST) has initiated a process to evaluate quantum-resistant cryptosystems, to form the first post-quantum (PQ) cryptographic standard. Classic McEliece (CM) is one of the most prominent cryptosystems considered for standardization in NIST’s PQ cryptography contest. However, its computational cost poses notable challenges to a large fraction of existing computing devices. This work presents an HLS-based, HW/SW co-design acceleration of the CM Key Encapsulation Mechanism (CM KEM). We demonstrate maximum speedups of up to 55.2×, 3.3×, and 8.7× in the CM KEM algorithms of key generation, encapsulation, and decapsulation respectively, compared to a SW-only scalar implementation.

Modeling Attack Resistant Arbiter PUF with Time-Variant Obfuscation Scheme

ABSTRACT. As a promising lightweight hardware security primitive, physical unclonable functions (PUFs) have good application prospects in the field of information security. Strong PUFs, represented by the arbiter PUF, are suitable for the authentication of resource-constrained devices. However, the conventional arbiter PUF is vulnerable to modeling attacks due to its linear structure. In this paper, we propose an arbiter PUF with a time-variant obfuscation scheme (TVO-APUF), which feeds the external random challenges into a linear feedback shift register (LFSR) module to determine the real challenge of the underlying arbiter PUF, thus obfuscating the linear mapping relationship between challenge and response and leading to significant resistance to modeling attacks. In addition, the LFSR module, which has low hardware overhead, can be updated at any time to prevent replay attacks. We implement a 48-stage TVO-APUF on a Xilinx Spartan-6 FPGA board. The experimental results show that the proposed TVO-APUF can effectively resist modeling attacks such as logistic regression (LR), support vector machine (SVM) and evolutionary strategy (ES) with a maximum prediction rate of 53% and slight effects on uniformity, stability and uniqueness.

EnergyNN: Energy estimation for Neural Network Inference tasks on DPUs

ABSTRACT. Convolutional Neural Networks (CNNs) are becoming increasingly popular in embedded applications. Hardware designers have proposed various accelerators to speed up the execution of CNNs on embedded platforms. The Deep Learning Processor Unit (DPU) is one such generic CNN accelerator for Xilinx platforms that can execute any CNN on one or more DPUs configured on an FPGA. In a period of rapid growth in CNN algorithms and the availability of multiple configurations of CNN accelerators (like the DPU), the design space is fast expanding. These CNNs show significant tradeoffs in execution time, power consumption and application performance measured in terms of accuracy. We propose a methodology for energy estimation of a CNN running on a DPU. We build a model based on layerwise measurement of energy consumption for a few CNNs and use this for the prediction of other unseen CNNs. We evaluate our approach using 16 different CNNs with an average prediction error of 9.8%. This approach can be used in various scheduling applications where one can choose from multiple CNNs based on their run-time and performance. We demonstrate its applicability in an object detection scenario while being deployed on a drone.

FPGA Hardware Acceleration Framework for Anomaly-based Intrusion Detection System in IoT

ABSTRACT. This study proposes a versatile framework for real-time Internet of Things (IoT) network intrusion detection using an artificial neural network (ANN) on heterogeneous hardware. With the increase in the volume of exchanged data, IoT networks' security has become a crucial issue. Anomaly-based intrusion detection systems (IDS) using machine learning have recently gained increased popularity, due to their generalization ability to detect new attacks. However, the deployment of anomaly-based AI-assisted IDS for IoT devices is computationally expensive. An efficient hierarchical decision-making scheme for IDS is proposed and evaluated on the new IoT-23 dataset, with improved accuracy over the software-based methods. The inference engine is implemented on the Xilinx FPGA System on a Chip (SoC) hardware platform for high performance, high accuracy attack detection (more than 99.43%). For the resulting implemented design, the processing time of the ANN model on FPGA with an xc7z020clg400 device is 6.6 times and 40.5 times faster than GPU Quadro M2000 and CPU E5-2640 2.60GHz, respectively.

19:00-20:30 Session 3A: Applications
*End-to-End FPGA-based Object Detection Using Pipelined CNN and Non-Maximum Suppression

ABSTRACT. Object detection is an important computer vision task, with many applications in autonomous driving, smart surveillance, robotics, and other domains. Single-shot detectors (SSD) coupled with a convolutional neural network (CNN) for feature extraction can efficiently detect, classify and localize various objects in an input image with very high accuracy. In such systems, the convolution layers extract features and predict the bounding box locations for the detected objects as well as their confidence scores. Then, a non-maximum suppression (NMS) algorithm eliminates partially overlapping boxes and selects the bounding box with the highest score per class. However, these two components are strictly sequential; a conventional NMS algorithm needs to wait for all box predictions to be produced before processing them. This prohibits any overlap between the execution of the convolutional layers and NMS, resulting in significant latency overhead and throughput degradation. In this paper, we present a novel NMS algorithm that alleviates this bottleneck and enables a fully-pipelined hardware implementation. We also implement an end-to-end system for low-latency SSD-MobileNet-V1 object detection, which combines a state-of-the-art deeply-pipelined CNN accelerator with a custom hardware implementation of our novel NMS algorithm. As a result of our new algorithm, the NMS module adds a minimal latency overhead of only 0.13μs to the SSD-MobileNet-V1 convolution layers. Our end-to-end object detection system implemented on an Intel Stratix 10 FPGA runs at a maximum operating frequency of 350 MHz, with a throughput of 609 frames-per-second and an end-to-end batch-1 latency of 2.4 ms. Our system achieves 1.5× higher throughput and 4.4× lower latency compared to the current state-of-the-art SSD-based object detection systems on FPGAs.
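
The sequential dependency described above is visible in a sketch of the conventional greedy NMS algorithm. This is the textbook baseline, not the paper's novel pipelined algorithm; the box format `(x1, y1, x2, y2)` and the 0.5 IoU threshold are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept by conventional greedy NMS (one class).

    The global sort by score is what forces the algorithm to wait until
    every box prediction has been produced.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```

Because `sorted` cannot start until the last score is known, this baseline cannot overlap with the convolution layers that produce the boxes, which is the bottleneck the abstract refers to.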

Performance assessment of FPGAs as HPC accelerators using the FER benchmark
PRESENTER: Enrico Calore

ABSTRACT. Hardware accelerators are nowadays very common in HPC systems, and GPUs are playing a major role in this. More recently, FPGAs have also started to be adopted in a few data centers to accelerate specific workloads, and it is expected that in the near future they will also be available in general-purpose HPC systems. FPGAs are already known to provide interesting performance in several application fields, but estimating their expected performance in the context of typical HPC workloads is not straightforward. To ease this task, in this paper we present FER (FPGA Empirical Roofline), a benchmarking tool able to empirically estimate the computing throughput and memory bandwidth of an FPGA when used as a hardware accelerator in the context of HPC. Using the widely known Roofline Model as a theoretical foundation, FER makes it possible to measure FPGA compute performance and bandwidth upper bounds, and thus to estimate the expected performance of function kernels, developed using high-level synthesis tools, according to their arithmetic intensity. We developed FER using two different high-level paradigms: OmpSs@FPGA and the Xilinx Vitis workflow, which are both promising approaches for porting HPC applications, enabling exploitation of FPGAs as hardware accelerators. In this paper we describe the theoretical model on which the FER benchmark relies, we describe its implementation and we provide performance results measured on a Xilinx Alveo U250 FPGA.
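
The Roofline Model that FER builds on bounds a kernel's attainable performance by the smaller of the compute peak and the product of arithmetic intensity and memory bandwidth. A minimal sketch follows; the function names are hypothetical, and any numbers passed in are placeholders rather than measurements from the paper.

```python
def roofline(peak_gflops, bandwidth_gbs, arithmetic_intensity):
    """Attainable performance in GFLOP/s under the Roofline Model.

    arithmetic_intensity is in FLOP per byte moved to or from memory.
    """
    return min(peak_gflops, arithmetic_intensity * bandwidth_gbs)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity (FLOP/byte) above which a kernel is compute-bound."""
    return peak_gflops / bandwidth_gbs
```

With made-up values of 1000 GFLOP/s peak and 50 GB/s bandwidth, a kernel at 2 FLOP/byte is limited to 100 GFLOP/s by memory, while any kernel above 20 FLOP/byte can in principle reach the compute peak.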

A High-performance Open-channel Open-way NAND Flash Controller Architecture

ABSTRACT. NAND-Flash-based SSDs have been widely employed in diverse computing domains and storage systems due to their higher performance and lower power consumption than HDDs. There have been various studies to explore the internal parallelism inside SSDs, including the channel-way-plane levels of interleaving and the cache mode pipelining. However, most studies are based on simulators or focus on part of the parallelism. In this paper, we present a high-performance open-channel open-way NAND Flash controller supporting all the above parallelism. Several architecture innovations are proposed to improve performance and resource efficiency. Firstly, the controller exposes the multi-channel, multi-way topology with a queue-based asynchronous interface for each way. Secondly, a dual-level command scheduler is integrated to enable the fine-grained way-level interleaving, plane-level interleaving, and cache mode pipelining. Finally, four finite state machines are designed for the classified Flash command groups. Evaluated on an FPGA platform, the maximum bandwidth can reach 1.2GB/s, accounting for 93% of the theoretical bandwidth, 13% higher than the bandwidth utilization of other Flash controllers. The minimum latencies for page reading and programming are 119μs and 2ms respectively, which can be sped up by 1.9x and 3.1x on average with the multi-level parallelism.

Communication-optimized micro-architecture to compute Xcorr scores for peptide identification

ABSTRACT. Database search algorithms that deduce peptides from mass spectrometry (MS) data have tried to improve their computational efficiency to accomplish larger and more complex systems biology studies. Many of these database search algorithms compare MS data against a peptide database by computing a cross-correlation (Xcorr) score for each pair of sparse experimental spectrum and candidate theoretical spectrum vectors. Recent advances in MS instrumentation now enable generation of a large volume of MS data, and modern experiments require searching against an enormous search space. Recent research has shown that the communication costs of computing these scores are a major cost and have been neglected in the design of accelerator-based architectures, which has led to inefficient DRAM accesses and poor input locality due to irregular memory access patterns. In this paper, we propose a communication-optimized micro-architecture to compute the Xcorr score, which utilizes an efficient local cache and a peptide pre-fetching technique to reduce DRAM accesses, and a peptide broadcast bus to allow input reuse. Furthermore, we implemented an efficient bus arbitration scheme to minimize synchronization cost between parallel processing elements. Our simulation results show that the proposed micro-architecture performs 3.1x better than other state-of-the-art architectures, and 148x better than a CPU implementation running on a 3.6 GHz Intel i7-4970 processor with 16GB memory.

An Emulation of Quantum Error-Correction on an FPGA device

ABSTRACT. Quantum computing has enormous potential to address some of the most challenging data processing problems known today. However, engineering challenges associated with realising practical, fault-tolerant quantum computers and the lack of availability of these devices in the medium term mean emulation is a key requirement in the development of next-generation quantum algorithms. Field-Programmable Gate Arrays (FPGAs) are promising hosts for these emulators. This paper presents, so far as the authors are aware, the first emulation of an error-corrected qubit on record. Specifically, it shows that, via either floating- or fixed-point number representation, 11 bits are sufficient to maintain relative accuracy of a 9-qubit Shor code. Measures of quantum information such as Fidelity and Trace Distance are also considered for the given word size range. It is shown how, by exploiting sparsity linked to the errors which arise, operational complexity can be reduced by more than 99%. This reduced-complexity version is used to create an accelerator for Xilinx FPGAs which enables real-time execution of a fault-tolerant, logical qubit.

OpenCL FPGA Optimization guided by memory accesses and roofline model analysis applied to tomography acceleration

ABSTRACT. Backward projection is one of the most time-consuming steps in method-based iterative reconstruction computed tomography. The 3D back-projection memory access pattern is potentially regular enough to efficiently exploit the computation power of acceleration boards based on GPUs or FPGAs. High-level tools like HLS or OpenCL make it easier to account for such particular memory accesses during the design flow without specific hardware IPs. This paper proposes an OpenCL acceleration of the voxel-driven 3D back-projection algorithm on an Arria 10 FPGA. The design flow is based initially on an offline memory access analysis, then iteratively on a performance analysis of each new implementation represented on a Berkeley Roofline model.

By taking advantage of the FPGA's local memory architecture, we have succeeded in designing an efficient pipeline reaching maximum bandwidth with stall-free access, underlining this platform's interest for memory optimization. Our design flow allowed for a significant improvement of our initial algorithm's computational intensity, resulting in better performance on FPGA. It reaches comparable performance to an embedded GPU implementation and other computed tomography algorithms on FPGAs.

19:00-20:30 Session 3B: Performance Modeling, Soft-Cores & Arithmetic
Performance Modeling and FPGA Acceleration of Homomorphic Encrypted Convolution

ABSTRACT. Privacy of data is a critical concern when applying Machine Learning (ML) to domains with sensitive data. Homomorphic encryption (HE), by enabling computations on encrypted data, has emerged as a promising approach to perform inference on ML models such as Convolution Neural Networks (CNNs) in a privacy-preserving manner. However, convolutions over homomorphically encrypted data (HE-convolutions), which account for a significant portion of the total inference latency, are computationally intensive. For convolutions in the unencrypted domain, low latency accelerator designs have been proposed using algorithms such as im2col, frequency domain convolutions, etc. However, developing accelerators for the HE versions of these algorithms is non-trivial. In this work, we develop a unified FPGA design that enables low latency execution of both im2col and frequency domain HE-convolutions. To enable selection of the algorithm for each convolution layer of a CNN, we develop a performance model that takes the security parameters and the dimensions of the convolution layers as input and outputs the computation and resource requirements of the two algorithms. We use the performance model to select the convolution algorithm for each layer of ResNet-50 and obtain the first low latency batch-1 inference accelerator for an HE-CNN targeting FPGAs using HLS. We compare our design against prior techniques on CPUs and show that our accelerator obtains speedups in the range of 3.5x ~ 20.4x in latency.
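
For readers unfamiliar with im2col, the following sketch shows the plaintext (unencrypted) version of the idea: each sliding window is unrolled into a row so that convolution becomes a set of dot products. It covers a single channel with stride 1 and no padding, and it says nothing about how the paper maps the operation onto homomorphically encrypted data.

```python
def im2col(image, k):
    """Unroll every k x k window of a 2D image (list of rows) into one flat row."""
    height, width = len(image), len(image[0])
    rows = []
    for r in range(height - k + 1):
        for c in range(width - k + 1):
            rows.append([image[r + i][c + j] for i in range(k) for j in range(k)])
    return rows


def conv2d_im2col(image, kernel):
    """2D convolution in the CNN sense (no kernel flip), computed as dot
    products between the flattened kernel and the im2col rows."""
    k = len(kernel)
    flat_kernel = [v for row in kernel for v in row]
    out_width = len(image[0]) - k + 1
    products = [
        sum(a * b for a, b in zip(patch, flat_kernel))
        for patch in im2col(image, k)
    ]
    # Reshape the flat list of outputs back into a 2D feature map.
    return [products[i:i + out_width] for i in range(0, len(products), out_width)]
```

The trade-off is that im2col duplicates overlapping pixels in exchange for a regular, matrix-multiplication-like access pattern, which is why it is often compared against frequency-domain approaches on a per-layer basis.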

Modular Inverse for Integers using Fast Constant Time GCD Algorithm and its Applications

ABSTRACT. Modular inversion, the multiplicative inverse of an integer in the ring of integers modulo a prime number, is widely used in public-key cryptography. However, it is one of the most computationally intensive operations, thus, it remains the main performance bottleneck for many cryptographic algorithms.

This paper presents, to the best of the author's knowledge, the first FPGA-based hardware design for computing the multiplicative inverse using the recently proposed fast constant-time Greatest Common Divisor (GCD) algorithm. This paper introduces two distinct design architectures targeting different applications: (a) a full-width design and (b) a sequential design. The presented designs are compact, parameterizable, and scalable in terms of area and speed. The evaluation shows that the proposed designs, which are constant-time and protect against timing-based attacks, outperform existing software and hardware implementations that use other modular inversion techniques. As a specific example, this work presents an evaluation focusing on the use of the multiplicative inverse hardware module to accelerate the ElGamal cryptosystem. The proposed design achieves a speed-up of 90% in the modular inverse calculation and a speed-up of 45% in the overall ElGamal decryption algorithm using our sequential hardware design of the fast constant-time GCD algorithm.

In addition to developing the fast hardware implementation, this work potentially opens up a new direction for designing cryptosystems: the inverse operation is often avoided when designing algorithms, due to its complexity. With the new hardware module, using the inverse becomes more tractable, making it more appealing to use in the design of new cryptosystems.
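
To show where the modular inverse appears in ElGamal decryption, here is a minimal sketch. It deliberately swaps in the classic extended Euclidean algorithm, which is variable-time, instead of the fast constant-time GCD algorithm the paper implements; the function names and the toy parameters in the usage note are assumptions for illustration only.

```python
def mod_inverse(a, p):
    """Multiplicative inverse of a modulo p via the extended Euclidean algorithm.

    Illustration only: this is NOT constant-time and is not the paper's algorithm.
    """
    old_r, r = a % p, p
    old_s, s = 1, 0  # invariant: old_s * a == old_r (mod p)
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_s, s = s, old_s - q * s
    if old_r != 1:
        raise ValueError("a is not invertible modulo p")
    return old_s % p


def elgamal_decrypt(c1, c2, private_key, p):
    """Textbook ElGamal decryption: m = c2 * (c1 ** private_key) ** -1 mod p."""
    shared = pow(c1, private_key, p)
    return (c2 * mod_inverse(shared, p)) % p
```

With the toy parameters p = 23, generator 5, and private key 6, a message encrypted as `c1 = pow(5, k, 23)` and `c2 = m * pow(pow(5, 6, 23), k, 23) % 23` for some ephemeral `k` decrypts back to `m`. The inversion step is the one the abstract reports accelerating.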

Dense FPGA Compute using Signed Byte Tuples

ABSTRACT. The importance of AI to FPGAs has resulted in an increasing number of low-precision hard arithmetic features in newer devices. Many FPGAs, including those from Achronix, Intel, and Xilinx, have significantly increased the density of INT8 and INT9 embedded multipliers. Mainstream devices with these enhanced densities still support the traditional intermediate integer (typically 18-bit) multipliers, with IEEE754 floating point now becoming more prevalent as well.

Recently, Intel introduced the Stratix 10 NX, which is targeted specifically at AI acceleration. This device contains a new type of AI specific DSP Blocks with approximately an order of magnitude higher INT8 density than previous FPGA industry DSP Blocks. Unfortunately, the larger standard FPGA integer precisions are not directly supported. Intel has described some methods of aggregating larger multipliers from the NX Blocks, but these have a smaller dynamic range than typical DSP precisions. Larger multiplications can also be useful for other AI applications, such as found in training.

In this paper, we introduce the concept of signed tuples, which can be used to compose signed multipliers into more useful larger multiplier precision, by leveraging FPGA soft logic inexpensively. We demonstrate several constructions of INT16 multipliers, vectors, and tensors with as few as 2 ALMs per INT16 multiplier, and extend this method to larger multipliers and alternate constructs such as complex multiplication. We show that there is essentially no performance degradation or system fitting impact from our method. The mid-size NX device can support up to 29,700 constructed INT16 multipliers with this approach, which is higher than any other current or announced monolithic die FPGA.
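
As background for why composing larger multipliers from byte-wide ones needs care, the sketch below shows the textbook decomposition of a signed 16-bit product into four byte-wide partial products. It is not the paper's signed-tuple construction: the split used here leaves the low byte unsigned, so the partial products mix signed and unsigned operands. How signed tuples handle this with signed-only multipliers is not described in the abstract and is not reproduced here.

```python
def split_int16(a):
    """Split a signed 16-bit integer into (signed high byte, unsigned low byte),
    so that a == high * 256 + low."""
    return a >> 8, a & 0xFF


def mul_int16_from_bytes(a, b):
    """Multiply two signed 16-bit integers using four byte-wide partial products."""
    ah, al = split_int16(a)
    bh, bl = split_int16(b)
    # (ah*2^8 + al) * (bh*2^8 + bl), expanded term by term.
    return ((ah * bh) << 16) + ((ah * bl + al * bh) << 8) + al * bl
```

For instance, `split_int16(-1)` gives `(-1, 255)`, and `mul_int16_from_bytes(12345, -6789)` equals `12345 * -6789`.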

A RISC-V softcore optimised for exploring custom SIMD instructions

ABSTRACT. This paper presents a high-performance open-source RISC-V (RV32IM) softcore, optimised for exploring custom SIMD instructions. By providing instruction templates for instruction development in HDL/Verilog, efficient FPGA-based instruction accelerators can be developed with few low-level lines of code. A novel, non-standard set of vector instruction types is proposed, which allow simultaneous access to a large number of registers for advanced operations, reducing the instruction count. In order to maximise custom SIMD instruction performance, the design’s memory system is optimised for streaming bandwidth, such as very wide blocks for the last-level cache. The approach is demonstrated on example memory-intensive applications with custom instructions. It also provides insights on the effectiveness of adding FPGA resources in general purpose processors in the form of reconfigurable SIMD instructions.

DO-GPU: A Domain Optimizable General-Purpose Soft GPU

ABSTRACT. Soft GPUs are overlays that implement data parallel SIMT processor architectures typically used by GPUs in FPGA logic and, therefore, can make FPGAs software-programmable while also enabling hardware specialization. Prior soft GPU work (including FGPU, FlexGrip, MIAOW, and SCRATCH) targeted general data parallel workloads as well as optimizations for a particular application domain (e.g., PDL-FGPU specialized for persistent deep learning). This paper proposes a soft GPU development framework that enhances prior general-purpose and domain-optimized soft GPUs. Our evaluation on a set of data parallel workloads shows that the proposed general soft GPU architecture (i) offers an average speedup of 1.8x versus the best prior soft GPUs we know of (i.e., FGPU, PDL-FGPU), (ii) with domain optimizations, provides an additional 118x speedup (on average) over our proposed general soft GPU, (iii) enables building six new domain-optimized soft GPU instances in a matter of days, and (iv) enables quick GPU-like development effort (hours), where code is concise (low 100s of lines) and compiled in seconds without FPGA EDA tools in the loop, assuming an appropriate soft GPU bitstream is already built.

RVfpga: Using a RISC-V Core Targeted to an FPGA in Computer Architecture Education

ABSTRACT. RISC-V FPGA, also written RVfpga, is a freely available course developed by the authors and Imagination Technologies that enables users to understand and use the RISC-V instruction set architecture (ISA), a commercial RISC-V core and system, and the RISC-V ecosystem. RVfpga includes comprehensive instructions, tools, and labs for targeting a commercial RISC-V processor to a field programmable gate array (FPGA) and then using and expanding it to learn about computer architecture, digital design, embedded systems, system-on-chip (SoC) design, and programming. The term RVfpga also refers to the RISC-V SoC, which is based on the open-source SweRVolf SoC and Western Digital’s RISC-V SweRV EH1 core, both of which are part of the Chips Alliance ecosystem. The topics covered include targeting the RISC-V SoC to an FPGA, programming in C and RISC-V assembly, running programs in simulation or, optionally, on hardware, using peripherals and adding new ones to the SoC, and analyzing and modifying the RISC-V core and memory system, including adding new instructions to the core. A follow-on course, RVfpga-SoC, shows how to build a RISC-V SoC from building blocks and then run the Zephyr real-time operating system (RTOS) on it. These RVfpga courses are appropriate for adoption by professors as a one- to three-semester course or for use by industry professionals, researchers, and students. At the completion of the course, users will have a working RISC-V system and hands-on experience exploring and using both the RISC-V SoC and the RISC-V toolchain, including compilers and simulators.