Distributed Recommendation Inference on FPGA Clusters
ABSTRACT. Deep neural networks are widely used in personalized recommendation systems. Such models involve two major components: the memory-bound embedding layer and the computation-bound fully-connected layers. Existing solutions are either slow on both stages or optimize only one of them. To implement recommendation inference efficiently in the context of a real deployment, we design and implement an FPGA cluster optimizing the performance of both stages. To remove the memory bottleneck, we take advantage of the High-Bandwidth Memory (HBM) available on the latest FPGAs for highly concurrent embedding table lookups. To match the required DNN computation throughput, we partition the workload across multiple FPGAs interconnected via a 100 Gbps TCP/IP network. Compared to an optimized CPU baseline (16 vCPU, AVX2-enabled) and a one-node FPGA implementation, our system (four-node version) achieves 28.95× and 7.68× speedup in terms of throughput respectively. The proposed system also guarantees a latency of tens of microseconds per single inference, significantly better than CPU and GPU-based systems which take at least milliseconds.
SyncNN: Evaluating and Accelerating Spiking Neural Networks on FPGAs
ABSTRACT. In this paper, we propose a novel synchronous approach for rate encoding based Spiking Neural Networks (SNNs), which is more hardware friendly than conventional asynchronous approaches. We also design and implement the SyncNN framework to accelerate SNNs on Xilinx ARM-FPGA SoCs in a synchronous fashion. To improve the computation and memory access efficiency, we first quantize the network weights to 16-bit, 8-bit, and 4-bit fixed-point values with the SNN friendly quantization technique. For the encoded neurons that have dynamic and irregular access patterns, we design parameterized compute engines to accelerate their performance on the FPGA, where we explore various parallelization strategies and memory access optimizations. Our experimental results on multiple Xilinx ARM-FPGA SoC boards demonstrate that our SyncNN is scalable to run multiple networks, such as LeNet, Network in Network, and VGG, on various datasets such as MNIST, SVHN, and CIFAR-10. SyncNN not only achieves competitive accuracy (99.6%) but also achieves state-of-the-art performance (13,086 frames per second) for the MNIST dataset.
ABSTRACT. Real-time edge AI systems operating in dynamic environments must be able to learn quickly from streaming input samples without the need to undergo offline model training. We propose an FPGA accelerator for a continual learning model based on streaming linear discriminant analysis (SLDA), which is capable of class-incremental object classification. The proposed SLDA accelerator employs application-specific parallelism, efficient data reuse, resource sharing, and approximate computing strategies to achieve high performance and power efficiency. Additionally, we introduce a new variant of SLDA and discuss the accuracy-hardware efficiency tradeoffs among the SLDA variants. The proposed SLDA accelerator is combined with a Convolutional Neural Network (CNN) which is implemented on the Xilinx DPU, to achieve a full continual learning model capability at nearly the same latency as inference. Experiments based on popular datasets for continual learning, CoRE50 and CUB200, demonstrate that the proposed accelerator for SLDA outperforms the embedded CPU and GPU counterparts, in terms of latency and energy consumption.
Leveraging Fine-grained Structured Sparsity for CNN Inference on Systolic Array Architectures
ABSTRACT. Many prior works have studied the use of field-programmable gate arrays (FPGAs) for accelerating convolutional neural network (CNN) inference. Among these, systolic arrays stand out for effectively leveraging the parallelism in CNNs while achieving good placement and routing quality. However, their performance is often limited by the number of DSP multipliers on an FPGA.
In this paper, we leverage weight sparsity in CNN convolutional and fully-connected layers to improve performance. Concretely, we propose a novel pruning method that can generate fine-grained structured weight sparsity patterns in CNNs that guarantee load balance in the systolic array. To leverage this sparsity, we introduce a novel processing element (PE) for the systolic array. This PE requires moderate on-chip resources, leading to improved scalability. We develop an end-to-end systolic array CNN inference accelerator that leverages our pruning technique and PE on an Intel Arria 10 GX1150 FPGA. On ResNet-50 and VGG-16 trained on the ImageNet dataset, our sparse weight accelerator achieves 2.14 TOPs/s and 1.10 TOPs/s on the convolution and fully-connected layers, respectively. These results translate to 2.71x and 1.59x speed-up compared to a dense systolic array baseline.
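The abstract does not spell out the sparsity pattern, so the following is only a minimal sketch of one common way to obtain load-balanced fine-grained sparsity: keep a fixed number of the largest-magnitude weights in every group of consecutive weights. The function name `prune_balanced`, the group size, and the sample weights are assumptions made for illustration and are not taken from the paper.

```python
def prune_balanced(weights, group_size, keep):
    """Zero all but the `keep` largest-magnitude weights in each group.

    Assumed scheme for illustration: every group of `group_size`
    consecutive weights ends up with the same number of nonzeros,
    so each processing element sees the same amount of work.
    """
    if group_size <= 0 or not 0 <= keep <= group_size:
        raise ValueError("need group_size > 0 and 0 <= keep <= group_size")
    if len(weights) % group_size != 0:
        raise ValueError("weight count must be a multiple of group_size")

    pruned = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Positions of the `keep` largest magnitudes within this group.
        by_magnitude = sorted(range(group_size),
                              key=lambda i: abs(group[i]),
                              reverse=True)
        kept = set(by_magnitude[:keep])
        pruned.extend(w if i in kept else 0.0 for i, w in enumerate(group))
    return pruned


# Hypothetical weights: two groups of four, two survivors per group.
weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.8, 0.01]
pruned = prune_balanced(weights, group_size=4, keep=2)
nonzeros_per_group = [
    sum(1 for w in pruned[s:s + 4] if w != 0.0) for s in range(0, len(pruned), 4)
]
```

Because every group retains the same number of nonzero weights, no lane of the array waits on a denser neighbor, which is the load-balance property the abstract refers to.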
Increasing Flexibility of FPGA-based CNN Accelerators with Dynamic Partial Reconfiguration
ABSTRACT. Convolutional Neural Networks (CNNs) are widely used for image classification and have achieved high accuracy in the last decade. However, they require computationally intensive operations for embedded applications. In recent years, FPGA-based CNN accelerators have been proposed to improve energy efficiency and throughput. While dynamic partial reconfiguration (DPR) is increasingly used in CNN accelerators, the performance of dynamically reconfigurable accelerators is usually lower than the performance of pure static FPGA designs. This work presents a dynamically reconfigurable CNN accelerator architecture that does not sacrifice throughput performance or classification accuracy. The proposed accelerator is composed of reconfigurable macroblocks and dynamically utilizes the device resources according to model parameters. Moreover, we devise a novel approach, to the best of our knowledge, to hide the computations of the pooling layers inside the convolutional layers, thereby further improving throughput. Using the proposed architecture and DPR, different CNN architectures can be realized on the same FPGA with optimized throughput and accuracy. The proposed architecture is evaluated by implementing two different LeNet CNN models trained by different datasets and classifying different classes. Experimental results show that the implemented design achieves higher throughput than current LeNet FPGA accelerators.
Enabling Mixed-Timing NoCs for FPGAs: Reconfigurable Synthesizable Synchronization FIFOs
ABSTRACT. We present an architecture of a reconfigurable high-throughput synthesizable synchronization FIFO for crossing between asynchronous and synchronous timing domains. This FIFO is composed of reconfigurable mix-and-match components. The input and output interfaces are interchangeable for edge-triggered synchronous communication and for the asP* asynchronous pulse-based handshake protocol. The FIFO capacity, data width, synchronizer latency, and interface protocols are independent design parameters. The FIFO is fully synthesizable using widely available standard-cell libraries and a standard ASIC design flow. Our post-layout design can operate at speeds greater than 1.2 giga-transfers per second under worst-case conditions when implemented in a 65nm CMOS process.
A Flexible Multi-Channel Feedback FxLMS Architecture for FPGA Platforms
ABSTRACT. The most widely used algorithm in active noise control (ANC) applications is the filtered-x least mean square (FxLMS). For large-scale systems with multiple inputs and outputs, the computational demand of the FxLMS rises rapidly. Conventional solutions, running on digital signal processors (DSPs), have limitations in parallel computing. In this work, a parameterizable multiple input-multiple output (MIMO) feedback FxLMS architecture for field-programmable gate array (FPGA) platforms is presented that can easily be optimized for high-performance or resource-efficient computation. The implementation is validated in a real-time practical application using an active headrest with 2 x 2 channels. The noise reduction is evaluated over various configurations with increasing performance, and the respective synthesis results are presented. The noise reduction benefits from the additional computational performance gains. An additional configuration with 11 x 11 channels and filter lengths of 2049 at a sample rate of 40 kHz is shown as a benchmark on high-end FPGAs.
Clock Skew Scheduling: Avoiding the Runtime Cost of Mixed-Integer Linear Programming
ABSTRACT. Clock Skew Scheduling has become a common practice in state-of-the-art FPGAs with the introduction of delay chains on the clock path in the hardware of both Xilinx and Intel FPGAs, as well as clock skew scheduling algorithms in the CAD tools. Ideally, globally optimal solutions are sought to find the best solution across the entire design. However, using mixed-integer linear programming to find such optimal solutions has a large and sometimes unrealistic runtime cost, especially for bigger designs. Besides, a high runtime does not necessarily correlate to improved performance. We present, in this paper, techniques to reduce the runtime of mixed-integer linear programming approaches to clock skew scheduling, and provide alternatives that can achieve optimal or near-optimal performance gain with a fraction of the runtime cost.
Post-LUT-Mapping Implementation of General Logic on Carry Chains Via a MIG-Based Circuit Representation
ABSTRACT. Carry chains on FPGAs have traditionally been only used for fast binary arithmetic operations. In this paper, we propose using the carry chain to implement general logic as a means of reducing the critical path delay and raising performance. To achieve this, we use a Majority-Inverter Graph (MIG) to represent the application during technology mapping, since carry functionality directly maps to the majority logic function. This aligns the subject graph of technology mapping with the capabilities of the carry chain. We first map an application to LUTs, then determine a chain of critical LUTs containing paths of majority “gates” that we deem beneficial for mapping onto the carry chain. We place such paths onto the carry chains, with the remaining logic in LUTs. In an experimental study using a suite of benchmarks, we observe that the proposed approach yields a post-place-and-route critical path delay that is superior to using delay-optimized mapping, yet without the significant area penalty. With carry-chain optimizations, area-delay product is improved by 9% vs. baseline LUT mappings.
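The correspondence this abstract relies on, that a full adder's carry-out is the three-input majority of its inputs, can be checked exhaustively. The sketch below is illustrative only; the names `maj3` and `full_adder_carry` are hypothetical and not taken from the paper. It also shows that fixing one majority input to 0 or 1 yields AND or OR, which suggests how a carry cell can stand in for general logic.

```python
from itertools import product


def maj3(a: int, b: int, c: int) -> int:
    """Three-input majority: 1 when at least two inputs are 1."""
    return (a & b) | (a & c) | (b & c)


def full_adder_carry(a: int, b: int, cin: int) -> int:
    """Carry-out of a one-bit full adder, computed arithmetically."""
    return (a + b + cin) >> 1


# The carry-out equals the majority function on all eight input patterns.
carry_matches = all(
    maj3(a, b, c) == full_adder_carry(a, b, c)
    for a, b, c in product((0, 1), repeat=3)
)

# Tying the third input to a constant gives two-input AND and OR.
and_matches = all(maj3(a, b, 0) == (a & b) for a, b in product((0, 1), repeat=2))
or_matches = all(maj3(a, b, 1) == (a | b) for a, b in product((0, 1), repeat=2))
```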
(Short Paper) Exploiting the Correlation between Dependence Distance and Latency in Loop Pipelining
ABSTRACT. High-level synthesis (HLS) automatically transforms high-level programs in a language such as C/C++ into a low-level hardware description. In this context, loop pipelining is a key optimisation method for improving hardware performance. The main performance bottleneck of a pipelined loop is the ratio between two values: the latency of each iteration and the dependence distance of the operations in the loop. These two values are usually not known exactly, so existing HLS schedulers model them independently, which can cause sub-optimal performance. This paper extends state-of-the-art static schedulers with a fully automated pass that exposes and takes advantage of potential correlation between these two values, enabling smaller initiation intervals (II). We use the Microsoft Boogie software verifier to prove the existence of these correlations, which allows HLS tools to automatically find a high-performance hardware solution while maintaining correctness. Our results show that for a certain class of programs, our approach achieves, on average, an 11.1x performance gain at the cost of a 95x area overhead.
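The ratio mentioned in this abstract corresponds to the standard recurrence-constrained bound on the initiation interval: a loop-carried dependence with latency L and distance d requires II >= ceil(L / d). The sketch below is a minimal illustration of that bound, not the paper's scheduler; the function name `recurrence_min_ii` and all latency and distance values are made up for the example.

```python
def recurrence_min_ii(dependences):
    """Smallest II allowed by loop-carried dependences.

    `dependences` is an iterable of (latency, distance) pairs in cycles
    and iterations respectively; each pair requires
    II >= ceil(latency / distance).
    """
    min_ii = 1  # an initiation interval is never below one cycle
    for latency, distance in dependences:
        if latency < 0 or distance <= 0:
            raise ValueError("need latency >= 0 and distance > 0")
        # Integer ceiling division, avoiding floating point.
        min_ii = max(min_ii, (latency + distance - 1) // distance)
    return min_ii


# Modelled independently: worst-case latency (6) paired with the
# worst-case distance (1), even if the two never occur together.
pessimistic_ii = recurrence_min_ii([(6, 1)])

# Modelled jointly (hypothetical correlation): the long-latency path
# only occurs at distance 3, the short-latency path at distance 1.
correlated_ii = recurrence_min_ii([(6, 3), (2, 1)])
```

With these made-up numbers the independent model forces an II of 6, while the correlated model permits an II of 2, which is the kind of gap the abstract describes.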
MAFIA: Machine Learning Acceleration on FPGAs for IoT Applications
ABSTRACT. Recent breakthroughs in ML have produced new classes of models that allow ML inference to run directly on milliwatt-powered IoT devices. On one hand, existing ML-to-FPGA compilers are designed for deep neural-network models on large FPGAs. On the other hand, general-purpose HLS tools fail to exploit properties specific to ML inference, thereby resulting in suboptimal performance. We propose MAFIA, a tool to compile ML inference on small form-factor FPGAs for IoT applications. MAFIA is general and can express a variety of ML algorithms, including state-of-the-art models. We show that MAFIA-generated programs outperform the best-performing variant of a commercial HLS compiler by 2.5X on average.
Koios: A Deep Learning Benchmark Suite for FPGA Architecture and CAD Research
ABSTRACT. With the prevalence of deep learning (DL) in many applications, researchers are investigating different ways of optimizing FPGA architecture and CAD to achieve better quality-of-results (QoR) on DL-based workloads. In this optimization process, benchmark circuits are an essential component; the QoR achieved on a set of benchmarks is the main driver for architecture and CAD design choices. However, current academic benchmark suites are inadequate, as they do not capture any designs from the DL domain. This work presents a new suite of DL acceleration benchmark circuits, for FPGA architecture and CAD research, called Koios. This suite of 19 circuits covers a wide variety of accelerated neural networks, design sizes, implementation styles, abstraction levels, and numerical precisions. These designs are larger, more data parallel, more heterogeneous, more deeply pipelined, and utilize more FPGA architectural features compared to existing open-source benchmarks. This enables researchers to pin-point architectural inefficiencies for this class of workloads and optimize CAD tools on more realistic benchmarks that stress the CAD algorithms in different ways. In this paper, we describe the designs in our benchmark suite, present results of running them through the Verilog-to-Routing (VTR) flow using a recent FPGA architecture model, and identify key insights from the resulting metrics. On average, our benchmarks have 3.7x more netlist primitives, 1.8x and 4.7x higher DSP and BRAM densities, and 1.7x higher frequency with 1.9x more near-critical paths compared to the widely-used VTR suite. Finally, we present two example case studies showing how architectural exploration for DL-optimized FPGAs can be performed using our new benchmark suite.
ABSTRACT. Deep Neural Networks (DNNs) are capable of solving complex problems in domains related to embedded systems, such as image and natural language processing. To efficiently implement DNNs on a specific FPGA platform for a given cost criterion, e.g. energy efficiency, an enormous number of design parameters has to be considered, from the topology down to the final hardware implementation. Interdependencies between the different design layers have to be taken into account and explored efficiently, making it hardly possible to find optimized solutions manually. An automatic, holistic design approach can improve the quality of DNN implementations on FPGA significantly. To this end, we present a cross-layer design space exploration methodology. It comprises optimizations starting from a hardware-aware topology search for DNNs down to the final optimized implementation for a given FPGA platform. The methodology is implemented in our Holistic Auto Machine Learning for FPGAs (HALF) framework, which combines an evolutionary search algorithm, various optimization steps and a library of parametrizable hardware DNN modules. HALF automates both the exploration process and the implementation of optimized solutions on a target FPGA platform for various applications. We demonstrate the performance of HALF on a medical use case for arrhythmia detection for three different design goals, i.e. low-energy, low-power and high-throughput respectively. Our FPGA implementation outperforms a TensorRT optimized model on an Nvidia Jetson platform in both throughput and energy consumption.
MAPLE: A Machine Learning based Aging-Aware FPGA Architecture Exploration Framework
ABSTRACT. Transistor aging poses a vital challenge for the lifetime reliability of FPGAs in advanced technology nodes, which could lead to performance loss and timing failure of FPGA designs over time. Most existing studies use time-consuming transistor-level simulation to analyze the aging impact on FPGA architectures, which is impractical for modern FPGAs with billions of transistors.
In this paper, we develop a framework called MAPLE to enable aging-aware FPGA architecture exploration. The core idea is to efficiently model the aging-induced delay degradation at the coarse-grained FPGA basic block level using deep neural networks (DNNs). For each type of FPGA basic block, such as LUT and DSP, we first characterize its accurate delay degradation via transistor-level SPICE simulation under a versatile set of aging factors from the FPGA fabric and in-field operation. Then we train one DNN model for each block type to quickly and accurately predict the complex relation between its delay degradation and comprehensive aging factors. Moreover, we integrate our DNN models into the widely used Verilog-to-Routing CAD flow (VTR 8) to support analyzing the aging-induced delay degradation impact on entire large-scale FPGA designs. Experimental results demonstrate that the proposed technique can predict the delay degradation of FPGA blocks 10^4 to 10^7 times faster than transistor-level SPICE simulation, with a maximum prediction error of less than 0.7%.
4. Nikolaos Bellas, Christos D. Antonopoulos, Spyros Lalis, Maria-Rafaela Gkeka, Alexandros Patras, Georgios Keramidas, Iakovos Stamoulis, Nikolaos Tavoularis, Stylianos Piperakis, Emmanouil Hourdakis, Panos Trahanias, Paul Zikas, George Papagiannakis, Ioanna Kartsonaki. Architectures for SLAM and Augmented Reality Computing
5. Philipp Kasgen, Mohamed Messelka and Markus Weinhardt. HiPReP: High-Performance Reconfigurable Processor - Architecture and Compiler
6. Jürgen Becker, Leonard Masing, Tobias Dörr, Florian Schade, Georgios Keramidas, Christos P. Antonopoulos, Michail Mavropoulos, Efstratios Tiganourias, Vasilios Kelefouras, Konstantinos Antonopoulos, Nikolaos Voros, Umut Durak, Alexander Ahlbrecht, Wanja Zaeske, Christos Panagiotou, Dimitris Karadimas, Nico Adler, Andreas Sailer, Raphael Weber, Thomas Wilhelm, Florian Oszwald, Dominik Reinhardt, Mohamad Chamas, Adnan Bekan, Graham Smethurst, Fahad Siddiqui, Rafiullah Khan, Vahid Garousi, Sakir Sezer, Victor Morales. XANDAR: X-by-Construction Design framework for Engineering Autonomous & Distributed Real-time Embedded Software Systems
7. Angelos S. Voros, Christos Panagiotou, Stavros Zogas, Georgios Keramidas, Christos P. Antonopoulos, Michael Hubner, Nikolaos S. Voros. The SMART4ALL High Performance Computing Infrastructure: Sharing high end hardware resources via cloud-based microservices