*An FPGA Accelerator of the Wavefront Algorithm for Genomics Pairwise Alignment
ABSTRACT. In the last years, advances in next-generation sequencing technologies have enabled the proliferation of genomic applications that guide personalized medicine. These applications have an enormous computational cost due to the large amount of genomic data they process. The first step in many of these applications consists in aligning reads against a reference genome. Very recently, the wavefront alignment algorithm has been introduced, significantly reducing the execution time of the read alignment process. This paper presents the first FPGA-based hardware/software co-designed accelerator of such relevant algorithm. Compared to the reference WFA CPU-only implementation, the proposed FPGA accelerator achieves performance speedups of up to 13.5x while consuming up to 14.6x less energy.
Minimal Overhead Optical Time-domain Reflectometer Via I/O Integrated Data Converter Enabled by Field Programmabe Voltage Offset
ABSTRACT. As optical reflectometers become more mature technologies, their capabilities are reaching new highs; however, modern designs not only push the performance of the systems, but also the cost and form-factor. As a result, many of the proposed high-performance devices have not been widely adopted for industry use. This paper takes a step back and not only demonstrates a high-performance optical time-domain reflectometer but does so using components that are already in existence in a communication network, such as transmitter optical sub-assembly (TOSA), receiver optical sub-assembly (ROSA), field programmable gate arrays (FPGA) and system-on-chip (SoC) architecture. Thus, it is possible for this design to be integrated in any fiber-optic transceiver in the future. This is enabled by the field programmable voltage offset (FPVO) calibration capabilities of an internal input buffer, typically used to compensate for process variations, which can be used to achieve a reconfigurable and chip-integrated small signal data converter. Using this data converter in an optical time-domain reflectometer, a minimal overhead design is realized with an optical power sensitivity of -52dBm and dynamic range of 23.5dB.
An FPGA-Based Fully Pipelined Bilateral Grid for Real-Time Image Denoising
ABSTRACT. The bilateral filter (BF) is widely used in image processing because it can perform denoising while preserving edges. It has disadvantages in that it is nonlinear, and its computational complexity and hardware resources are directly proportional to its window size. Thus far, several approximation methods and hardware implementations have been proposed to solve these problems. However, it remains difficult to process large-scale and high-resolution images in real time under the severe constraints of hardware resources.
In this paper, we propose a real-time image denoising system that uses an FPGA based on the bilateral grid (BG). In the BG, a 2D image consisting of x- and y-axis is projected onto a 3D space called a "grid," which consists of axes that correlate to the x-component, y-component, and intensity value of the input image. This grid is then blurred using the Gaussian filter, and the output image is generated by interpolating the grid. Although it is possible to change the window size in the BF, it is impossible to change it on the input image in the BG. This makes it difficult to associate the BG with the BF, and to obtain the property of suppressing the increase in hardware resources when the window radius is enlarged.
This paper demonstrates that the BG with a variable-sized window can be realized by introducing the window radius parameter wherein the window radius on the grid is always 1. We then implement this BG on an FPGA in a fully pipelined way. Further, we verify that our design suppresses the increase in hardware resources even when the window size is enlarged and outperforms the existing designs in terms of computation speed and hardware resources.
OpenCL-based FPGA accelerator for semi-global approximate string matching
ABSTRACT. A FPGA accelerator for the computation of the semi-global Levenshtein distance between a pattern and a reference text is presented. The accelerator provides an important benefit to reduce the execution time of read-mappers used in short-read genomic sequencing. Previous attempts to solve the same problem in FPGA use the Myers algorithm following a column approach to compute the dynamic programming table. We use a novel approach based on diagonals that allows for some resource savings while maitaining a very high throughput of 1 alignment per clock cycle. The design is implemented in OpenCL and tested on several FPGA accelerators. The highest obtained performance is 4.2 TCUPS, the highest ever reported.
An FPGA-based Stochastic SAT Solver Leveraging Inter-Variable Dependencies
ABSTRACT. Control rules of Internet of Things (IoT) and embedded systems are often representable in Satisfiability (SAT) problem. This paper proposes a solution search acceleration technique applicable to hardware SAT solvers by leveraging the inter-variable dependencies inherited in SAT-encoded practical applications. Our proposed technique first analyzes the inter-variable dependency of a target SAT representation and then explores tuning parameters of a stochastic SAT algorithm to make efficient fluctuations useful for solution search. In this paper, we applied our proposed dependency-aware acceleration technique to the FPGA implementation of AmoebaSAT, a state-of-the-art stochastic SAT algorithm. Our evaluation on several SAT instances from two different IoT-oriented domains demonstrated a significant speedup in solution search (an average of 2.47x and up to 3.48x against a state-of-the-art hardware SAT solver with almost no effect on hardware implementation compared with the original (dependency-unaware) implementation.
Carnac: Algorithm Variability for Fast Swarm Model-Checking on FPGA
ABSTRACT. The mapping of software verification algorithms on FPGA promise orders of magnitude faster verification. In 2018, the FPGASwarm realization of the swarm verification algorithm, introduced by SPIN model-checker, was shown 900 times faster than its software counterpart. However, this approach missed important optimization opportunities and glossed over algorithmic design-space exploration.
This paper introduces Carnac, a deeply pipelined swarm verification architecture, which by exposing the algorithmic variability points can realize multiple verification algorithms without requiring modifications to the implementation of the model semantics. Furthermore, in this study, we introduce the Mixed Young Frontier-Bounded (MYFB), a new swarm verification algorithm, found through an efficiency-based design-space exploration.
Evaluated on the BEEM benchmark, the MYFB algorithm shows up to 144% efficiency gain over FPGASwarm on 72% of the models. The Carnac architecture runs at 400MHz on Xilinx Ultrascale+ FPGA, and can accommodate twice more verification cores than FPGASwarm. Overall the evaluation shows a 7.58X speedup over FPGASwarm, while enabling an unprecedented scalability on high-end FPGAs.
ABSTRACT. Hybrid memory systems, comprised of emerging non-volatile memory (NVM) and DRAM, have been proposed to address the growing memory demand of applications. Emerging NVM technologies, such as phase-change memories (PCM), memristor, and 3D XPoint, have higher capacity density, minimal static power consumption and lower cost per GB. However, NVM has longer access latency and limited write endurance as opposed to DRAM. The different characteristics of two memory classes point towards the design of hybrid memory systems containing multiple classes of main memory.
In the iterative and incremental development of new architectures, the timeliness of simulation completion is critical to project progression. Hence, a highly efficient simulation method is needed to evaluate the performance of different hybrid memory system designs. Design exploration for hybrid memory systems is challenging, because it requires emulation of the full system stack, including the OS, memory controller, and interconnect. Moreover, benchmark applications for memory performance test typically have much larger working sets, thus taking even longer simulation warm-up period.
In this paper, we propose a FPGA-based hybrid memory system emulation platform. We target at the mobile computing system, which is sensitive to energy consumption and is likely to adopt NVM for its power efficiency. Here, because the focus of our platform is on the design of the hybrid memory system, we leverage the on-board hard IP ARM processors to both improve simulation performance while improving accuracy of the results. Thus, users can implement their data placement/migration policies with the FPGA logic elements and evaluate new designs quickly and effectively. Results show that our emulation platform provides a speedup of 9280x in simulation time compared to the software counterpart gem5.
ABSTRACT. The massive deployment of FPGAs in data centers is opening up new opportunities for accelerating distributed applications. However, developing a distributed FPGA application remains difficult for two reasons. First, commonly available development frameworks (e.g., Xilinx Vitis) lack explicit support for networking. Developers are, thus, forced to build their own infrastructure to handle the data movement between the host, the FPGA, and the network.
Second, distributed applications are made even more complex by using low level interfaces to access the network and process packets. Ideally, one needs to combine high performance with a simple interface for both point-to-point and collective operations.
To overcome these inefficiencies and enable further research in networking and distributed application on FPGAs, we first show how to integrate an open-source 100 Gbps TCP/IP stack into a state-of-the-art FPGA development framework (Xilinx Vitis) without degrading its performance. Further, we provide a set of MPI-like communication primitives for both point-to-point and collective operations as a High Level Synthesis (HLS) library. Our point-to-point primitives saturate a 100 Gbps link and our collective primitives achieve low latency. With our approach, developers can write hardware kernels in high level languages with the network abstracted away behind standard interfaces. To evaluate the ease of use and performance in a real application, we distribute a K-Means algorithm with the new stack and achieve a 1.9X and 3.5X throughput increase with 2 FPGAs and 4 FPGAs respectively.
A Specialized Memory Hierarchy for Stream Aggregation
ABSTRACT. High throughput stream aggregation is essential for many applications that analyze massive volumes of data. Incoming data need to be stored in a sliding window before processing, in case the aggregation functions cannot be computed incrementally. However, this puts tremendous pressure on the memory bandwidth and capacity. GPU and CPU memory management is inefficient for this task as it introduces unnecessary data movement that wastes bandwidth. FPGAs can make more efficient use of their memory but existing approaches employ either only on-chip memory (i.e. SRAM) or only off-chip memory (i.e. DRAM) to store the aggregated values. The high on-chip SRAM bandwidth enables line-rate processing, but only for small problem sizes due to the limited capacity. The larger off-chip DRAM size supports larger problems, but falls short on performance due to lower bandwidth. This paper introduces a specialized memory hierarchy for stream aggregation. It employs multiple memory levels with different characteristics to offer both high bandwidth and capacity. In doing so, larger stream aggregation problems can be supported, i.e. large number of concurrently active keys and large sliding windows, at line-rate performance. A 3-level implementation of the proposed memory hierarchy is used in a reconfigurable stream aggregation dataflow engine (DFE), outperforming existing competing solutions. Compared to designs with only on-chip memory, our approach supports 4 orders of magnitude larger problems. Compared to designs that use only DRAM, our design achieves up to 8x higher processing throughput, thereby reaching line-rate.
Graph Sampling with Fast Random Walker on HBM-enabled FPGA Accelerators
ABSTRACT. Graph Neural Networks (GNNs) have gained increasing popularity among researchers recently and been employed in many applications. Training GNNs introduces a crucial stage called graph sampling. One of the most important sampling algorithms is Random Walk. Along with many of its variants, they share and suffer from the same performance problem caused by random and fragmented memory access pattern, leading to significant system performance degradation.
In this work, we present an efficient graph sampling engine on modern FPGAs integrated with in-package High Bandwidth Memory (HBM), which brings data closer and faster to the core logic. The hardware walker design is modular and easily
scalable for massive parallelism, to fully utilize the available HBM channels. Our design also provides flexibility to support random walk and two of its variants on both homogeneous and heterogeneous graphs. On real-world graph datasets, we achieve
2.42x-6.69x throughput per Joule over highly optimized baselines on a Xeon CPU. We also implement these algorithms on an NVIDIA Tesla V100 GPU to fairly compare the performance of these two HBM-enabled platforms.
ABSTRACT. Network traffic measurement keeps track of the amount of traffic sent by each flow in the network. It is a core functionality in applications such as traffic engineering and network intrusion detection. In high-speed networks, it is impossible to keep exact count of the flow traffic, due to limitations with respect to memory and computational speed. Therefore, probabilistic data structures, such as sketches, are used. This paper proposes Approximate Count-Min sketch or A-CM sketch, a novel variant of the Count-Min sketch algorithm that uses less memory and has a higher throughput compared to other FPGA-based sketch implementations. A-CM sketch relies on optimizations at two levels: (1) it uses approximate counters and the newly proposed Hardware Simple Active Counter (HSAC) algorithm to efficiently implement these counters in hardware; (2) it uses an optimal distribution of the embedded memory, optimized towards maximum operating frequency. A-CM sketch outperforms all other FPGA-based sketch implementations and is the first to process packets at line rate in 200 Gbps networks.
*Turning PathFinder Upside-Down: Exploring FPGA Switch-Blocks by Negotiating Switch Presence
ABSTRACT. Automated switch-block exploration gains in importance as technology scaling
brings more emphasis on the physical constraints, making it insufficient to
rely on abstract measures of routability alone. In this work, we take an approach
that is somewhat radically different from the previously used ones, relying mostly
on general optimization methods: we essentially let the router itself design
the switch-pattern. Of course, letting the router make arbitrary choices would
be rather ineffective, as there would be nothing to prevent it from spreading
routes over many different switches, making it difficult to understand if a particular
one was used because it is essential for proper implementation of a given design,
or simply due to some local, largely irrelevant decision. Instead, we change the usual
method of pricing the nodes in a negotiated-congestion router, by applying the same
principles in the opposite direction, to make it reach a consensus on switches that
are worthy of being included in the final switch-pattern. With this, we obtained a
pattern that outperforms the one reached through simulated annealing optimization
by 10.7% in terms of average routed critical path delay and uses less than half
the number of switches, without compromising routability.
ABSTRACT. In FPGAs, the programmable interconnect is implemented by multiplexers (MUXes), which have a large impact on the area and delay. In academia, large MUXes are extensively used in intra and inter clusters, resulting in significant FPGA area overhead and load for routing wires. In this paper, we model the interconnect from routing wires and CLB feedbacks to LUT inputs as an input block (IB), and implement the IB and the switch block (SB) with the 2-level MUX topology. Applying the 2-level MUX topology in FPGA routing architecture enables us to explore a larger design space for the area and delay, because the 2-level MUX topology can tradeoff between MUX sizes, connectivity degree, and the input bandwidth. We carefully design a baseline 2-level MUX routing architecture and evaluate it by running place and route experiments with VTR benchmarks. To optimize the baseline 2-level MUX routing architecture, we explore one design parameter at a time by keeping others fixed and perform subsequent explorations based on previous optimal design parameters. The results show that the optimized 2-level MUX routing architecture can achieve 19% shorter critical path delay (CPD) at the cost of 3% area overhead compared to the CB-SB FPGA architecture with 1-level MUX topology.
Load Balance-Centric Distributed Parallel Routing for Large-Scale FPGAs
ABSTRACT. Routing is one of the most time-consuming stages in the FPGA design flow.
Parallelization can accelerate the routing process but suffering from load imbalance, further resulting in a low scalability.
In this paper, we propose a load balance-centric parallel router in a distributed computing environment.
First, we explore regular and irregular region partitioning so that routing tasks are assigned to different cores for static load balance before parallel routing.
Second, we explore message propagation and task migration between underloaded and overloaded cores so that load balance can be dynamically maintained at parallel routing runtime.
Finally, we evaluate the effectiveness of the router using large-scale Titan designs.
Results show that our router achieves about 17x speedup on average using 32 cores, compared with the state-of-the-art VTR 8 router.
A Survey on Hypervisor-based Virtualization of Embedded Reconfigurable Systems
ABSTRACT. The increase of size, capabilities, and speed of FPGAs enables the shared usage of reconfigurable resources by multiple applications and even operating systems. While research on FPGA virtualization in HPC-datacenters and cloud is already well-advanced, it is a rather new concept for embedded systems. The necessity for FPGA virtualization of embedded systems results from the trend to integrate multiple environments into the same hardware platform. As multiple guest operating systems with different requirements, e.g., regarding real-time, security, safety, or reliability share the same resources, the focus of research lies on isolation under the constraint of having minimal impact on the overall system. Drivers for this development are, e.g., computation intensive AI-based applications in the automotive or medical field, embedded 5G edge computing systems, or the consolidation of electronic control units (ECUs) on a centralized MPSoC with the goal to increase reliability by reducing complexity. This survey outlines key concepts of hypervisor-based virtualization of embedded reconfigurable systems. Hypervisor approaches are compared and classified into solutions that run on a single FPGA, that virtualize an FPGA tightly coupled with CPU cores on an MPSoC, and into those solutions that virtualize a distributed embedded reconfigurable system. Strong points and limitations are pointed out and future trends for virtualization of embedded reconfigurable systems are identified.
ABSTRACT. Field-Programmable Gate Arrays (FPGAs) have been increasingly deployed in datacenters and there has been a lot of focus on tools that help the development of FPGA applications. However, unlike the software world, there are not many FPGA tools that help analyze the performance of FPGA applications. Also, as the application platforms scale from one FPGA to many FPGAs, such as the well-known Microsoft Catapult platform, it is no longer feasible to rely on low-level FPGA debugging tools such as embedded logic analyzers. We present Pharos, a lightweight, generic performance monitor for multi-FPGA systems. Pharos is capable of measuring unidirectional latency between multiple network-connected FPGAs, as well as measuring throughput and tracing events across the datacenter.
Exploiting the Potential of Approximate Arithmetic in DSP & AI Hardware Accelerators
ABSTRACT. Approximate computing is an emerging design paradigm, which exploits the inherent error resilience of numerous applications to improve their energy efficiency and/or performance. The current paper focuses on applications from the DSP and AI domains, and examines the impact of arithmetic approximations on accelerators for FPGA and ASIC technologies. Based on our design methodology, we implement and evaluate approximate architectures for image processing, signal filtering, telecommunication digital functions, and convolutional neural networks. The evaluation shows that sophisticated bit-level optimizations and disciplined approximations deliver significant gains in the hardware resources and performance of the accelerator in exchange for small errors and tunable accuracy loss.
Offloading Methodologies for Power-Aware Real-time Operating Systems
ABSTRACT. Real-time operating systems (RTOSs) are mainly implemented in software and sequentially executed on processors. The periodic call of the task scheduling service introduces additional overhead in software and eventually leads to jitter. However, the occurring overhead can be shortened or eliminated by using reconfigurable systems. The presented PhD project deals with offloading methodologies of RTOS components considering power dissipation on reconfigurable platforms. For this purpose, the task scheduling of FreeRTOS has already been offloaded to the co-processor and a computing time of 38.9\% has been obtained while scaling voltage and frequency on XC7Z020.
Design For Agility: A Modular Reconfigurable Platform for Heterogeneous Many-Core Architectures
ABSTRACT. Reconfigurable many-core computing platforms are gaining increasing attention for cloud and edge computing because of their high degree of scalability as well as flexibility. Heterogeneous many-core architectures provide more computing capabilities for domain-specific and general-purpose applications. However, bringing heterogeneous and custom computing elements together increases on-chip communication and run-time management complexities. This leads to growing design time and development cost, in addition to lack of platform re-usability. The scope of this PhD work is the design of a modifiable and modular hardware platform that provides a high degree of agility to change types or specifications of computing elements at design and run-time to achieve the best performance for different application demands using the same platform components. Different acceleration strategies and memory hierarchies are supported. In this paper, the proposed platform, preliminary results, and evaluation are presented targeting FPGAs. Finally, planned and future works are highlighted.
A Novel Top to Bottom Toolchain For Generating Virtual Coarse-Grained Reconfigurable Arrays
ABSTRACT. The work, presented in this paper describes a toolchain which was created with the target to generate, configure and evaluate user-customizable Virtual Coarse-Grained Reconfigurable Array (VCGRA) architectures.
The benefit of this toolchain is, that it provides an automatic development and evaluation platform for a VCGRA architecture, including synthesis and execution of the VCGRA with its corresponding hardware configuration and the required interfaces on an FPGA platform.
The toolchain also generates the necessary soft- and hardware modules for the transmission of data and configurations between the processing system (PS) and the VCGRA on reconfigurable hardware.
All the tools for hardware generation and software conversion are functional and linked through interfaces and controllable by a consistent user interface. The modular design of the toolchain also allows various improvements and further application areas to be added, so we are currently in the process to open-source the mentioned tools to share them with a bigger community.
Reconfigurable Computing Systems as Component-oriented Designs for Robotic Applications
ABSTRACT. Modern robotic platforms are increasingly complex due to incorporating various heterogeneous sensors and several actuators generating data at large frequencies.
To cope with this amount of data, most systems are based on embedded computers, but relying on software solutions not fully suited for parallel processing.
FPGAs are an ideal candidate to solve this issue enhancing the computing capabilities of those systems, while still being programmable.
This work studies the components that a Reconfigurable Computing System (RCS) needs to become part of a Robotic System enhancing it by integrating multiple heterogeneous hardware accelerators.
A model-based approach is followed to automatically generate the necessary hardware components, taking into account specifications such as existing message formats to exchange data among all components of the system and those related to each specific applications.
Optimizing Deep Learning Decoders for FPGA Implementation
ABSTRACT. Recently, Deep Learning (DL) methods have been proposed for use in the decoding of linear block codes. While novel DL decoders show promising error correcting performance, they suffer from computational complexity issues, which prevent their usage with large block codes and make their implementation in digital hardware inefficient. The subject of the presented doctoral research is the design of DL decoding methods with low computational complexity and resource requirements, by applying compression and approximation techniques to the employed Neural Networks. Efficient hardware architectures are expected to be designed for these optimized DL decoders on FPGA devices, which will overcome the current performance limitations.
ABSTRACT. Nowadays, the increasing number of processing elements (PEs) in Multiprocessor Systems-on-Chip (MPSoCs) require a scalable on-chip interconnect. Networks-on-Chip (NoCs) have emerged as the most promising communication technology for MPSoCs. In this work, novel routers called Application-Specific Instruction-Set routers (ASIRs) that reduce the communication overhead by integrating a processing layer into the communication layer of MPSoCs and the corresponding flow control of packets defined as Wormhole Computing are presented. The routers are constructed with processing units inside the input buffers providing application-specific operations that can be executed on transferred data. Hence, the communication time can be efficiently used by processing data that is sent by packets through the NoC. A KPN-based computation model has been modified to support not only the mapping of tasks to PEs but also the mapping to ASIRs. In addition to analytical considerations, ASIRs have been evaluated with signal processing applications on a Xilinx Zynq SoC. The analytical considerations prove that a speedup can be achieved by moving a sequence of instructions from processors to ASIRs. The results of both use cases show better exploitation of the communication time by up to 42.8% and a speedup of up to 5x for a single task.
Accelerating Fixed-Point Simulations Using Width Reconfigurable Hardware Architectures
ABSTRACT. Digital Signal Processing (DSP) systems can be described using either fixed-point or floating-point for their numeric representations. As the general computers use floating-point representation, software-based simulation tools used for modeling and simulation of the arithmetic operations in DSP systems, use floating-point representation as well. However, synthesizing customized hardware for fixed-point arithmetic operations for FPGAs or ASICs is more efficient compared to their floating-point counterparts. Thus it is necessary to convert the representation of a floating-point simulated algorithm on MATLAB for example, to a fixed-point representation which is more suitable for hardware implementation. While former approaches for this conversion step have always been software-based, like on MATLAB itself, this paper presents a new approach to show the possibility of accelerating it by using hardware width re-configurable designs.
Towards the Efficient Multi-Platform Execution of Deep Neural Networks
ABSTRACT. Modern Systems-on-Chip (SoCs) based on Field-Programmable Gate Arrays (FPGAs) offer users significant flexibility in deciding the best approach to implement Convolutional Neural Networks (CNNs): a) in a fixed, hardwired general-purpose processor, b) using the programmable logic to implement application-specific processing cores, or c) FPGA overlays.
We propose an automated toolflow that maps Tensorflow/Keras pre-trained models into four different possible platforms: GPU, Dedicated IPs in an FPGA, ARM core, and a soft-core GPU for FPGA. CNNs are heterogeneous, meaning that convolutional layers, for example, will have different resource access and computation requirements as the Fully Connected (FC) layers, hinting that different hardware may be optimal for different layer types.
After evaluating the performance of different CNNs executed in an ARM Cortex-A9 and the soft-core GPU, we found that convolutional layers were 5.9x faster in the soft-core GPU than in the ARM core. As a result, we propose a collaborative execution of CNNs using these two platforms together, achieving a speedup of 11.6x against using only the ARM core. As a consequence, we are exploring other mixes of hardware platforms or even using partial reconfiguration techniques.