| 10:30 | FLICS: FPGA-acceLerated Ingestion of documents into Columnar Storage ABSTRACT. The increasing demand for efficient data processing has led to the development of various data storage formats and processing techniques. Columnar storage formats, such as Apache Parquet, are widely used for their efficiency in handling analytical queries on nested documents. However, incoming data is often encoded in row-based formats, necessitating a transformation into a columnar format before analytical queries can be applied. This transformation is computationally intensive and can introduce significant resource and performance overheads. As a remedy, we introduce FLICS, an FPGA-accelerated solution for parsing and ingesting row-oriented Avro data into the columnar Parquet storage format. FLICS leverages an application-specific instruction-set processor (ASIP) architecture to parse Avro data and convert it into the columnar format. The ASIP is controlled by an instruction program tailored to parse a specific schema. Consequently, any schema modification merely requires loading a new instruction sequence into the instruction memory. Additionally, we introduce a schema compiler that automatically generates these instruction programs, facilitating seamless schema transitions. Our evaluation demonstrates that FLICS achieves up to 3.3 times higher throughput than state-of-the-art software solutions running on a fully utilized 24-core CPU, while using only a small fraction of the available FPGA resources. (An illustrative sketch of the schema-to-instruction-program idea follows the program listing.) |
| 10:50 | A Flexible Open-Source Framework for FPGA-based Network-Attached Accelerators using SpinalHDL ABSTRACT. Domain-specific accelerators are increasingly vital in heterogeneous computing systems, driven by the demand for higher computational capacity and especially energy efficiency. Network-attached FPGAs promise a scalable and flexible alternative to closely coupled FPGAs for integrating accelerators into computing environments. While the advantages of specialized hardware implementations are apparent, traditional hardware development and integration remain time-consuming and complex. We present an open-source framework that combines a hardware shell with supporting software libraries, enabling fast development and deployment of FPGA-based network-attached accelerators. In contrast to traditional approaches using VHDL or Verilog, we leverage generative programming with SpinalHDL, providing a flexible hardware description with multi-level abstractions. This work eases the integration of accelerators into existing network infrastructures and simplifies adaptation to different FPGAs, eliminating complex and lengthy top-level hardware descriptions. |
| 11:10 | FPGA-Based Energy-Efficient Decision Tree Ensemble Accelerator PRESENTER: Unai Fernandez-Urroz ABSTRACT. The growing complexity of artificial intelligence techniques and the increasing size of modern datasets have led to a substantial rise in energy consumption in conventional implementations optimized for CPUs and GPUs. Field Programmable Gate Arrays represent a promising alternative, as they enable customized hardware acceleration that improves both latency and performance while maintaining high energy efficiency. This paper presents an FPGA-based architecture for accelerating tree ensemble inference, with emphasis on energy efficiency and resource optimization. The proposed system integrates a Python-based training workflow that generates a flexible configuration file, enabling adaptable deployment across varying numbers of trees and dataset characteristics. The architecture exploits fine-grained parallelism by executing decision trees concurrently and aggregating their predictions through a hardware-based majority voting scheme. Experimental evaluation on a Zynq-based platform demonstrates low resource utilization, requiring less than 4% of flip-flops and 16% of LUTs, while BRAM becomes the main limiting factor, reaching 70% for a 64-tree configuration. The total power dissipation is approximately 1.6 W, of which only 0.2 W corresponds to the programmable logic implementing the accelerator. Finally, the design achieves inference latency competitive with related FPGA implementations. These results confirm the suitability of the design for energy-constrained embedded AI applications. (An illustrative sketch of the majority-voting step follows the program listing.) |
| 11:30 | OpenEye: A scalable Open-Source Hardware Accelerator for DNNs ABSTRACT. The increasing computational complexity of deep neural network inference poses significant challenges for efficient hardware acceleration on embedded platforms, particularly with respect to resource consumption and scalability. This work presents OpenEye, a scalable and sparsity-aware FPGA-based hardware accelerator designed to efficiently execute common neural network operations such as convolutions, dense layers, and pooling. OpenEye is based on a highly parameterizable architecture composed of clusters of processing elements (PEs) interconnected by a streaming-based dataflow. The paper provides a detailed explanation of the internal operation of the accelerator, including data movement, buffering strategies, control logic, and the coordination between clusters and PEs. The architecture natively supports sparse weights and activations, enabling the efficient processing of sparse data without unnecessary computations or memory accesses. A key design property of OpenEye is its scalability: the number of clusters and processing elements can be varied to adapt the accelerator to different performance and resource constraints. The design achieves a near-linear scaling of routing and interconnect overhead with increasing PE counts, which is essential for maintaining efficiency on large FPGA devices. To evaluate scalability across different design points, multiple OpenEye configurations with varying cluster and PE sizes were implemented on a Xilinx ZU19EG FPGA. Representative neural network operations, including convolutional, fully connected, and pooling layers, were used to analyze resource utilization, execution latency, and scalability behavior. The results show favorable trade-offs between performance and resource consumption across the explored configurations. |
| 11:50 | Research Group Forum: Runtime-Adaptive Cross-Level Techniques for Efficient and Reliable Edge AI ABSTRACT. Deep neural networks (DNNs) have achieved remarkable accuracy in decision-making tasks and are increasingly deployed in safety-critical edge AI applications. While accuracy remains a key requirement in such domains, edge computing platforms face strict constraints in terms of silicon area, power budget, and memory capacity, which limit their ability to accommodate the growing complexity of modern DNN models. Moreover, real-time constraints impose additional trade-offs between inference latency, computational depth, and reliability. Although extensive research has been conducted on cost-efficient reliability mitigation and model optimization techniques, such as pruning and quantization, their effectiveness is gradually saturating due to their reliance on static, one-size-fits-all optimization strategies that lack adaptivity to input, workload, and hardware conditions. To address these challenges, adaptive and cross-level optimization strategies are required to move beyond static, one-size-fits-all designs and enable flexible computation on fixed silicon. This paper presents three complementary techniques targeting cost efficiency and reliability enhancement of DNN inference across the model level, hardware architecture level, and arithmetic level. By jointly exploiting dynamic neural execution, hardware-aware optimization, and adaptive approximate arithmetic, the proposed approaches introduce runtime adaptability to varying input complexity, workload conditions, and hardware constraints. Experimental results demonstrate that each proposed technique significantly improves performance and robustness compared to conventional static and single-level optimization methods, making the approach well suited for safety-critical edge AI applications. |
| 13:05 | Bringing an L3 Load Balancer Closer to the Network: Experiments on the Nvidia Bluefield-3 SmartNIC ABSTRACT. With the advent of programmable network hardware, more functionality can be moved from software running on general-purpose CPUs to the NIC. Early NICs only allowed offloading fixed functions like checksum computation and encryption. Recent NICs like the Nvidia Bluefield-3 allow a fully programmable dataplane. In this paper, we present our first steps towards a load balancer running on the on-path "Accelerated Programmable Pipeline" of the Bluefield-3. Our results show that for small packets the Bluefield-3 does not achieve line rate even with only 2 entries in a Flow Pipe, but we achieve 44% lower latency than a comparable eBPF-based load balancer running on the host, even under high load. Furthermore, we show the capabilities and the limits of the Bluefield-3 "Accelerated Programmable Pipeline". (An illustrative sketch of the load-balancing decision follows the program listing.) |
| 13:25 | Adaptive Task Offloading to BlueField-2 DPU using INTadapt ABSTRACT. Modern High-Performance Computing (HPC) nodes often suffer from resource contention, where Operating System noise, Message Passing Interface (MPI) synchronization, and data movement consume valuable CPU cycles. In conventional systems, network interfaces act as passive data pipes, leaving the intelligent decision-making to the host CPU. Although SmartNICs offering programmable cores are available, integrating them into existing scientific workflows remains challenging due to complex programming models. This paper presents INTadapt, a transparent framework designed to extend StarPU’s scheduling capabilities by enabling intelligent, proactive task offloading to SmartNICs. Our system integrates HALadapt’s reactive history-tracking and memory abstraction layers to dynamically identify network-intensive kernels. These kernels are then migrated to the ARM cores of NVIDIA’s InfiniBand BlueField Data Processing Unit (DPU) using runtime binary interception, allowing existing HPC workloads to leverage DPU acceleration without manual source-code refactoring or specialized DPU programming. Experimental results show that our approach achieves a ~60% reduction in execution latency for offloaded network tasks. Crucially, we report that by offloading these operations, host CPU utilization for communication-related overheads drops to near 0%. This strategic release of hardware resources allows the host to concurrently execute heavy computational kernels, such as scientific math libraries, that would otherwise remain queued. Consequently, we achieve a near-complete computation-communication overlap, transforming the DPU into an active asynchronous worker that enhances the overall throughput of legacy HPC applications. (An illustrative sketch of a history-based offload decision follows the program listing.) |
| 13:45 | Speeding Up Bernstein's Formulas for Permutation Networks ABSTRACT. Permutation networks, such as Beneš and Waksman networks, are pivotal in high-speed data routing for applications such as parallel computing, optical switching, and cryptography, where they enable efficient data shuffling. The performance of these networks hinges on the efficient computation of control bits that configure their switch settings. We propose a vectorized implementation of Bernstein’s verified fast formulas [6] that uses RISC-V Vector Extensions (RVV) to accelerate the computation. Our approach is validated on the Ibex processor paired with the Vicuna vector coprocessor, comparing cycle counts, instruction counts, and memory footprints of non-vectorized and vectorized implementations. The results demonstrate the potential of RVV to optimize permutation network routing for cryptographic applications. This paper provides a historical overview of permutation networks, briefly explains Bernstein’s algorithm, and discusses our vectorization strategy and experimental outcomes. (An illustrative sketch of the switch layer that these control bits configure follows the program listing.) |
| 14:05 | Improving NoC traffic with online reconfiguration PRESENTER: Marc Grau ABSTRACT. Network-on-Chip (NoC) performance is highly sensitive to dynamic traffic patterns and contention, which are difficult to address with static design-time optimizations. This paper presents an online reconfiguration approach to improve NoC traffic efficiency by continuously monitoring runtime behavior and adapting the network accordingly. The proposed system integrates a lightweight RISC-V microcontroller with a set of distributed contention monitoring units, called SafeSU, embedded within the NoC. SafeSU units locally observe contention and traffic conditions with minimal hardware overhead and report relevant metrics to the microcontroller. Based on this information, the controller performs online analysis and triggers reconfiguration actions aimed at mitigating congestion and balancing traffic. The approach enables fine-grained, runtime-aware NoC optimization without disrupting normal operation. Experimental results demonstrate that the proposed monitoring and reconfiguration framework effectively reduces contention and improves overall NoC performance under diverse and dynamic workloads, while maintaining low area and power costs. |
| 14:25 | Research Group Forum: ScalNEXT: Exploring Smart Network Devices for High-Performance Computing ABSTRACT. To meet the demand for more scalable and network-centric data management and control in modern HPC architectures, the ScalNEXT project explores SmartNICs and SmartSwitches, which provide active components in the network fabric. In this paper, we highlight the methods, approaches, and results developed in this project. |
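
For the FLICS talk (10:30), the abstract describes a schema compiler that turns an Avro schema into the instruction program executed by the ASIP. The actual instruction set is not given in the abstract; the sketch below only illustrates the compilation idea, walking a nested schema and emitting one parse instruction per primitive field, with opcode names invented for illustration.

```python
# Hypothetical schema compiler: flattens an Avro-like schema into a linear
# instruction program, one parse step per primitive field. The opcodes are
# made up for illustration; the real FLICS ISA is not described in the abstract.
SCHEMA = {
    "type": "record", "name": "event",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "user", "type": "record", "fields": [
            {"name": "name", "type": "string"},
            {"name": "age", "type": "int"},
        ]},
    ],
}

OPCODES = {"long": "READ_LONG", "int": "READ_INT", "string": "READ_STRING"}

def compile_schema(schema, path=""):
    """Emit one instruction per leaf field; nested records are flattened."""
    program = []
    for field in schema["fields"]:
        full_name = path + field["name"]
        if field["type"] == "record":
            program += compile_schema(field, path=full_name + ".")
        else:
            program.append((OPCODES[field["type"]], full_name))
    return program

for pc, (op, column) in enumerate(compile_schema(SCHEMA)):
    print(pc, op, column)
```

Changing the schema only changes the emitted program, which matches the abstract's point that a schema modification amounts to loading a new instruction sequence.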
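
For the decision-tree ensemble talk (11:10), the abstract states that trees are evaluated concurrently and combined by a hardware majority vote. The minimal Python model below mirrors that flow in software; the flat array encoding of a tree and its field names are assumptions for illustration, not the paper's configuration format.

```python
from collections import Counter

# One toy tree in a flat, array-style encoding (the kind of table a hardware
# tree evaluator could read from a configuration file). feature == -1 marks a leaf.
EXAMPLE_TREE = {
    "feature":   [0,   1,  -1, -1, -1],
    "threshold": [0.5, 0.2, 0.0, 0.0, 0.0],
    "left":      [1,   3,  -1, -1, -1],
    "right":     [2,   4,  -1, -1, -1],
    "label":     [-1, -1,   1,  0,  1],
}

def predict_tree(tree, sample):
    """Walk a single decision tree for one input sample."""
    node = 0
    while tree["feature"][node] != -1:
        if sample[tree["feature"][node]] <= tree["threshold"][node]:
            node = tree["left"][node]
        else:
            node = tree["right"][node]
    return tree["label"][node]

def predict_ensemble(trees, sample):
    """Evaluate every tree (concurrently, in the hardware version) and majority-vote."""
    votes = Counter(predict_tree(tree, sample) for tree in trees)
    return votes.most_common(1)[0][0]

print(predict_ensemble([EXAMPLE_TREE] * 3, [0.7, 0.1]))  # -> 1
```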
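
For the Bluefield-3 load-balancer talk (13:05), the abstract does not detail the match-action rules placed in the Accelerated Programmable Pipeline, so the sketch below only models the core L3 decision in plain Python: hash a flow's 5-tuple so that all packets of a flow reach the same backend. The backend addresses are placeholders, and a hardware pipeline would use a cheaper hash (e.g. CRC or Toeplitz) plus a match table rather than recomputing per packet.

```python
import hashlib

# Placeholder backend pool; these addresses are illustrative only.
BACKENDS = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]

def pick_backend(src_ip, dst_ip, proto, src_port, dst_port):
    """Map a flow's 5-tuple to one backend, consistently per flow."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return BACKENDS[digest % len(BACKENDS)]

print(pick_backend("192.0.2.10", "198.51.100.5", "tcp", 40000, 80))
```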
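
For the INTadapt talk (13:25), the abstract mentions HALadapt-style reactive history tracking to decide which kernels to offload to the DPU's ARM cores. The sketch below shows one plausible, simplified form of such a decision (per-kernel runtime history, pick the faster target); the data structure and function names are assumptions rather than INTadapt's actual interface, and the binary-interception mechanism is not modeled.

```python
from collections import defaultdict

# Hypothetical per-kernel runtime history: measured seconds on each target.
history = defaultdict(lambda: {"host": [], "dpu": []})

def record(kernel, target, seconds):
    """Store one measured runtime for a kernel on a given target."""
    history[kernel][target].append(seconds)

def choose_target(kernel):
    """Pick the target with the lower average observed runtime.
    Targets without measurements are tried first so both sides get profiled."""
    runs = history[kernel]
    for target in ("host", "dpu"):
        if not runs[target]:
            return target
    averages = {t: sum(v) / len(v) for t, v in runs.items()}
    return min(averages, key=averages.get)

record("mpi_allreduce", "host", 0.012)
record("mpi_allreduce", "dpu", 0.005)
print(choose_target("mpi_allreduce"))  # -> dpu
```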
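
For the permutation-network talk (13:45), Bernstein's formulas themselves are not reproduced here. The sketch below only illustrates what the computed control bits drive: a layer of 2x2 switches, each performing a conditional swap, written branch-free over whole arrays (NumPy standing in for the RISC-V vector instructions the paper targets).

```python
import numpy as np

def apply_switch_layer(top, bottom, control):
    """One layer of 2x2 switches: swap pair i iff control[i] == 1.
    The XOR/mask trick is branch-free, the form a vector unit executes
    across an entire register group at once."""
    mask = -control.astype(np.int64)          # 0 -> 0...0, 1 -> 1...1
    diff = (top ^ bottom) & mask
    return top ^ diff, bottom ^ diff

top     = np.array([1, 2, 3, 4], dtype=np.int64)
bottom  = np.array([5, 6, 7, 8], dtype=np.int64)
control = np.array([1, 0, 1, 0], dtype=np.int64)
print(apply_switch_layer(top, bottom, control))
# (array([5, 2, 7, 4]), array([1, 6, 3, 8]))
```

A full Beneš or Waksman network composes 2·log2(N) − 1 such layers with butterfly wiring between them; the control bits computed by Bernstein's formulas select the swap pattern in each layer.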