ARCS 2026: 39TH GI/ITG INTERNATIONAL CONFERENCE ON ARCHITECTURE OF COMPUTING SYSTEMS
PROGRAM FOR WEDNESDAY, MARCH 25TH

09:00-10:00 Session 5: Keynote #2
09:00
Vector Evolution: With RISC-V from HPC to AI

ABSTRACT. Vector architectures have a rich history, traditionally serving as the computational backbone for High-Performance Computing (HPC). Today, as artificial intelligence workloads demand unprecedented data-level parallelism, vector processing is undergoing a powerful renaissance. At the center of this shift is RISC-V. Thanks to its inherent modularity, the RISC-V Vector Extension (RVV) provides a highly flexible foundation for a diverse range of modern architectures. This talk explores the evolutionary leap from traditional vector implementations to multi-dimensional computation. It will discuss the conceptual shift of using vectors as matrix tiles and navigate the complex landscape of RISC-V matrix operation standardization. Drawing from the front lines of ISA development, the talk will highlight the "adventures" and milestones involved in the specification and ratification of the Integrated Matrix Extension (IME). Finally, moving from specification to silicon, we will examine Openchip’s approach to vector, matrix, and System-in-Package (SiP) integration, demonstrating how RISC-V is uniquely positioned to dominate the future of both HPC and AI architectures.

10:00-10:30 Coffee Break
10:30-12:05 Session 6: ARCS26 Main Track 2: FPGA and custom hardware accelerators
10:30
FLICS: FPGA-acceLerated Ingestion of documents into Columnar Storage

ABSTRACT. The increasing demand for efficient data processing has led to the development of various data storage formats and processing techniques. Columnar storage formats, such as Apache Parquet, are widely used for their efficiency in handling analytical queries on nested documents. However, incoming data is often encoded in row-based formats, necessitating a transformation into a columnar format before analytical queries can be applied. This transformation is computationally intensive and can introduce significant resource and performance overheads. As a remedy, we introduce FLICS, an FPGA-accelerated solution for parsing and ingesting row-oriented Avro data into the columnar Parquet storage format. FLICS leverages an application-specific instruction-set processor (ASIP) architecture to parse Avro data and convert it into the columnar format. The ASIP is controlled by an instruction program tailored to parse a specific schema. Consequently, any schema modification merely necessitates loading a new instruction sequence into the instruction memory. Additionally, we introduce a schema compiler that automatically generates these instruction programs, facilitating seamless schema transitions. Our evaluation demonstrates that FLICS achieves up to 3.3 times higher throughput compared to state-of-the-art software solutions running on a fully utilized 24-core CPU, while using only a small fraction of the available FPGA resources.

10:50
A Flexible Open-Source Framework for FPGA-based Network-Attached Accelerators using SpinalHDL

ABSTRACT. Domain-specific accelerators are increasingly vital in heterogeneous computing systems, driven by the demand for higher computational capacity and especially energy efficiency. Network-attached FPGAs promise a scalable and flexible alternative to closely coupled FPGAs for integrating accelerators into computing environments. While the advantages of specialized hardware implementations are apparent, traditional hardware development and integration remain time-consuming and complex.

We present an open-source framework that combines a hardware shell with supporting software libraries, enabling fast development and deployment of FPGA-based network-attached accelerators. In contrast to traditional approaches using VHDL or Verilog, we leverage generative programming with SpinalHDL, providing a flexible hardware description with multi-level abstractions. This work eases the integration of accelerators into existing network infrastructures and simplifies adaptation to different FPGAs, eliminating complex and lengthy top-level hardware descriptions.

11:10
FPGA-Based Energy-Efficient Decision Tree Ensemble Accelerator

ABSTRACT. The growing complexity of artificial intelligence techniques and the increasing size of modern datasets have led to a substantial rise in energy consumption in conventional implementations optimized for CPUs and GPUs. Field Programmable Gate Arrays represent a promising alternative, as they enable customized hardware acceleration that improves both latency and performance while maintaining high energy efficiency. This paper presents an FPGA-based architecture for accelerating tree ensemble inference, with emphasis on energy efficiency and resource optimization. The proposed system integrates a Python-based training workflow that generates a flexible configuration file, enabling adaptable deployment across varying numbers of trees and dataset characteristics. The architecture exploits fine-grained parallelism by executing decision trees concurrently and aggregating their predictions through a hardware-based majority voting scheme. Experimental evaluation on a Zynq-based platform demonstrates low resource utilization, requiring less than 4% of flip-flops and 16% of LUTs, while BRAM becomes the main limiting factor, reaching 70% for a 64-tree configuration. The total power dissipation is approximately 1.6 W, of which only 0.2 W corresponds to the programmable logic implementing the accelerator. Finally, we found competitive inference latency compared to related FPGA implementations. These results confirm the suitability of the design for energy-constrained embedded AI applications.
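The parallel-evaluation and majority-voting scheme summarized above can be sketched in a few lines of Python. This is an illustrative model only, not the paper's hardware design: the tree encoding and all names are assumptions, and the sequential loop stands in for what the FPGA evaluates concurrently.

```python
from collections import Counter

# A tree is a nested tuple (feature_index, threshold, left, right);
# a leaf is just the predicted class label.

def predict_tree(tree, sample):
    # Walk the tree until a leaf (non-tuple) is reached.
    while isinstance(tree, tuple):
        feat, thr, left, right = tree
        tree = left if sample[feat] <= thr else right
    return tree

def predict_ensemble(trees, sample):
    # In hardware the trees evaluate concurrently; here we simply iterate,
    # then aggregate the individual predictions by majority vote.
    votes = [predict_tree(t, sample) for t in trees]
    return Counter(votes).most_common(1)[0][0]

# Toy three-tree ensemble over two features.
trees = [
    (0, 0.5, 1, 0),
    (1, 0.3, 0, 1),
    (0, 0.7, 1, 0),
]
print(predict_ensemble(trees, [0.2, 0.9]))  # votes [1, 1, 1] -> 1
```

In the accelerator, the equivalent of `trees` would come from the Python-based training workflow's configuration file, and the `Counter` step corresponds to the hardware voting circuit.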

11:30
OpenEye: A Scalable Open-Source Hardware Accelerator for DNNs

ABSTRACT. The increasing computational complexity of deep neural network inference poses significant challenges for efficient hardware acceleration on embedded platforms, particularly with respect to resource consumption and scalability. This work presents OpenEye, a scalable and sparsity-aware FPGA-based hardware accelerator designed to efficiently execute common neural network operations such as convolutions, dense layers, and pooling. OpenEye is based on a highly parameterizable architecture composed of clusters of processing elements (PEs) interconnected by a streaming-based dataflow. The paper provides a detailed explanation of the internal operation of the accelerator, including data movement, buffering strategies, control logic, and the coordination between clusters and PEs. The architecture natively supports sparse weights and activations, enabling the efficient processing of sparse data without unnecessary computations or memory accesses. A key design property of OpenEye is its scalability: the number of clusters and processing elements can be varied to adapt the accelerator to different performance and resource constraints. The design achieves a near-linear scaling of routing and interconnect overhead with increasing PE counts, which is essential for maintaining efficiency on large FPGA devices. To evaluate scalability across different design points, multiple OpenEye configurations with varying cluster and PE sizes were implemented on a Xilinx ZU19EG FPGA. Representative neural network operations, including convolutional, fully connected, and pooling layers, were used to analyze resource utilization, execution latency, and scalability behavior. The results show favorable trade-offs between performance and resource consumption across the explored configurations.

11:50
Research Group Forum: Runtime-Adaptive Cross-Level Techniques for Efficient and Reliable Edge AI

ABSTRACT. Deep neural networks (DNNs) have achieved remarkable accuracy in decision-making tasks and are increasingly deployed in safety-critical edge AI applications. While accuracy remains a key requirement in such domains, edge computing platforms face strict constraints in terms of silicon area, power budget, and memory capacity, which limit their ability to accommodate the growing complexity of modern DNN models. Moreover, real-time constraints impose additional trade-offs between inference latency, computational depth, and reliability. Although extensive research has been conducted on cost-efficient reliability mitigation and model optimization techniques, such as pruning and quantization, their effectiveness is gradually saturating due to their reliance on static, one-size-fits-all optimization strategies that lack adaptivity to input, workload, and hardware conditions. To address these challenges, adaptive and cross-level optimization strategies are required to move beyond static, one-size-fits-all designs and enable flexible computation on fixed silicon. This paper presents three complementary techniques targeting cost efficiency and reliability enhancement of DNN inference across the model level, hardware architecture level, and arithmetic level. By jointly exploiting dynamic neural execution, hardware-aware optimization, and adaptive approximate arithmetic, the proposed approaches introduce runtime adaptability to varying input complexity, workload conditions, and hardware constraints. Experimental results demonstrate that each proposed technique significantly improves performance and robustness compared to conventional static and single-level optimization methods, making the approach well suited for safety-critical edge AI applications.

12:05-13:05 Lunch Break
13:05-14:40 Session 7: ARCS26 Main Track 3: Networking, DPUs and interconnections
13:05
Bringing an L3 Load Balancer Closer to the Network: Experiments on the Nvidia Bluefield-3 SmartNIC

ABSTRACT. With the advent of programmable network hardware, more functionality can be moved from software running on general purpose CPUs to the NIC. Early NICs only allowed offloading fixed functions like checksum computation and encryption. Recent NICs like the Nvidia Bluefield-3 allow a fully programmable dataplane. In this paper, we present our first steps towards a load balancer running on the on-path "Accelerated Programmable Pipeline" of the Bluefield-3. Our results show that for small packets the Bluefield-3 does not achieve line rate even with only 2 entries in a Flow Pipe, but we achieve a 44% lower latency compared to a comparable eBPF-based load balancer running on the host, even under high load. Furthermore, we show the capabilities and the limits of the Bluefield-3 "Accelerated Programmable Pipeline".

13:25
Adaptive Task Offloading to BlueField-2 DPU using INTadapt

ABSTRACT. Modern High-Performance Computing (HPC) nodes often suffer from resource contention, where operating system noise, Message Passing Interface (MPI) synchronization, and data movement consume valuable CPU cycles. In conventional systems, network interfaces act as passive data pipes, leaving the intelligent decision-making to the host CPU. Although SmartNICs offering programmable cores are available, integrating them into existing scientific workflows remains challenging due to complex programming models. This paper presents INTadapt, a transparent framework designed to extend StarPU's scheduling capabilities by enabling intelligent, proactive task offloading to SmartNICs. Our system integrates HALadapt's reactive history-tracking and memory abstraction layers to dynamically identify network-intensive kernels. These kernels are then migrated to the ARM cores of NVIDIA's InfiniBand BlueField Data Processing Unit (DPU) using runtime binary interception, allowing existing HPC workloads to leverage DPU acceleration without manual source-code refactoring or specialized DPU programming. Experimental results show that our approach achieves a ~60% reduction in execution latency for offloaded network tasks. Crucially, we report that by offloading these operations, host CPU utilization for communication-related overheads drops to near 0%. This strategic release of hardware resources allows the host to concurrently execute heavy computational kernels, such as scientific math libraries, that would otherwise remain queued. Consequently, we achieve a near-complete computation-communication overlap, transforming the DPU into an active asynchronous worker that enhances the overall throughput of legacy HPC applications.

13:45
Speeding Up Bernstein's Formulas for Permutation Networks

ABSTRACT. Permutation networks, such as Beneš and Waksman networks, are pivotal in high-speed data routing for applications such as parallel computing, optical switching, and cryptography, where they enable efficient data shuffling. The performance of these networks hinges on the efficient computation of control bits that configure their switch settings. We propose a vectorized implementation of Bernstein's verified fast formulas [6] that uses RISC-V Vector Extensions (RVV) to accelerate the computation. Our approach is validated on the Ibex processor paired with the Vicuna vector coprocessor, comparing cycle counts, instruction counts, and memory footprints of non-vectorized and vectorized implementations. The results demonstrate the potential of RVV to optimize permutation network routing for cryptographic applications. This paper provides a historical overview of permutation networks, briefly explains Bernstein's algorithm, and discusses our vectorization strategy and experimental outcomes.

14:05
Improving NoC Traffic with Online Reconfiguration
PRESENTER: Marc Grau

ABSTRACT. Network-on-Chip (NoC) performance is highly sensitive to dynamic traffic patterns and contention, which are difficult to address with static design-time optimizations. This paper presents an online reconfiguration approach to improve NoC traffic efficiency by continuously monitoring runtime behavior and adapting the network accordingly. The proposed system integrates a lightweight RISC-V microcontroller with a set of distributed contention monitoring units, called SafeSU, embedded within the NoC. SafeSU units locally observe contention and traffic conditions with minimal hardware overhead and report relevant metrics to the microcontroller. Based on this information, the controller performs online analysis and triggers reconfiguration actions aimed at mitigating congestion and balancing traffic. The approach enables fine-grained, runtime-aware NoC optimization without disrupting normal operation. Experimental results demonstrate that the proposed monitoring and reconfiguration framework effectively reduces contention and improves overall NoC performance under diverse and dynamic workloads, while maintaining low area and power costs.

14:25
Research Group Forum: ScalNEXT: Exploring Smart Network Devices for High-Performance Computing

ABSTRACT. To meet the demand for more scalable and network-centric data management and control in modern HPC architectures, the ScalNEXT project explores SmartNICs and SmartSwitches, which provide active components in the network fabric. In this paper, we highlight the methods, approaches, and results developed in this project.

14:40-15:00 Coffee Break
15:00-16:00 Session 8: PhD Forum
15:00
A Modular Multi-Sensor Fusion Framework for Autonomous Maritime Navigation

ABSTRACT. Autonomous navigation in maritime environments requires reliable perception and situational awareness under heterogeneous, asynchronous and often incomplete sensor observations. Sensor fusion therefore constitutes a central component for maritime autonomy, yet practical deployment is challenged by environmental variability, real-time constraints and the integration of diverse sensors. This paper outlines an ongoing PhD project that investigates modular, event-driven sensor fusion as a foundation for autonomous maritime navigation, extended by auxiliary components to further increase robustness and accuracy. The doctoral research is motivated by the need for extensible and interpretable perception architectures that can work with a multitude of sensor suites, own-vessel specifications and changing environments. A system-centred methodological approach is adopted, in which a sensor-specific pipeline and a sensor-agnostic module are developed to generate target representations from asynchronous sensor inputs. The paper outlines the context, research questions and methodology of the doctoral project and positions the current implementation and preliminary real-world observations as a foundation for subsequent PhD contributions.

15:15
Real-Time System Awareness and Adaptation for Autonomous Systems

ABSTRACT. The increasing autonomy of technical systems requires robust perception, monitoring, and adaptation capabilities to operate safely in dynamic and partially unknown environments. This PhD research investigates advanced condition monitoring as a key enabler for self-adaptive and self-organizing autonomous systems. The focus lies on the autonomous acquisition, fusion, and interpretation of heterogeneous, high-frequency sensor data to monitor both environmental and internal system states, including sensor degradation and system anomalies.

Particularly in autonomous industrial systems and autonomous driving, real-time state estimation and prediction are essential to enable rapid reactions and prevent failures or accidents. Building on these monitored observations, the research explores mechanisms for fast self-adaptation and system reorganization, enabling decision-making under uncertainty. To address unforeseen situations, the work investigates generalization strategies such as few-shot and zero-shot learning. Additionally, concepts from explainable artificial intelligence (XAI) are incorporated to improve transparency and trustworthiness of autonomous decisions. Overall, the research aims to integrate perception, learning, and control into a cohesive framework for reliable real-time autonomy.

15:30
A Comprehensive Study of Communication Bottlenecks and Transmission Strategies in 3D Network-on-Chips for Heterogeneous Monolithic 3D SoCs

ABSTRACT. State-of-the-art manycore 3D System-on-Chips (SoCs) using monolithic 3D (M3D) integration technology shift the primary performance bottleneck from computation to the communication infrastructure. 3D Network-on-Chips (NoCs) represent a promising communication architecture for M3D SoCs, and as a consequence the optimization pressure increasingly focuses on the design and evaluation of 3D NoCs. M3D integration introduces both substantial challenges and new optimization opportunities due to technology-specific fabrication constraints. The dominant challenge arises from pronounced heterogeneity in maximum clock frequencies, traffic distributions, network topologies, and link delay characteristics, which complicates the design of conventional 3D NoCs. At the same time, the combination of different technology nodes across layers and high-performance vertical links introduces new opportunities for communication optimization. This thesis presents a comprehensive analysis of heterogeneous 3D NoC architectures that explicitly accounts for technology-induced degradations as well as available architectural and physical design options. Both established and emerging transmission techniques and architectures are systematically evaluated with the goal of overcoming performance limitations in the communication network under realistic M3D constraints. Particular attention is devoted to wave pipelining as a promising transmission strategy, demonstrating substantial performance improvements across a wide range of system configurations and traffic conditions.

15:45
Research Group Forum: Towards HW/AI Co-design of Resource-efficient Hardware Accelerators

ABSTRACT. Artificial intelligence (AI) algorithms, such as deep neural networks (DNNs), are typically developed assuming abundant computational resources, which limits their direct applicability on edge hardware, including microcontrollers, ASICs, FPGAs, and/or GPUs. To bridge this gap, this doctoral project investigates a holistic hardware/AI (HW/AI) co-design framework that jointly considers AI model optimizations, hardware-level approximations, and automated design space exploration (DSE) to enable efficient and feasible edge deployment. Within this framework, the research focuses on approximation-aware HW/AI co-design and scalable exploration of the resulting combined design space. AI-level optimizations, such as quantization and pruning, are coordinated with hardware-level approximations, including approximate operators and microarchitectural optimizations, and evaluated through a learning-based DSE process. Current progress includes preliminary HW/AI co-approximation studies, a scalable DSE methodology for approximate accelerator synthesis, and a system-level co-design framework. Together, these contributions demonstrate the feasibility of coordinated HW/AI optimization and scalable exploration, forming a foundation for future extensions toward efficient edge AI deployment across heterogeneous platforms.

16:00-16:30 Coffee Break
16:30-17:30 Session 9: Dependability in the Era of Hardware-Accelerated AI Systems
16:30
Error-Correction Encoding for Covert Channel in Bloom Filter

ABSTRACT. We investigate an indirect covert channel that stores information in a Bloom filter. Each coordinate in the bit vector that represents the secret message is assigned a unique identifying element, and for those positions that contain a value of 1, we insert the identifying element to the Bloom filter. To recover the vector, the Bloom filter is queried for the identifying elements. However, errors may occur: if a queried element is not stored in the filter, then with small probability the filter replies 1 instead of 0. This results in an asymmetric binary channel. We propose an encoding scheme, which in addition to providing error correction for the asymmetric binary channel, strives to minimize the number of 1-bits in the codewords, i.e., it adds as few new elements as possible to the filter to remain stealthy and not increase the error probability further. We compare several approaches to detect or correct a specified number of bit-flips, both analytically and experimentally.
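The channel mechanism described above is concrete enough to sketch. The following is an illustrative model, not the authors' implementation: filter size `M`, hash count `K`, and the `pos-i` identifier scheme are all assumptions. It shows why errors are asymmetric: a false positive can only turn a stored 0 into a read 1, never the reverse.

```python
import hashlib

M = 1024  # filter size in bits (assumed parameter)
K = 4     # number of hash functions (assumed parameter)

def _indexes(element: str):
    # Derive K filter positions from K salted SHA-256 digests.
    return [int(hashlib.sha256(f"{k}:{element}".encode()).hexdigest(), 16) % M
            for k in range(K)]

def insert(filter_bits: list, element: str) -> None:
    for i in _indexes(element):
        filter_bits[i] = 1

def query(filter_bits: list, element: str) -> bool:
    # May return True for an element never inserted (false positive),
    # but never False for an inserted one: the asymmetric binary channel.
    return all(filter_bits[i] for i in _indexes(element))

def encode(secret_bits: list) -> list:
    # Each coordinate i gets the unique identifier "pos-i"; insert it
    # into the filter only where the secret vector holds a 1.
    bf = [0] * M
    for pos, bit in enumerate(secret_bits):
        if bit:
            insert(bf, f"pos-{pos}")
    return bf

def decode(bf: list, length: int) -> list:
    # Recover the vector by membership queries on the identifiers.
    return [1 if query(bf, f"pos-{pos}") else 0 for pos in range(length)]

secret = [1, 0, 1, 1, 0, 0, 1, 0]
print(decode(encode(secret), len(secret)))
```

The paper's error-correction encoding sits on top of this: it adds redundancy to the secret vector while keeping the codewords sparse in 1-bits, since every extra insertion raises the filter's false-positive rate.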

16:50
SPARQ: Spiking Early-Exit Neural Networks for Energy-Efficient Edge AI

ABSTRACT. Spiking neural networks (SNNs) offer inherent energy efficiency due to their event-driven computation model, making them promising for edge AI deployment. However, their practical adoption is limited by the computational overhead of deep architectures and the absence of input-adaptive control. This work presents SPARQ, a unified framework that integrates spiking computation, quantization-aware training, and reinforcement learning-guided early exits for efficient and adaptive inference. Evaluations across MLP, LeNet, and AlexNet architectures demonstrated that the proposed Quantised Dynamic SNNs (QDSNN) consistently outperform conventional SNNs and QSNNs, achieving up to 3.3% higher accuracy, 4× lower energy, and over 90× fewer operations on different datasets. These results validate SPARQ as a hardware-friendly, energy-efficient solution for real-time AI at the edge.
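The input-adaptive early-exit idea behind such frameworks can be sketched generically. This is not SPARQ's method: the reinforcement-learning exit policy is replaced here by a plain confidence threshold, and the two-stage toy model and every name are illustrative assumptions.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def early_exit_inference(stages, x, threshold=0.9):
    """Run network stages in order; stop at the first exit head whose
    top-1 confidence clears the threshold, skipping the remaining
    (costlier) computation. Returns (predicted class, exit depth)."""
    for depth, stage in enumerate(stages, start=1):
        x, logits = stage(x)
        probs = softmax(logits)
        if max(probs) >= threshold:
            return probs.index(max(probs)), depth
    # The final stage always answers, confident or not.
    return probs.index(max(probs)), depth

# Toy two-stage model: stage 1 is confident only for "easy" inputs.
stage1 = lambda x: (x, [5.0, 0.0] if x > 0 else [0.1, 0.2])
stage2 = lambda x: (x, [0.0, 5.0])
print(early_exit_inference([stage1, stage2], 1.0))   # easy input: exits at depth 1
print(early_exit_inference([stage1, stage2], -1.0))  # hard input: runs to depth 2
```

The energy savings reported for such schemes come from the easy inputs that exit early; a learned policy, as in SPARQ, replaces the fixed threshold with a decision conditioned on the input.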

17:10
Are DNA Molecules Fault-Tolerant?

ABSTRACT. The submission is currently an abstract. Please refer to the PDF document.