ARCS 2026: 39TH GI/ITG INTERNATIONAL CONFERENCE ON ARCHITECTURE OF COMPUTING SYSTEMS
PROGRAM FOR THURSDAY, MARCH 26TH

09:00-10:15 Session 10: Organic Computing 2: Learning and Planning
09:00
From Global Objectives to Local Acceptance: Preference-Aware Centralized Planning in Organic Computing Systems

ABSTRACT. The increasing autonomy of coordinated subsystems in centralized organic computing systems amplifies each subsystem's drive to pursue its own goals. A central planner that ignores these local preferences risks being disregarded. This work investigates how local interests can be incorporated into centralized planning while maintaining a focus on global objectives. Using simulation-based experiments, we introduce two distinct behavioral subsystem types that differ in their willingness to comply with plans that deviate from their own objectives. We compare two planning approaches – one based on a greedy heuristic and the other on a constraint optimization problem (COP) solver – and integrate runtime optimization for two global objectives. We analyze how different distributions of subsystem types affect performance across global and local objectives, fairness, and system stability. Our results show that the COP-based approach achieves superior performance in heterogeneous scenarios, whereas both the greedy and COP approaches converge in homogeneous settings. Moreover, the integrated runtime optimization highlights the interplay between local satisfaction and global performance, allowing the central planner to adapt flexibly to varying subsystem demands.
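As a concrete illustration of the COP-style formulation, the sketch below poses a preference-aware assignment for a general-purpose constraint solver. It uses OR-Tools CP-SAT as a stand-in backend; the task/subsystem data, the penalty weight, and all variable names are invented for illustration and are not the paper's model.

```python
# Minimal sketch (not the paper's formulation): assign tasks to subsystems
# while trading off a global objective against local preference penalties.
# Assumes OR-Tools is installed: pip install ortools
from ortools.sat.python import cp_model

tasks, subsystems = range(4), range(3)
cost = [[3, 1, 4], [2, 5, 1], [6, 2, 2], [1, 3, 5]]      # global plan cost
dislike = [[0, 2, 1], [1, 0, 3], [0, 1, 0], [2, 0, 1]]   # local preference penalty

model = cp_model.CpModel()
x = {(t, s): model.NewBoolVar(f"x_{t}_{s}") for t in tasks for s in subsystems}
for t in tasks:                              # every task assigned exactly once
    model.Add(sum(x[t, s] for s in subsystems) == 1)

# Global objective plus weighted local dissatisfaction (weight 2 is arbitrary).
model.Minimize(sum((cost[t][s] + 2 * dislike[t][s]) * x[t, s]
                   for t in tasks for s in subsystems))

solver = cp_model.CpSolver()
if solver.Solve(model) == cp_model.OPTIMAL:
    print([(t, s) for (t, s) in x if solver.Value(x[t, s])])
```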

09:20
MARLCoop: A Simulation Environment for Studying Emergent Cooperation in Multi-Agent Systems

ABSTRACT. Organic Computing systems often consist of self-organizing units that share a dynamic environment and must autonomously coordinate their behavior to achieve individual and collective goals. Typically, the individual subsystems - or agents - use Reinforcement Learning-based controllers to make decisions within their environment. However, existing simulation environments provide limited support for studying decentralized cooperation, incentive mechanisms, and heterogeneous resource sharing in Organic Computing scenarios. This paper introduces MARLCoop, an abstract simulation environment for cooperative multi-agent reinforcement learning scenarios inspired by Organic Computing principles. In MARLCoop, agents control heterogeneous nodes with different sensing, computation, and communication capabilities and must exchange data and computational services to solve time-constrained tasks. The environment integrates Gymnasium and Ray RLlib, enabling scalable experimentation with modern learning algorithms and flexible customization of agent behavior and reward mechanisms. To demonstrate the capabilities of MARLCoop, we conduct a systematic experimental study that evaluates how different training societies influence the emergence of cooperative behavior. Our experiments analyze overall system performance, cooperation metrics, and learned incentive strategies in decentralized settings. Results show that both highly cooperative societies and fully learning-based societies foster strong cooperative behavior, while intermediate configurations may hinder coordination. The findings demonstrate MARLCoop’s potential as a research platform for investigating emergent cooperation, decentralized coordination mechanisms, and adaptive behavior in Organic Computing systems.
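Since MARLCoop builds on the Gymnasium interface, a minimal environment skeleton of that style is sketched below. The class name, spaces, and toy reward are placeholders, not MARLCoop's actual API.

```python
# Illustrative single-agent Gymnasium skeleton of the interface such
# environments expose; observation/action spaces here are placeholders.
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class CoopNodeEnv(gym.Env):
    """One node that decides whether to serve a neighbor's request."""
    def __init__(self, n_neighbors: int = 4):
        self.n_neighbors = n_neighbors
        self.observation_space = spaces.Box(0.0, 1.0, shape=(n_neighbors + 1,),
                                            dtype=np.float32)
        self.action_space = spaces.Discrete(2)   # 0 = refuse, 1 = cooperate

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._load = self.np_random.random(self.n_neighbors + 1).astype(np.float32)
        return self._load, {}

    def step(self, action):
        # Toy reward: cooperation pays off when own load is low.
        reward = float(action) * (1.0 - self._load[0])
        self._load = self.np_random.random(self.n_neighbors + 1).astype(np.float32)
        return self._load, reward, False, False, {}
```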

09:40
Modernising Reinforcement Learning–Based Navigation for Embodied Semantic Scene Graph Generation
PRESENTER: Roman Küble

ABSTRACT. Semantic world models enable embodied agents to reason about objects, relations, and spatial context beyond purely geometric representations. In Organic Computing, such models are a key enabler for objective-driven self-adaptation under uncertainty and resource constraints. The core challenge is to acquire observations maximising model quality and downstream usefulness within a limited action budget.

Semantic scene graphs (SSGs) provide a structured and compact representation for this purpose. However, constructing them within a finite action horizon requires exploration strategies that trade off information gain against navigation cost and decide when additional actions yield diminishing returns.

This work presents a modular navigation module for Embodied Semantic Scene Graph Generation and modernises the decision-making component introduced for this task by upgrading the policy-optimisation method and the discrete action representation. We study compact versus high-resolution discrete action spaces and compare atomic single-head policies to factorised multi-head policies at matched resolution. We evaluate curriculum learning and optional depth-based collision supervision, and assess SSG completeness, execution safety, and navigation behaviour.

Results show that replacing the optimisation algorithm alone improves SSG completeness by 21% relative to the baseline under identical reward shaping. Depth mainly affects execution safety (collision-free motion), while completeness remains largely unchanged. Combining modern optimisation with a high-resolution, factorised action representation yields the strongest overall completeness-efficiency trade-off.
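The factorised multi-head action representation studied above can be sketched as follows; the trunk size, bin counts, and the semantics of the heads are assumptions for illustration, not the paper's architecture.

```python
# Sketch of a factorised multi-head discrete policy: one head per action
# dimension (e.g. turn angle, step length) instead of a single atomic head
# over the full product space. Sizes are illustrative.
import torch
import torch.nn as nn

class FactorisedPolicy(nn.Module):
    def __init__(self, obs_dim=128, bins_per_dim=(16, 8)):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(256, b) for b in bins_per_dim)

    def forward(self, obs):
        h = self.trunk(obs)
        # One categorical distribution per action dimension.
        return [torch.distributions.Categorical(logits=head(h)) for head in self.heads]

policy = FactorisedPolicy()
dists = policy(torch.randn(1, 128))
action = [d.sample() for d in dists]              # e.g. [angle_bin, length_bin]
log_prob = sum(d.log_prob(a) for d, a in zip(dists, action))
```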

10:00
Research Group Forum: A Machine Learning-Centric Approach to Opportunistic Sensor Integration in Organic Computing Systems

ABSTRACT. Technical systems increasingly operate in open, heterogeneous ensembles exposed to dynamic, partly unknown environments. Current machine learning-based perception assumes fixed sensor configurations and static input dimensionality, limiting exploitation of newly available external sources at runtime. This paper examines how Organic Computing systems can opportunistically integrate such dynamic sensors. Using human activity recognition and autonomous maritime navigation as examples, we identify key challenges and benefits, deriving a research agenda for adaptive model architectures, runtime sensor quality/trust assessment, and integration into OC control loops.

10:15-10:35 Coffee Break
10:35-11:55 Session 11: Organic Computing 3: Applications and Experiences
10:35
Self-Organized Charging Infrastructure for VICON-based Crazyflie Drone Swarms
PRESENTER: Oliver Kosak

ABSTRACT. Swarm intelligence and drone swarms are becoming increasingly important in various application domains. Continuous operation of such swarms requires efficient and autonomous charging management. To address this challenge, we propose intelligent ROS2-based charging pads integrated into the VICON indoor tracking and positioning system we use for operating drone swarms. We present a self-organizing takeoff and landing protocol based on local interaction between charging pads and drones. The contribution includes a new identification mechanism for active single-marker infrared LED tracking in VICON, which we deploy on both drones and charging pads to enable stable indoor positioning of both. We use this mechanism, together with dynamic routing tables, to allow clusters to form dynamically and charging to be managed even when the number of charging pads changes at runtime. To evaluate our approach, we use Crazyflie drones equipped with a self-developed single infrared LED module in combination with self-developed Qi charging pads equipped with multiple infrared LEDs in a broad set of real-world experiments. The results indicate stable and efficient charging management for drone swarms in dynamic indoor environments.
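A toy sketch of the dynamic-routing-table idea follows; it is explicitly not the paper's protocol, only an illustration of how pad availability can be tracked and expired at runtime.

```python
# Toy sketch (not the paper's protocol): charging pads announce themselves
# periodically; entries expire when a pad disappears, so the set of usable
# pads adapts at runtime.
import time

PAD_TIMEOUT_S = 5.0

class PadTable:
    def __init__(self):
        self._pads = {}                      # pad_id -> (position, last_seen)

    def on_announce(self, pad_id, position):
        self._pads[pad_id] = (position, time.monotonic())

    def usable_pads(self):
        now = time.monotonic()
        self._pads = {p: (pos, t) for p, (pos, t) in self._pads.items()
                      if now - t < PAD_TIMEOUT_S}
        return {p: pos for p, (pos, _) in self._pads.items()}

table = PadTable()
table.on_announce("pad-3", (1.0, 2.0, 0.0))
print(table.usable_pads())                   # {'pad-3': (1.0, 2.0, 0.0)}
```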

10:55
Uncertainty-Aware Transformer-Based Representation Learning for Vibration-Based Motor Speed Estimation

ABSTRACT. Condition monitoring of rotating machinery critically depends on the ability to infer internal system states from indirect measurements. Among available sensing modalities, vibration signals provide a uniquely information-rich and non-invasive view into mechanical dynamics, capturing subtle changes caused by load variations, structural resonances, and early-stage degradation. Estimating motor speed in revolutions per minute from high-frequency vibration data is therefore both essential and challenging due to noise, nonlinearities, and mismatched temporal resolutions between signals and labels. This work evaluates a three-stage framework combining feature extraction, representation learning using autoencoders, and deep learning regression for motor speed estimation on the Paderborn Bearing Dataset. Learned latent representations are assessed using a fixed regression head to enable a controlled comparison across models. Multiple regressors, including Long Short-Term Memory, Recurrent Neural Network, Convolutional Neural Network, Temporal Convolutional Network, Informer, and TimesNet, are evaluated on identical latent spaces. Predictive uncertainty is quantified using Monte Carlo Dropout, and percentile-based out-of-distribution detection is applied to demonstrate practical applicability. Results show that representation learning significantly improves the performance and stability of classical sequential models, while transformer-based approaches exhibit robust accuracy across all preprocessing strategies.
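Monte Carlo Dropout, as used above for uncertainty quantification, amounts to keeping dropout stochastic at inference and aggregating repeated forward passes. A minimal sketch (the model and shapes are placeholders, not the paper's regressors):

```python
# Sketch of Monte Carlo Dropout: dropout stays active at inference and
# repeated stochastic passes yield a predictive mean plus an uncertainty
# estimate. Model and shapes are illustrative only.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(),
                      nn.Dropout(p=0.2), nn.Linear(128, 1))

def mc_dropout_predict(model, x, n_samples=50):
    model.train()                    # keep dropout stochastic at inference
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)   # mean speed, uncertainty

mean, std = mc_dropout_predict(model, torch.randn(8, 64))
```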

11:15
Online Team Formation in a Highly Flexible Multi-Robot Production Cell
PRESENTER: Christian Lehner

ABSTRACT. The growing demand for individualized mass customization requires production systems that can flexibly adapt to changing products, resources, and operating conditions. Building on previous work on self-organizing manufacturing systems and integrating the goals of Organic Computing, the research project HiFlex aims to design and implement a control system for the factory of the future. This paper presents an approach for autonomously allocating teams of robots to assembly steps derived from arbitrary assembly plans and for planning the actions required to execute these steps from an arbitrary initial state. Team allocation and action planning are formulated as constraint satisfaction and optimization problems that can be solved by a general-purpose constraint solver. An evaluation on a simulated production cell shows that the system autonomously finds feasible solutions within acceptable computation times under different conditions, including limited robot availability. These results indicate that constraint-based team allocation is a promising option for self-organizing manufacturing cells.
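A minimal sketch of team formation as a constraint problem, using OR-Tools CP-SAT as a stand-in for the general-purpose solver; the robots, capabilities, and constraints are invented and do not reflect the HiFlex model.

```python
# Illustrative team-formation CSP (not the HiFlex model): pick a team of
# robots per assembly step such that required capabilities are covered and
# no robot is double-booked. Data is made up.
from ortools.sat.python import cp_model

robots = {"r1": {"grip", "screw"}, "r2": {"grip"}, "r3": {"weld", "screw"}}
steps = {"s1": {"grip", "screw"}, "s2": {"weld"}}   # concurrent steps

model = cp_model.CpModel()
assign = {(r, s): model.NewBoolVar(f"{r}@{s}") for r in robots for s in steps}

for s, needed in steps.items():                  # cover every required capability
    for cap in needed:
        model.AddBoolOr([assign[r, s] for r in robots if cap in robots[r]])
for r in robots:                                 # a robot joins at most one team
    model.Add(sum(assign[r, s] for s in steps) <= 1)

model.Minimize(sum(assign.values()))             # smallest total team size
solver = cp_model.CpSolver()
if solver.Solve(model) == cp_model.OPTIMAL:
    print([(r, s) for (r, s) in assign if solver.Value(assign[r, s])])
```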

11:35
occuFormer: Probabilistic Occupancy Forecasting at Geographically Distributed Electric Vehicle Charging Sites

ABSTRACT. The use of electric vehicles is rising rapidly across different modes of transportation, both in the air and on land. To accommodate the increased usage, the number of charging stations and the scale of electromobility infrastructure are also growing rapidly. The energy demand of these charging stations is erratic and hard to predict. Various factors, such as human lifestyle, traffic volume, time of use, weather, and neighborhood demographics, influence the energy consumption pattern. Forecasting occupancy, i.e., the number of probable EV arrivals at a specific point, is crucial not only for analyzing and planning resources such as energy and workforce and for estimating energy prices, but also for planning other services such as parking, grid stability, and energy management. This paper proposes a novel transformer-based occupancy forecasting model named occuFormer to forecast the next-day traffic volume at geographically distributed charging stations. Experiments are conducted on the real-world ACN and PKC datasets, collected in various cities in the USA and Europe. Our methods consistently outperform baseline models such as PatchTST and iTransformer.
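The abstract does not state how occuFormer produces probabilistic forecasts; one common choice for probabilistic forecasting is a quantile (pinball) loss, sketched below purely as illustration.

```python
# Hedged sketch: quantile (pinball) loss, a common way to make a forecaster
# probabilistic. The abstract does not specify occuFormer's output head, so
# treat this as illustration only.
import torch

def pinball_loss(pred, target, quantiles=(0.1, 0.5, 0.9)):
    # pred: (batch, horizon, n_quantiles); target: (batch, horizon)
    q = torch.tensor(quantiles, device=pred.device)          # (n_quantiles,)
    err = target.unsqueeze(-1) - pred                        # broadcast
    return torch.maximum(q * err, (q - 1) * err).mean()

pred = torch.randn(32, 24, 3)        # next-day horizon, 3 quantiles
target = torch.randn(32, 24)
loss = pinball_loss(pred, target)
```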

11:55-12:55 Lunch Break
12:55-14:15 Session 12: ARCS26 Main Track 4: AI acceleration and neural networks
12:55
High Performance Implementation of Convolutional Neural Networks on Versal AI Engine with STANN-AIEML
PRESENTER: Yu Li

ABSTRACT. Implementing deep learning algorithms on embedded and edge platforms requires meeting stringent constraints on energy efficiency, scalability, throughput, and real-time responsiveness. The AI Engine–Machine Learning (AIE‑ML), an ML-optimized variant of the AI Engine array in Versal adaptive SoCs, is well-suited to these demands. It provides a programmable, vector-centric compute fabric tailored to the constraints of edge devices. To exploit this architecture, we develop STANN-AIEML, an open-source hardware–software co-design workflow that supports efficient inference of deep learning models on Versal edge devices. The framework streamlines the mapping and optimization of neural networks onto AIE-ML-based architectures, enabling high-performance deployment in resource-constrained environments. In this work, we present implementations of several types of neural network layers, including one-dimensional convolutional layers, max-pooling layers, and dense layers with different data types and activation functions. In addition, a complete end-to-end AIE-ML deployment workflow is detailed. Based on the SECOM dataset targeting the detection of failures in semiconductor manufacturing, we implemented hardware accelerators with convolutional layers on the Trenz TE0950 AMD Versal AI Edge evaluation board using the STANN-AIEML workflow for classification. With an accuracy of 97.45 %, the one-dimensional convolution algorithm shows advantages for time-series data over other algorithms like random forest. Using the lightweight embedded AMD Versal AI Edge platform, an execution time of 47 µs can be achieved, more than eight times faster than the baseline using a desktop CPU.
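As a functional reference for the one-dimensional convolutional layers mentioned above, the NumPy sketch below models what such a layer computes; it is not the vectorized AIE-ML kernel, and all shapes are illustrative.

```python
# Functional model of a 1D convolutional layer with ReLU; the real AIE-ML
# kernels are vectorized, this only fixes the reference semantics.
import numpy as np

def conv1d(x, w, b, stride=1):
    # x: (in_ch, length), w: (out_ch, in_ch, k), b: (out_ch,)
    out_ch, in_ch, k = w.shape
    out_len = (x.shape[1] - k) // stride + 1
    y = np.empty((out_ch, out_len), dtype=x.dtype)
    for o in range(out_ch):
        for t in range(out_len):
            y[o, t] = np.sum(x[:, t * stride : t * stride + k] * w[o]) + b[o]
    return np.maximum(y, 0)                      # ReLU activation

x = np.random.randn(3, 32).astype(np.float32)
w = np.random.randn(8, 3, 5).astype(np.float32)
y = conv1d(x, w, np.zeros(8, dtype=np.float32))
print(y.shape)                                   # (8, 28)
```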

13:15
Automated Neural Network Partitioning Across Distributed Heterogeneous Edge Accelerators

ABSTRACT. Partitioning deep neural networks across heterogeneous edge accelerators can improve latency and resource utilization, yet selecting appropriate split points requires balancing compute distribution against intermediate activation transfer. This paper presents a model-centric split-point analysis and benchmarking pipeline operating directly on ONNX graphs. The approach enumerates valid single-split boundaries in arbitrary directed acyclic graphs, estimates per-boundary activation transfer size and partition-level FLOPs (including quantized operators), and ranks candidates using a configurable multi-criteria cost function that captures communication volume, compute imbalance, and tensor-count penalties. For empirical validation, the pipeline exports selected boundaries as executable ONNX subgraphs and automatically generates a benchmark suite. The suite verifies split correctness via an ε-pass criterion and measures latency and runtime variability under a specified ONNX Runtime execution provider. Beyond single-device validation with CUDA execution, the method is deployed on the u.RECS platform, a modular heterogeneous edge AI microserver integrating multiple accelerator types. Serving as a flexible testbed for heterogeneous inference, u.RECS enables systematic evaluation of partitioned models across real hardware configurations. The approach is evaluated on four representative networks: YOLOv7 (FP32), MobileNetV2 (FP32), ResNet-50 (INT8), and EfficientNet-Lite4 (FP32). Results demonstrate a strong correlation between partition FLOPs and measured runtime and show that graph-level cost modeling provides an effective and reproducible pre-filter for split selection in heterogeneous edge deployments.
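The core enumeration step can be sketched directly on an ONNX graph: nodes in a valid ONNX model are topologically sorted, so every prefix of the node list is a single-split candidate, and the tensors produced in the prefix and consumed in the suffix are what must cross the boundary. The sketch below is simplified relative to the paper's pipeline, and the model path is hypothetical.

```python
# Sketch of single-split enumeration on an ONNX graph; simplified relative
# to the paper's pipeline (no FLOP estimation or cost ranking).
import onnx

model = onnx.load("model.onnx")                  # hypothetical path
nodes = list(model.graph.node)

for split in range(1, len(nodes)):
    produced = {t for n in nodes[:split] for t in n.output}
    consumed = {t for n in nodes[split:] for t in n.input}
    crossing = produced & consumed               # activation tensors on the cut
    print(f"split after node {split - 1}: {sorted(crossing)}")
```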

13:35
BD-RVE: Block Dictionary Approach for ANN Acceleration Using RISC-V Extension
PRESENTER: Ahmad Menbari

ABSTRACT. Deploying deep neural networks on resource-constrained embedded systems remains challenging due to strict limitations in memory capacity and computational efficiency. RISC-V processors are an attractive platform for such systems due to their open and extensible instruction set architecture, which enables flexible customization for neural network acceleration. In this paper, we employ block dictionary quantization as an effective compression technique that represents groups of weights using shared dictionary entries, enabling considerable memory reduction while maintaining model accuracy. A design space exploration is performed across multiple compact convolutional and fully connected networks to analyze the trade-offs between block size, dictionary size, and weight bit-width under a strict accuracy loss constraint. Based on the exploration results, we identify configurations that offer a favorable balance between accuracy and memory. Using these parameters, we design an accelerator tightly coupled to a RISC-V core through custom instruction extensions. Results demonstrate speedups of up to 7× for fully connected networks and up to 11× for convolutional networks compared to a baseline implementation, confirming the effectiveness of the proposed approach for neural network inference acceleration.
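Block dictionary quantization as described can be sketched as follows: weights are grouped into fixed-size blocks, and each block is replaced by the index of its nearest entry in a small shared dictionary. Codebook training is omitted here (the dictionary is simply sampled from the blocks), so this is only a functional illustration.

```python
# Sketch of block dictionary quantization: store per-block dictionary
# indices plus a small codebook instead of the full weight tensor.
import numpy as np

def bd_quantize(w, block_size=4, dict_size=16, rng=np.random.default_rng(0)):
    blocks = w.reshape(-1, block_size)                        # group weights
    dictionary = blocks[rng.choice(len(blocks), dict_size, replace=False)]
    # Index of the nearest dictionary entry per block (L2 distance).
    d = np.linalg.norm(blocks[:, None, :] - dictionary[None, :, :], axis=-1)
    idx = d.argmin(axis=1).astype(np.uint8)
    return idx, dictionary

def bd_dequantize(idx, dictionary, shape):
    return dictionary[idx].reshape(shape)

w = np.random.randn(64, 16).astype(np.float32)
idx, dic = bd_quantize(w)
w_hat = bd_dequantize(idx, dic, w.shape)
print(idx.nbytes + dic.nbytes, "bytes vs", w.nbytes)          # memory saving
```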

13:55
ConvLUT: LUT-based FPGA Accelerators for Convolutional Neural Networks

ABSTRACT. Many embedded applications of deep neural networks (DNNs) impose strict requirements on throughput and latency. When deployed on edge computing devices, additional constraints arise, including efficient utilization of hardware resources and energy consumption. For DNN inference, field-programmable gate arrays (FPGAs) have proven effective in meeting these demands, particularly when models are sparsified and aggressively quantized. Lookup-table (LUT)-based DNN architectures implement neural networks exclusively using FPGA LUT resources, enabling ultra-low latency and extremely high throughput. However, their primary limitation is scalability, which restricts their applicability to small- and medium-sized DNNs in latency- and throughput-critical embedded applications. To date, all reported LUT-based architectures have focused on multi-layer perceptrons. This paper proposes ConvLUT, a novel methodology for LUT-based DNN inference accelerators targeting one-dimensional convolutional neural networks (1D-CNNs). In addition to fully connected layers, ConvLUT efficiently maps convolutional and pooling layers onto FPGA LUTs. The methodology begins with training a quantized and sparsified 1D-CNN, after which all network layers are converted into a LUT-based representation. This representation is subsequently transformed into a register-transfer level (RTL) circuit and synthesized to an FPGA. The evaluation uses the RadioML, MNIST, and electrocardiography datasets and provides a comprehensive assessment of ConvLUT architectures in terms of classification accuracy, hardware resource utilization, latency, and throughput. In addition, the specialized ConvLUT approach is compared against FINN and hls4ml, two state-of-the-art DNN compiler frameworks for FPGAs. Across all datasets, ConvLUT consistently outperforms both frameworks. For example, on the RadioML classification task, ConvLUT achieves up to 10× and 585× higher throughput than FINN and hls4ml, respectively, without any loss in accuracy.
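The underlying LUT conversion idea can be illustrated on a single quantized neuron: with a few low-bit inputs the input space is finite, so the neuron's response can be enumerated into a table once and evaluated without arithmetic at inference. A toy sketch (the real methodology operates on whole sparsified layers):

```python
# Toy LUT conversion: enumerate all input combinations of a tiny quantized
# neuron once; inference then becomes a single table lookup.
import itertools
import numpy as np

w = np.array([1, -2, 1, 1])                 # tiny quantized neuron, 4 binary inputs
thresh = 0

lut = {}
for bits in itertools.product((0, 1), repeat=len(w)):
    lut[bits] = int(np.dot(bits, w) >= thresh)      # binarized activation

print(lut[(1, 0, 1, 1)])                    # inference = one lookup -> 1
```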

14:15-14:35 Coffee Break
14:35-16:15 Session 13: ARCS26 Main Track 5: Data structures and software engineering
14:35
HyperionR - Experiences transforming an optimized Trie index data structure from C to Rust
PRESENTER: Reza Salkhordeh

ABSTRACT. Index data structures are central to efficient data processing, but their most optimized implementations are written in memory-unsafe languages such as C and C++. Recent reports identify memory safety as a risk in the design of critical software, motivating the adoption of memory-safe alternatives. Rust is a promising systems programming language that offers strong compile-time guarantees without significant runtime overhead. However, it remains unclear whether highly optimized index data structures that rely on low-level memory management and concurrency can be reimplemented in Rust while preserving correctness, performance, and memory efficiency.

This paper addresses this question through a case study of the Hyperion trie, a highly optimized and memory-efficient index originally implemented in C. We translate Hyperion to Rust and analyze the required implementation effort of the resulting HyperionR, the extent to which Rust’s safe subset can be used, and where unsafe code remains necessary. Our evaluation compares performance and memory consumption between both implementations and shows that most components can be ported with reasonable effort while significantly limiting unsafe code. Although HyperionR incurs moderate performance overhead, it benefits from stronger safety guarantees and improved error detection at compile time.

14:55
CDH: Accelerating Topological Mode-Seeking via Parallel Cell-Based Density Hiking
PRESENTER: Jan Schneegans

ABSTRACT. This paper presents Cell-Density-Hiking (CDH), a highly parallel clustering algorithm that utilizes the idea of topological prominence and leverages Quadtree/Octree spatial indexing to dramatically accelerate density estimation and peak detection. Unlike traditional implementations, CDH operates on aggregated cell structures rather than individual points, enabling massive parallelization while approximating the topological structure of the underlying data manifolds. CDH is implemented as a GPU-accelerated algorithm using compute shaders, enabling efficient processing of large-scale datasets while maintaining predictable memory overhead. We evaluate CDH against leading clustering approaches, including HDBSCAN, ToMATo, and K-Means. Results show that CDH achieves runtime parity with K-Means while combining the empirical performance of leading density-based clustering approaches with the robustness of topological mode-seeking algorithms. On large-scale benchmarks, CDH clusters 16 million data points in under one second, effectively providing a real-time, high-performance parallel implementation of topological data analysis for large-scale applications.
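A toy sketch of the cell-based mode-seeking idea (far simpler than CDH's spatial indexing and GPU compute shaders): bin points into grid cells, then hill-climb each occupied cell to its densest neighbor; cells that are their own peak act as cluster seeds.

```python
# Toy cell-based density hiking (illustrative, not the CDH implementation).
import numpy as np

pts = np.random.randn(10000, 2)
res = 0.25                                        # cell edge length
cells = np.floor(pts / res).astype(int)
uniq, counts = np.unique(cells, axis=0, return_counts=True)
density = {tuple(c): n for c, n in zip(uniq, counts)}

def hike(cell):
    # Move to the densest 8-neighbor until the cell is a local peak.
    while True:
        nbrs = [(cell[0] + dx, cell[1] + dy)
                for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
        best = max(nbrs, key=lambda c: density.get(c, 0))
        if density.get(best, 0) <= density[cell]:
            return cell                           # local density peak
        cell = best

peaks = {hike(tuple(c)) for c in uniq}
print(len(peaks), "modes found")
```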

15:15
16-Bit Posits as an Alternative to Single-Precision Floats in Embedded Radar FFTs

ABSTRACT. The Fast Fourier Transform (FFT) is a fundamental step in radar/LiDAR pipelines. Due to the high data throughput required by these sensors, the FFT often acts as a communication-bound bottleneck, which is particularly critical in resource-constrained embedded environments where memory bandwidth and energy budgets are strictly limited. In this paper, we propose replacing the single-precision 32-bit IEEE floats that are standard for maintaining numerical stability with 16-bit Posit arithmetic for FFT computation. The quality of the reduced precision is evaluated beyond conventional Signal-to-Noise Ratio (SNR) metrics with a subsequent Constant False Alarm Rate (CFAR) processing stage used for target detection. Experimental results show that 16-bit Posits maintain sufficient dynamic range to preserve detection sensitivity, incurring a Normalized Mean Squared Error (NMSE) of less than 0.05% relative to the 32-bit baseline. To validate hardware feasibility, we present a VHDL implementation of a fixed-size FFT accelerator. The transition to a 16-bit representation allows a 50% reduction in required memory bandwidth and a significant decrease in energy consumption per transform. The findings suggest that Posit arithmetic is a viable candidate for next-generation embedded sensor processing, enabling higher throughput without sacrificing object detection reliability.
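For reference, the NMSE figure quoted above is, in its standard form (the abstract does not spell out the normalization, so this common definition is an assumption):

$$\mathrm{NMSE} = \frac{\sum_n \left|X_{\mathrm{f32}}[n] - X_{\mathrm{p16}}[n]\right|^2}{\sum_n \left|X_{\mathrm{f32}}[n]\right|^2} < 0.05\% = 5 \times 10^{-4}$$

where $X_{\mathrm{f32}}$ is the FFT output computed with 32-bit floats and $X_{\mathrm{p16}}$ its 16-bit Posit counterpart.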

15:35
Rebooting Microreboot: Architectural Support for Safe, Parallel Recovery in Microservice Systems

ABSTRACT. Microreboot enables fast recovery by restarting only the failing component, but in modern microservices naive restarts are unsafe: dense dependencies mean rebooting one service can disrupt many callers. Autonomous remediation agents compound this by actuating raw infrastructure commands without safety guarantees. We make microreboot practical by separating planning from actuation: agents propose typed remediation plans whose actions declare side-effect semantics, and a small microkernel validates and executes each plan transactionally. To determine where restart is safe, we infer recovery boundaries online from distributed traces, computing minimal restart groups and ordering constraints. On industrial traces (Alibaba, Meta) and DeathStarBench with fault injection, recovery-group inference runs in 21 ms at P99; typed actuation reduces agent-caused harm by 95% in simulation and achieves 0% harm online.
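The recovery-boundary idea can be illustrated with a toy computation (not the paper's online algorithm): given a call graph recovered from traces, the minimal restart group for a failing service is the service itself plus every transitive caller that cannot tolerate the restart. Here tolerance is a per-edge flag invented for illustration.

```python
# Toy restart-group inference over a trace-derived call graph.
from collections import defaultdict

# caller -> [(callee, tolerates_callee_restart)]
calls = {"frontend": [("cart", False), ("search", True)],
         "cart": [("db", False)],
         "search": [("db", True)]}

callers = defaultdict(list)
for src, edges in calls.items():
    for dst, tolerant in edges:
        callers[dst].append((src, tolerant))

def restart_group(failed):
    group, stack = {failed}, [failed]
    while stack:
        svc = stack.pop()
        for caller, tolerant in callers[svc]:
            if not tolerant and caller not in group:
                group.add(caller)
                stack.append(caller)
    return group

print(restart_group("db"))        # {'db', 'cart', 'frontend'}
```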

15:55
Extending Contiki-NG with AntHocNet Routing for Mobile Scenarios
PRESENTER: Thomas Böhme

ABSTRACT. The Internet of Things (IoT) is a rapidly growing field of research and development. Contiki-NG is an open-source operating system (OS) designed for resource-constrained IoT devices, providing support for the IPv6 stack and the IEEE 802.15.4-2015 standard. It is widely used in the research community. However, the OS is currently limited to the Routing Protocol for Low-Power and Lossy Networks (RPL), which does not perform well in dynamic network scenarios.

Routing is challenging in mobile ad hoc networks due to their dynamic nature and the lack of a central authority to manage routes. Different nature-inspired routing protocols have been proposed to address these challenges, among them AntHocNet. It is a hybrid routing protocol that combines proactive and reactive routing strategies and is based on the ant colony optimization framework. The successful integration of AntHocNet into Contiki-NG provides a new routing option for IoT applications.

The goal of this paper is to make the AntHocNet routing protocol part of the Contiki-NG operating system. To this end, the protocol is implemented, integrated into the Contiki-NG code base, and made publicly available to enable further research. The performance of AntHocNet is then evaluated in different simulated scenarios using the Cooja simulator and compared to RPL.

It was found that Contiki-NG's existing architecture is not fully optimized for the integration of new routing algorithms and is clearly designed for use with RPL. The simulation results show that AntHocNet is able to establish routes in different scenarios. In highly dynamic scenarios, AntHocNet performs substantially worse than in static ones. However, it still outperforms RPL in terms of packet delivery ratio.
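For background, the ant colony optimization framework that AntHocNet builds on maintains per-neighbor pheromone values that evaporate over time and are reinforced by ants reporting path costs; next hops are then sampled proportionally. A generic sketch (illustrative only, not the Contiki-NG implementation):

```python
# Generic ant-colony pheromone bookkeeping of the kind AntHocNet builds on.
import random

pheromone = {"nbr_a": 1.0, "nbr_b": 1.0}     # per-destination routing entry
RHO = 0.1                                    # evaporation rate

def evaporate():
    for n in pheromone:
        pheromone[n] *= (1 - RHO)

def reinforce(next_hop, path_cost):
    pheromone[next_hop] += 1.0 / path_cost   # better paths deposit more

def choose_next_hop():
    hops, weights = zip(*pheromone.items())
    return random.choices(hops, weights=weights, k=1)[0]

reinforce("nbr_a", path_cost=2.0)
evaporate()
print(choose_next_hop())
```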