EURO-PAR 2025: 31ST INTERNATIONAL EUROPEAN CONFERENCE ON PARALLEL AND DISTRIBUTED COMPUTING
PROGRAM FOR WEDNESDAY, AUGUST 27TH

10:30-11:00 Coffee Break
11:00-12:30 Session 11A: Track 2.1: Scheduling, Resource Management, Cloud, Edge Computing, and Workflows
11:00
Daniel Medeiros (KTH Royal Institute of Technology, Sweden)
Jeremy Williams (KTH Royal Institute of Technology, Sweden)
Jacob Wahlgren (KTH Royal Institute of Technology, Sweden)
Leonardo Saud Maia Leite (KTH Royal Institute of Technology, Sweden)
Ivy Peng (KTH Royal Institute of Technology, Sweden)
ARC-V: Vertical Resource Adaptivity for HPC Workloads in Containerized Environments

ABSTRACT. Existing state-of-the-art vertical autoscalers for containerized environments are traditionally built for cloud applications, which may behave differently from HPC workloads and their dynamic resource consumption. In such environments, autoscalers can produce inefficient resource allocations. This work analyzes nine representative HPC applications with different memory consumption patterns. Our results identify the limitations and inefficiencies of the Kubernetes Vertical Pod Autoscaler (VPA) for enabling memory-elastic execution of HPC applications. We propose, implement, and evaluate ARC-V, a policy that leverages both in-flight resource updates of pods in Kubernetes and knowledge of the memory consumption patterns of HPC applications to achieve elastic memory provisioning at the node level. Our results show that ARC-V effectively saves memory while eliminating out-of-memory errors compared to the standard Kubernetes VPA.
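
The abstract only names the policy; as a rough illustration of node-level vertical memory scaling, the Python sketch below picks a new container memory limit from recent usage samples. Every name, threshold, and the headroom rule is invented for illustration and is not ARC-V's actual algorithm.

    # Hypothetical sketch of a node-level vertical memory-scaling decision,
    # inspired by the idea of in-flight pod resource updates; not ARC-V itself.

    def next_memory_limit(usage_samples_mib, node_free_mib,
                          headroom=1.15, min_limit_mib=256):
        """Pick a new memory limit from recent usage samples of one container.

        usage_samples_mib: recent working-set sizes (MiB), newest last.
        node_free_mib:     memory still unallocated on the node (MiB).
        headroom:          safety factor above the observed peak to avoid OOM.
        """
        peak = max(usage_samples_mib)
        # Grow proactively if the trend is upward (e.g. an expanding phase of
        # the application), otherwise shrink back toward the observed peak.
        trend = usage_samples_mib[-1] - usage_samples_mib[0]
        target = peak * headroom + max(trend, 0)
        # Never request more than what the node can still hand out.
        target = min(target, peak + node_free_mib)
        return max(int(target), min_limit_mib)

    # Example: a job whose footprint oscillates between roughly 3.2 and 3.9 GiB.
    print(next_memory_limit([3300, 3900, 3400, 4000], node_free_mib=2048))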

11:20
Thomas Jakobsche (University of Basel, Switzerland)
Osman S. Simsek (University of Basel, Switzerland)
Jim Brandt (Sandia National Laboratories, United States)
Ann Gentile (Sandia National Laboratories, United States)
Florina M. Ciorba (University of Basel, Switzerland)
An Autonomy Loop for Dynamic HPC Job Time Limit Adjustment

ABSTRACT. High Performance Computing (HPC) systems rely on fixed user-provided estimates of job time limits. These estimates are often inaccurate, resulting in inefficient resource use and the loss of unsaved work if a job times out shortly before reaching its next checkpoint. This work proposes a novel feedback-driven autonomy loop that dynamically adjusts HPC job time limits based on checkpoint progress reported by applications. Our approach monitors checkpoint intervals and queued jobs, enabling informed decisions to either early cancel a job after its last completed checkpoint or extend the time limit sufficiently to accommodate the next checkpoint. The objective is to minimize tail waste, that is, the computation that occurs between the last checkpoint and the termination of a job, which is not saved and hence wasted. Through experiments conducted on a subset of a production workload trace, we show a 95% reduction of tail waste, which equates to saving approximately 1.3% of the total CPU time that would otherwise be wasted. We propose various policies that combine early cancellation and time limit extension, achieving tail waste reduction while improving scheduling metrics such as weighted average job wait time. This work contributes an autonomy loop for improved scheduling in HPC environments, where system job schedulers and applications collaborate to significantly reduce resource waste and improve scheduling performance.
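
A hedged sketch of the kind of decision such an autonomy loop makes: given the observed checkpoint interval and the job's remaining time limit, either extend the limit enough to reach the next checkpoint or cancel right after the last completed one. Function and parameter names are illustrative, not the paper's interface.

    # Illustrative tail-waste policy: extend the limit to cover the next
    # checkpoint, or cancel after the last one.  All times are in seconds.

    def adjust_time_limit(elapsed, time_limit, last_ckpt, ckpt_interval,
                          extension_cap=1800):
        """Return ('keep', limit), ('extend', new_limit) or ('cancel', cancel_time)."""
        next_ckpt = last_ckpt + ckpt_interval
        if next_ckpt <= time_limit:
            return ("keep", time_limit)           # next checkpoint fits anyway
        needed = next_ckpt - time_limit
        if needed <= extension_cap:
            return ("extend", next_ckpt)          # a small extension saves the work
        # Extension too costly: everything after the last checkpoint would be
        # tail waste, so cancel the job right after that checkpoint completes.
        return ("cancel", max(elapsed, last_ckpt))

    print(adjust_time_limit(elapsed=6800, time_limit=7200,
                            last_ckpt=6600, ckpt_interval=900))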

11:40
Rajat Bhattarai (Tennessee Tech University, United States)
Howard Pritchard (Los Alamos National Laboratory, United States)
Sheikh Ghafoor (Tennessee Tech University, United States)
Enabling Elasticity in Scientific Workflows for High Performance Computing Systems

ABSTRACT. Modern research relies on scientific workflows to manage complex computational tasks across high-performance computing (HPC) systems. Most HPC systems still use a largely static resource allocation strategy, which means resources are fixed during runtime. However, modern workflows are increasingly data and event-driven, especially with the integration of artificial intelligence tasks. Static workflows lack the flexibility to adapt at runtime to evolving workflow and system demands, often impacting resource utilization and overall efficiency.

To address these concerns, scientific workflows require compute elasticity, the capability to dynamically adjust resource usage based on computational load and system availability. Elastic workflows can enhance computational efficiency by leveraging adaptive scheduling algorithms and smart resource management strategies. However, implementing elastic workflows in HPC environments is challenging, as it requires modifications across the entire HPC software stack, including workflow managers, resource managers, and middleware layers. This paper presents the design and implementation of an elastic workflow framework for HPC, highlighting key challenges and design considerations. Our framework is based on an elastic Parsl workflow manager and a custom elastic resource manager, both leveraging capabilities of an implementation of the Process Management Interface for Exascale (PMIx) API as middleware. Through real-world and synthetic workflow case studies, we demonstrate improved resource utilization and reduced execution times, showcasing the benefits of elasticity in HPC workflows.
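
As a toy companion to the elasticity idea (not the Parsl/PMIx framework described above), the sketch below shows one possible resize rule a workflow manager could apply: grow the allocation while tasks queue up, shrink it when workers sit idle. All names and thresholds are invented.

    # Illustrative elasticity rule: scale out while tasks queue faster than
    # they drain, scale in when a large share of workers is idle.

    def resize_allocation(queued_tasks, idle_workers, current_nodes,
                          min_nodes=1, max_nodes=64, tasks_per_node=4):
        if queued_tasks > current_nodes * tasks_per_node and idle_workers == 0:
            return min(max_nodes, current_nodes * 2)     # scale out
        if idle_workers > current_nodes * tasks_per_node // 2:
            return max(min_nodes, current_nodes // 2)    # scale in
        return current_nodes                             # steady state

    print(resize_allocation(queued_tasks=37, idle_workers=0, current_nodes=4))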

12:00
Marta Navarro (Universitat Politècnica de València, Spain)
Vicent Pallardó-Julià (Universitat de València, Spain)
Salvador Petit (Universitat Politècnica de València, Spain)
Maria Gomez (Universitat Politècnica de València, Spain)
Julio Sahuquillo (Universitat Politècnica de València, Spain)
WAPA: A Workload-Agnostic CPI-Based Thread-to-Core Allocation Policy

ABSTRACT. Simultaneous multithreading (SMT) processors improve system throughput by sharing core resources among the threads running on the same core. However, intra-core interference can cause co-running applications to degrade each other's performance significantly.

To address this issue, some approaches have focused on balancing contention at the core's shared resources (e.g., the shared L1 data cache). A key advantage of these approaches is that they are workload-agnostic. Other approaches improve on the previous ones by modeling the inter-application interference across the intra-core shared resources. Unfortunately, these approaches require offline model training for specific workloads.

This paper presents WAPA, a CPI-based thread-to-core allocation approach that incorporates the best of both worlds. WAPA is a workload-agnostic policy that implicitly accounts for inter-thread interference across all the shared resources by leveraging the CPI. The proposed approach relies on optimal transport (OT) theory, a mathematical framework, to dynamically select symbiotic pairs of applications.

Experimental results on an Intel Xeon show that WAPA outperforms the default Linux scheduler by 8.4% in IPC on average for workloads dominated by cache and main-memory latencies, while the performance gains of existing approaches stay below 3.2%.
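
The optimal-transport formulation is the paper's contribution; the toy heuristic below only captures the underlying intuition of pairing complementary CPI profiles (a memory-bound thread with a compute-bound one) on the same SMT core. It is a greedy stand-in, not WAPA.

    # Toy thread-to-core pairing driven only by measured CPI: pair the most
    # memory-bound thread (highest CPI) with the most compute-bound one
    # (lowest CPI) on the same SMT core.

    def pair_by_cpi(cpi_by_thread):
        """cpi_by_thread: {thread_id: measured CPI}.  Returns a list of core pairs."""
        order = sorted(cpi_by_thread, key=cpi_by_thread.get)
        pairs = []
        while len(order) >= 2:
            compute_bound = order.pop(0)   # lowest CPI
            memory_bound = order.pop(-1)   # highest CPI
            pairs.append((compute_bound, memory_bound))
        if order:                          # odd thread count: run it alone
            pairs.append((order.pop(),))
        return pairs

    print(pair_by_cpi({"lbm": 2.9, "namd": 0.6, "mcf": 3.4, "povray": 0.5}))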

11:00-12:30 Session 11B: Track 3.1: Neural Network Acceleration and Optimization
11:00
Yudong Mu (Institute of Computing Technology, Chinese Academy of Sciences, China)
Zhihua Fan (Institute of Computing Technology, Chinese Academy of Sciences, China)
Xiaoxia Yao (China Mobile Research Institute, China)
Wenming Li (Institute of Computing Technology, Chinese Academy of Sciences, China)
Zhiyuan Zhang (Institute of Computing Technology, Chinese Academy of Sciences, China)
Honglie Wang (Institute of Automation, Chinese Academy of Sciences, China)
Xuejun An (Institute of Computing Technology, Chinese Academy of Sciences, China)
Xiaochun Ye (Institute of Computing Technology, Chinese Academy of Sciences, China)
FDHA: Fusion-Driven Heterogeneous Accelerator for Efficient Diffusion Model Inference

ABSTRACT. Diffusion models have emerged as powerful tools for generative AI tasks. While prior research primarily focuses on eliminating redundancy across timesteps, models like Stable Diffusion introduce a ResNet-Transformer Alternating Execution (RTAE) Pattern, where convolution and attention operators execute sequentially within each timestep. This execution pattern creates two major challenges: (1) excessive on-chip memory access due to frequent data movement between convolution and Transformer operations, and (2) underutilization of computational resources caused by the distinct processing characteristics of these two operators. To tackle these challenges, we propose FDHA, an accelerator designed for efficient diffusion model inference. First, to mitigate redundant on-chip memory access, FDHA introduces an inter-operator dataflow fusion mechanism that strategically aligns ResNet's convolution and Transformer's matrix multiplication dimensions, enabling efficient kernel reuse. Second, to maximize computational resource utilization, FDHA employs a heterogeneous architecture with dedicated Processing Elements for convolutions and Tensor Processing Elements for matrix multiplications, allowing for pipelined execution. Experimental results demonstrate that FDHA achieves 3.28x speedup over an NVIDIA A100 GPU and 2.62x speedup over a SoTA diffusion accelerator.

11:20
Jiale Dong (University of Science and Technology of China, China)
Hao Wu (University of Science and Technology of China, China)
Zihao Wang (University of Science and Technology of China, China)
Wenqi Lou (University of Science and Technology of China, China)
Zhendong Zheng (University of Science and Technology of China, China)
Lei Gong (University of Science and Technology of China, China)
Chao Wang (University of Science and Technology of China, China)
Xuehai Zhou (University of Science and Technology of China, China)
CoQMoE: Co-Designed Quantization and Computation Orchestration for Mixture-of-Experts Vision Transformer on FPGA

ABSTRACT. Vision Transformers (ViTs) exhibit superior performance in computer vision tasks but face deployment challenges on resource-constrained devices due to high computational/memory demands. While Mixture-of-Experts Vision Transformers (MoE-ViTs) mitigate this through scalable architecture with sub-linear computational growth, their hardware implementation on FPGAs remains constrained by resource limitations. This paper proposes a novel accelerator for efficiently implementing quantized MoE models on FPGAs through two key innovations: (1) A dual-stage quantization scheme combining precision-preserving complex quantizers with hardware-friendly simplified quantizers via scale reparameterization, with only 0.28% accuracy loss compared to full precision; (2) A resource-aware accelerator architecture featuring latency-optimized streaming attention kernels and reusable linear operators, effectively balancing performance and resource consumption. Experimental results demonstrate that our accelerator achieves nearly 155 frames per second, a 5.35x improvement in throughput, and over 80% energy reduction compared to state-of-the-art (SOTA) FPGA MoE accelerators, while maintaining <1% accuracy loss across vision benchmarks. The code will be open-sourced.

11:40
Joonyup Kwon (Korea University, South Korea)
Jinhyeok Choi (Korea University, South Korea)
Ngoc-Son Pham (Korea University, South Korea)
Sangwon Shin (Korea University, South Korea)
Taeweon Suh (Korea University, South Korea)
SkipNZ: Non-Zero Value Skipping for Efficient CNN Acceleration

ABSTRACT. This paper proposes SkipNZ, a novel approach to reduce computational demands with negligible accuracy loss, improving the speed of sparse tensor convolutional neural networks. SkipNZ extends existing zero-value skipping techniques and enables skipping MAC operations for non-zero values. The evaluation results demonstrate that the proposed technique reduces computational load and improves overall performance with negligible accuracy loss. Compared to the baseline, SkipNZ reduces execution time to 0.71× for AlexNet with Gap9 at a 0.1% accuracy loss, and VGG16 with Gap8 achieves a reduction to 0.78× with no accuracy loss. The reduction in execution time comes from the filtering of unnecessary MAC operations introduced by SkipNZ. Synthesis results show that the additional hardware required by SkipNZ amounts to only a 0.005% increase over the baseline.

12:00
Piyumal Ranawaka (Chalmers University of Technology, Sweden)
Per Stenstrom (Chalmers University of Technology, Sweden)
BATCH-DNN: Adaptive and Dynamic Batching for Multi-DNN Accelerators

ABSTRACT. Multi-DNN accelerators enable the simultaneous execution of multiple DNN workloads, enhancing performance by overlapping their computations and memory accesses. However, this increases the demand on on-chip memory, which must handle the footprints of several workloads. Batching allows multiple consecutive DNN inferences from the same model to share weights, improving weight reuse and reducing off-chip access costs over a batch. However, for batching to be effective, the input feature maps of all inferences in the batch must fit in on-chip memory. Thus, the offline-determined batch size of each workload can exhaust the available on-chip memory when multiple workloads share it, resulting in reduced performance.

This paper introduces BATCH-DNN, a dynamic method for adapting batch size on a layer-by-layer basis to the available on-chip memory, enhancing multi-DNN accelerator performance. It employs two techniques: adaptive cascaded sub-batching and adaptive sub-batch merging. Offline profiling assesses the footprint, while run-time adjustment establishes the maximum batch size for each layer based on the available on-chip memory.
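
As a back-of-the-envelope illustration of layer-wise batch adaptation (not BATCH-DNN's algorithm), the batch size used for a layer can be capped by how many input feature maps fit in the on-chip memory left over for that workload. Names and numbers below are invented.

    # Toy per-layer batch cap: weights are shared across the batch, so the
    # leftover SRAM determines how many input feature maps can be batched.

    def layer_batch_size(ifmap_bytes, weight_bytes, sram_free_bytes,
                         requested_batch):
        usable = sram_free_bytes - weight_bytes        # weights stored once per batch
        if usable <= 0:
            return 1                                   # fall back to unbatched execution
        return max(1, min(requested_batch, usable // ifmap_bytes))

    # 512 KiB free, 128 KiB of weights, 56x56x64 int8 feature maps (~196 KiB each)
    print(layer_batch_size(ifmap_bytes=56 * 56 * 64, weight_bytes=128 * 1024,
                           sram_free_bytes=512 * 1024, requested_batch=8))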

11:00-12:30 Session 11C: Track 6.1: Memory and I/O Systems
11:00
Yisu Wang (HKUST (GZ), China)
Xinjiao Li (HKUST (GZ), China)
Ruilong Wu (HKUST (GZ), China)
Huangxun Chen (HKUST (GZ), China)
Dirk Kutscher (HKUST (GZ), Germany)
NetSenseML: Network-Adaptive Compression for Efficient Distributed Machine Learning

ABSTRACT. Training large-scale distributed machine learning models imposes considerable demands on network infrastructure, often resulting in sudden traffic spikes that lead to congestion, increased latency, and reduced throughput—ultimately affecting convergence times and overall training performance. While gradient compression techniques are commonly employed to alleviate network load, they frequently compromise model accuracy due to the loss of gradient information.

This paper introduces NetSenseML, a novel network-adaptive distributed deep learning framework that dynamically adjusts quantization, pruning, and compression strategies in response to real-time network conditions. By actively monitoring network performance, NetSenseML optimizes the balance between reducing data payloads and preserving model accuracy, minimizing the trade-offs between transmission efficiency, convergence speed, and performance.

Our approach ensures efficient resource usage by adapting reduction techniques to current network conditions, leading to shorter convergence times and improved training efficiency. We present the design of the NetSenseML adaptive data reduction function, and experimental evaluations show that NetSenseML can improve training throughput by a factor of 1.55× to 9.84× compared to state-of-the-art compression-enabled systems for representative DDL training jobs under bandwidth-constrained conditions.
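
A hypothetical sketch of a network-adaptive reduction policy in the spirit described above: the scarcer the measured bandwidth relative to what uncompressed gradients would need, the more aggressively they are sparsified and quantized. The ladder of (density, bits) settings is invented and is not NetSenseML's actual policy.

    # Invented adaptive-compression ladder keyed on available vs. required bandwidth.

    def pick_compression(available_gbps, required_gbps):
        headroom = available_gbps / required_gbps   # >= 1 means the link can take raw gradients
        if headroom >= 1.0:
            return {"density": 1.0, "bits": 32}     # healthy network: send FP32 as-is
        if headroom >= 0.25:
            return {"density": 0.1, "bits": 16}     # mild congestion: top-10% + FP16
        return {"density": 0.01, "bits": 8}         # heavy congestion: top-1% + INT8

    # A job needing ~80 Gb/s of gradient traffic on a link currently giving ~15 Gb/s:
    print(pick_compression(available_gbps=15, required_gbps=80))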

11:20
John W. Romein (Stichting ASTRON (Netherlands Institute for Radio Astronomy), Netherlands)
Breaking the I/O Barrier: 1.2 Tb/s Ethernet Packet Processing on a GPU

ABSTRACT. Radio telescopes produce enormous amounts of data. Many of them use GPU clusters to combine the digitized antenna signals, usually in real time. Achieving high data rates is challenging: the PCIe bandwidth of discrete GPUs is limited, and without RDMA, handling 200 or 400 Gb/s Ethernet packets with telescope data is difficult.

The NVIDIA Grace Hopper is a novel, innovative system that eliminates the I/O bottleneck of traditional, discrete GPUs by using NVLink instead of PCIe. This opens the door to higher data rates, but faster hardware alone is not enough. In this paper, we combine hardware and software innovations to process Ethernet packets at no less than 1.2 Tb/s, a huge improvement over what was previously possible. We use the Data Plane Development Kit to minimize the receive overhead, and use a new feature that allows packet processing directly by the GPU. We demonstrate the data handling in a correlator application, analyze the performance, and show how to reduce the energy use.

The presented innovations enable the use of GPUs for more powerful telescopes with much higher data rates. The results are also of interest to (GPU) applications from other application domains with high I/O demands, especially if RDMA is not available.

11:40
Tianyu Wan (Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan, Hubei, China)
Shijia Gong (Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan, Hubei, China)
Yangyang Hu (Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan, Hubei, China)
Jianxi Chen (Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan, Hubei, China)
GECKO: A Write-optimized Hybrid Index based on Disaggregated Memory

ABSTRACT. Disaggregated memory separates compute and memory nodes into independent resource pools connected via RDMA or CXL links. Disaggregated memory improves resource utilization, reduces cost overhead, and ensures elastic scalability of memory and compute resources. Tree indexes are essential components of storage systems such as databases and KV stores. Existing disaggregated memory systems suffer from poor write performance, mainly due to concurrency conflicts, frequent structure modification operations (SMOs), and high lock overhead on the tree index.

To solve this problem, we propose GECKO, a write-optimized Adaptive Radix Tree (ART) index structure for disaggregated memory. We leverage 1) a write-optimized buffer node to handle concurrent writes and improve write performance, 2) a threshold-based splitting strategy to reduce splits and optimize SMOs, and 3) a post-insertion lock design to reduce lock overhead and insertion tail latency. We compare GECKO with state-of-the-art solutions; experiments show that GECKO improves throughput by 1.43×-3.21× under write workloads, while SMO time is reduced by 88.5% and lock time by 87.9%.
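
To make the buffering idea concrete (only the idea; not GECKO's ART layout, RDMA path, or locking protocol), the toy below stages inserts in a small buffer and merges them into the tree in bulk, so one merge amortizes many would-be structural modifications.

    # Toy "buffer node": writes land in a staging buffer and are merged into
    # the (stand-in) subtree in one bulk step once a threshold is reached.

    class BufferedNode:
        def __init__(self, flush_threshold=64):
            self.buffer = {}                      # staged key -> value writes
            self.children = {}                    # stands in for the real subtree
            self.flush_threshold = flush_threshold

        def insert(self, key, value):
            self.buffer[key] = value              # cheap, append-style write
            if len(self.buffer) >= self.flush_threshold:
                self._flush()                     # one bulk structure update

        def _flush(self):
            self.children.update(self.buffer)     # stands in for a batched SMO
            self.buffer.clear()

        def lookup(self, key):
            return self.buffer.get(key, self.children.get(key))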

12:00
Jhonatan Cléto (Universidade Estadual de Campinas (UNICAMP), Brazil)
Guilherme Valarini (Universidade Estadual de Campinas (UNICAMP), Brazil)
Marcio Pereira (Universidade Estadual de Campinas (UNICAMP), Brazil)
Guido Araujo (Universidade Estadual de Campinas (UNICAMP), Brazil)
Hervé Yviquel (Universidade Estadual de Campinas (UNICAMP), Brazil)
Scalable OpenMP Remote Offloading via Asynchronous MPI and Coroutine-Driven Communication

ABSTRACT. Heterogeneous multi-node clusters with accelerators, such as GPUs, are increasingly the standard for HPC, where the MPI+OpenMP approach is commonly used for application development. However, this approach poses significant challenges for developers, especially in managing communication, synchronization, and load balancing across distributed nodes and accelerators. To address these challenges, this paper proposes the MPI Proxy Plugin (MPP), an extension of the LLVM OpenMP Offloading runtime that transparently offloads OpenMP target regions to remote accelerators via MPI. By abstracting communication and using the asynchronous mechanisms of MPI with C++20 coroutines, MPP provides a scalable alternative to MPI+OpenMP, enabling simpler development of heterogeneous HPC applications through the familiar OpenMP programming model. Experimental results show that MPP achieves excellent scalability. For a compute-intensive proxy application, it scales nearly linearly, reaching a 63x speedup from 1 to 64 GPUs. While naive data transfers can degrade performance, this research reveals that extending OpenMP Target with collective operations (e.g., multi-device broadcast) simplifies development and improves performance, achieving up to 7x speedup in communication-bound benchmarks.

12:30-14:00 Lunch Break
14:00-15:00 Session 12A: Track 1.1: Performance Analysis and Simulation
14:00
Anna-Lena Roth (Hochschule Fulda, University of Applied Sciences, Germany)
David James (Hochschule Fulda, University of Applied Sciences, Germany)
Michael Kuhn (Otto von Guericke University Magdeburg, Germany)
Dustin Frisch (Hochschule Fulda, University of Applied Sciences, Germany)
Making MPI Collective Operations Visible: Understanding Their Utility and Algorithmic Insights

ABSTRACT. In-depth understanding of collective MPI communication is a major challenge for both beginners and experienced developers. It is challenging to grasp the operations of collective algorithms, as many performance analysis tools cannot break down collective operations into their underlying point-to-point communication. However, collective communication is a key factor in optimizing the performance of parallel programs. EduMPI is a novel tool for parallel programming education, providing near-real-time visualization of collective MPI algorithms. It displays per-process data exchange and highlights performance issues like Late Sender and Late Receiver. This paper introduces the visualization of collective communication in EduMPI, demonstrates how this representation aids in understanding the concept of collective communication and its relevance for performance optimization, and evaluates its effectiveness in an educational context. EduMPI bridges the gap in understanding complex collective MPI operations by providing a transparent view of the underlying processes, enabling students to visualize and better understand the communication flow.

14:20
Jaewoo Son (Seoul National University, South Korea)
Youngchul Yoon (Seoul National University, South Korea)
Soonhoi Ha (Seoul National University, South Korea)
TSim4CXL: Trace-driven Simulation Framework for CXL-based High-Performance Computing Systems

ABSTRACT. Compute Express Link (CXL) is recognized as a revolutionary technology in high-performance computing (HPC) system design, driven by the growing demand for efficient and scalable memory solutions tailored to memory-centric workloads. However, despite its potential, evaluating CXL performance in real-world scenarios is challenging due to the lack of CXL hardware and the high costs of building a large-scale distributed system. To address this, we propose TSim4CXL, a novel trace-driven simulation framework for CXL-based HPC systems that provides accurate timing simulations within a practical timeframe. TSim4CXL separates computing resources from the CXL memory system, generating traces and simulating the memory system using SystemC’s discrete-event modeling. By modeling the CXL interconnect at the protocol level with various configuration parameters, TSim4CXL allows us to explore the design space of HPC architecture. The accuracy of our CXL simulation model is validated using CXL hardware provided by Samsung Electronics. First, we compare load latencies using a custom microbenchmark on CXL hardware with simulation results and adjust the CXL parameters in our simulator accordingly. Second, we assess communication latency by running LAMMPS applications, ensuring the simulation results align with real-world performance. In addition, we perform design space exploration with two memory-centric applications, up to 25 CPU nodes for LAMMPS and 4 GPU nodes for LLM training. Furthermore, we compare the performance of target applications by executing multiple DRAM simulators, demonstrating how the memory bandwidth affects simulated time. These experiments prove the viability of the proposed simulation framework.

14:40
Solomon Bekele (Argonne National Laboratory, United States)
Aurelio Vivas (University De Los Andes, Colombia)
Thomas Applencourt (Argonne National Laboratory, United States)
Kazutomo Yoshii (Argonne National Laboratory, United States)
Swann Perarnau (Argonne National Laboratory, United States)
Servesh Muralidharan (Argonne National Laboratory, United States)
Bryce Allen (Argonne National Laboratory, United States)
Brice Videau (Argonne National Laboratory, United States)
THAPI: Tracing Heterogeneous APIs

ABSTRACT. The computational power of high-performance computing systems leaped a big stride over the last decade. However, this remarkable performance enhancement comes entangled with increased system complexity. These systems now integrate heterogeneous computing components and diverse programming models. The programming models are usually implemented on top of each other, making performance analysis and debugging increasingly challenging.

This paper proposes THAPI (Tracing Heterogeneous APIs), a portable, programming model-centric tracing framework that provides detailed programming model insights for debugging and performance optimization in heterogeneous HPC environments. THAPI traces comprehensive API call details across major programming models, ensuring fine-grained introspection and programming model context capture. In addition to this, THAPI integrates device sampling that enables real-time monitoring of critical metrics such as power, frequency, and utilization. The tracing framework is built on the Linux Trace Toolkit Next Generation (LTTng) for efficiency and scalability. THAPI also provides an analysis tool that leverages Babeltrace-based analysis to deliver a holistic view of execution, making it an essential tool for diagnosing performance bottlenecks and optimizing system behavior.

14:00-15:00 Session 12B: Track 6.2: Learning systems
14:00
Xinrui Yang (Harbin Institute of Technology, Shenzhen, China)
Shaohuai Shi (Harbin Institute of Technology, Shenzhen, China)
SQ-DeAR: Sparsified and Quantized Gradient Compression for Distributed Training

ABSTRACT. The data-parallel distributed training technique is the de facto approach to training large-scale deep neural networks (DNNs) using synchronous stochastic gradient descent (S-SGD). However, S-SGD requires iteratively aggregating the distributed gradients through an AllReduce collective, which easily results in significant data communication across distributed GPUs and thus limits the scaling efficiency of the training system. In this paper, we propose an efficient and practical gradient sparsification and quantization algorithm, named SQ-DeAR, which not only significantly reduces the communication traffic, but also allows overlapping communications with both feed-forward and backpropagation computations to further improve the training performance. In addition, to improve the computation efficiency of gradient sparsification, we design a batched gradient sparsification to reduce the number of GPU launches. Performance evaluation on a 32-GPU cluster shows that SQ-DeAR outperforms state-of-the-art solutions by 1.17x-7.0x.
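
A minimal stand-in for the kind of compression the abstract combines: top-k sparsification followed by 8-bit quantization of the surviving values. This is illustrative only and is not the SQ-DeAR codepath.

    # Top-k sparsification + symmetric int8 quantization of one gradient tensor.
    import numpy as np

    def sparsify_quantize(grad, density=0.01):
        flat = grad.ravel()
        k = max(1, int(density * flat.size))
        idx = np.argpartition(np.abs(flat), -k)[-k:]        # indices of top-k magnitudes
        vals = flat[idx]
        scale = np.abs(vals).max() / 127.0 or 1.0           # symmetric int8 scale
        q = np.round(vals / scale).astype(np.int8)
        return idx.astype(np.int32), q, scale               # what actually goes on the wire

    def decompress(idx, q, scale, shape):
        out = np.zeros(np.prod(shape), dtype=np.float32)
        out[idx] = q.astype(np.float32) * scale
        return out.reshape(shape)

    g = np.random.randn(1024, 1024).astype(np.float32)
    idx, q, s = sparsify_quantize(g)
    print(idx.nbytes + q.nbytes, "bytes sent instead of", g.nbytes)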

14:20
Samuel Wiggins (University of Southern California, United States)
Nikunj Gupta (University of Southern California, United States)
Grace Zgheib (Altera, United States)
Mahesh Iyer (Altera, United States)
Viktor Prasanna (University of Southern California, United States)
Accelerating Independent Multi-Agent Reinforcement Learning on Multi-GPU Platforms

ABSTRACT. Multi-Agent Reinforcement Learning (MARL) enables multiple autonomous agents to simultaneously learn and make decisions in complex, interactive environments. Among various MARL paradigms, Independent Learning (IL) remains a dominant approach due to its simplicity and scalability, where each agent optimizes its policy independently, treating others as part of the environment. While IL eliminates the need for explicit inter-agent communication, existing MARL implementations fail to exploit its inherent parallelism. Current implementations train agents sequentially on a single accelerator, leading to severe underutilization of modern compute resources, particularly on multi-GPU platforms. In this work, we propose a multi-GPU training scheme that efficiently distributes independent agent policies across compute devices without altering the original IL semantics. To further enhance scalability, we design a dynamic load-balancing strategy that adaptively assigns training workloads based on computational demands and the varying capabilities of different GPUs, ensuring efficient utilization of hardware resources. Our approach achieves up to 15.5x higher throughput than state-of-the-art MARL implementations, demonstrating that fully leveraging the parallelism of IL can significantly accelerate MARL training, opening new possibilities for large-scale multi-agent learning in high-dimensional environments. We open-source our work with optimized implementations of widely used independent learning algorithms, enabling scalable MARL training on diverse accelerator platforms.
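
As a toy picture of load-aware placement (not the paper's scheduler), the sketch below greedily assigns each agent's training workload to the GPU that would finish it soonest, weighting accumulated load by each GPU's relative speed. All inputs are invented.

    # Greedy load-aware placement of independent agents across heterogeneous GPUs.

    def place_agents(agent_costs, gpu_speeds):
        """agent_costs: {agent: estimated step cost}; gpu_speeds: {gpu: relative speed}."""
        load = {g: 0.0 for g in gpu_speeds}
        placement = {}
        for agent, cost in sorted(agent_costs.items(), key=lambda kv: -kv[1]):
            gpu = min(load, key=lambda g: (load[g] + cost) / gpu_speeds[g])
            placement[agent] = gpu
            load[gpu] += cost
        return placement

    print(place_agents({"a0": 4.0, "a1": 1.0, "a2": 3.0, "a3": 2.0},
                       {"gpu0": 1.0, "gpu1": 2.0}))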

14:40
Wenxiang Lin (Harbin Institute of Technology, Shenzhen, China)
Xinglin Pan (The Hong Kong University of Science and Technology, GuangZhou, China)
Shaohuai Shi (Harbin Institute of Technology, Shenzhen, China)
Xuan Wang (Harbin Institute of Technology, Shenzhen, China)
Xiaowen Chu (The Hong Kong University of Science and Technology, Guangzhou, China)
ScheInfer: Efficient Inference of Large Language Models with Task Scheduling on Moderate GPUs

ABSTRACT. Large language models (LLMs) are known for their high demand on computing resources and memory due to their substantial model size, which leads to inefficient inference on moderate GPU systems. Techniques like quantization or pruning can shrink model sizes but often impair accuracy, making them unsuitable for practical applications. In this work, we introduce ScheInfer, a high-performance inference engine designed to speed up LLM inference without compromising model accuracy. ScheInfer incorporates three innovative methods to increase inference efficiency: 1) model partitioning to allow asynchronous processing of tasks across CPU computation, GPU computation, and CPU-GPU communication, 2) an adaptive partition algorithm to optimize the use of CPU, GPU, and PCIe communication capabilities, and 3) a token assignment strategy to handle diverse prompt and generation tasks during LLM inference. Comprehensive experiments were conducted with various LLMs such as Mixtral, LLaMA-2, Qwen, and PhiMoE across three test environments featuring different CPUs and GPUs. The experimental findings demonstrate that ScheInfer achieves speeds between 1.11× and 1.80× faster in decoding and 1.69× to 6.33× faster in pre-filling, leading to an overall speedup ranging from 1.25× to 2.04× compared to state-of-the-art solutions, llama.cpp and Fiddler.

15:00-16:00 Coffee Break and Poster Session
16:00-17:30 Session 13A: Track 2.2: Scheduling, Resource Management, Cloud, Edge Computing, and Workflows
16:00
Jianfeng Gu (Technical University of Munich, Germany)
Puxuan Wang (Technical University of Munich, Germany)
Isaac David Núñez Araya (Technical University of Munich, Germany)
Kai Huang (Sun Yat-sen University, China)
Michael Gerndt (Technical University of Munich, Germany)
HAS-GPU: Efficient Hybrid Auto-scaling with Fine-grained GPU Allocation for SLO-aware Serverless Inferences

ABSTRACT. Serverless Computing (FaaS) has become a popular paradigm for deep learning inference due to the ease of deployment and pay-per-use benefits. However, current serverless inference platforms suffer from coarse-grained and static GPU resource allocation during scaling, which leads to high costs and Service Level Objective (SLO) violations under fluctuating workloads. Meanwhile, current platforms only support horizontal scaling for GPU inference, so the cold-start problem further exacerbates these issues. In this paper, we propose HAS-GPU, an efficient Hybrid Auto-scaling Serverless architecture with fine-grained GPU allocation for deep learning inference. HAS-GPU proposes an agile scheduler capable of allocating GPU Streaming Multiprocessor (SM) partitions and time quotas with arbitrary granularity and enables significant vertical quota scalability at runtime. To resolve the performance uncertainty introduced by massive fine-grained resource configuration spaces, we propose the Resource-aware Performance Predictor (RaPP). Furthermore, we present an adaptive hybrid auto-scaling algorithm with both horizontal and vertical scaling to ensure inference SLOs and minimize GPU costs. Experiments demonstrate that, compared to the mainstream serverless inference platform, HAS-GPU reduces function costs by an average of 10.8x with better SLO guarantees. Compared to a state-of-the-art spatio-temporal GPU sharing serverless framework, HAS-GPU reduces function SLO violations by 4.8x and cost by 1.72x on average.

16:20
Yiming Sun (Institute of Computing Technology, Chinese Academy of Sciences, China)
Jiaqi Zhang (Institute of Computing Technology, Chinese Academy of Sciences, China)
Jie Zhang (Institute of Computing Technology, Chinese Academy of Sciences, China)
Huawei Cao (Institute of Computing Technology, Chinese Academy of Sciences, China)
Xuejun An (Institute of Computing Technology, Chinese Academy of Sciences, China)
Xiaochun Ye (Institute of Computing Technology, Chinese Academy of Sciences, China)
CGP-Graphless: Towards Efficient Serverless Graph Processing via CPU-GPU Pipelined Collaboration

ABSTRACT. The serverless computing model offers users flexible, pay-as-you-go services. However, existing frameworks face challenges such as resource over-subscription and workload over-scaling when deploying graph processing jobs in serverless environments. To address these limitations, we introduce CGP-Graphless, which utilizes CPU-GPU heterogeneous computing to enable efficient vertical scaling. This approach divides graph processing into two phases: querying on a core proxy graph and correction on the full graph. GPU containers are allocated for the querying phase, while CPU containers handle the correction phase. Furthermore, we propose an adaptive pipelined scheduling strategy between these phases, which leverages pressure-aware intra-pipeline scaling to transform excessive horizontal scaling into vertical scaling, thereby achieving efficient serverless graph computation. Experiments show that CGP-Graphless improves end-to-end performance by up to 2.00x over FaaSGraph under concurrent stress evaluations, while halving allocated CPU cores through additional GPU containers. In short-interval query scenarios, CGP-Graphless further reduces average request latency by 3.30x compared to FaaSGraph.

16:40
Kaicheng Guo (Shanghai Jiao Tong University, China)
Jingyi Chen (Shanghai Jiao Tong University, China)
Yun Wang (Shanghai Jiao Tong University, China)
Semakin Anton (Huawei Technologies Co., Ltd., Russia)
Tovmachenko Dmitry (Huawei Technologies Co., Ltd., Russia)
Jiajie Sheng (Shanghai Jiao Tong University, China)
Jianwen Wei (Shanghai Jiao Tong University, China)
James Lin (Shanghai Jiao Tong University, China)
Zhengwei Qi (Shanghai Jiao Tong University, China)
Haibing Guan (Shanghai Jiao Tong University, China)
Design and Operation of Elastic GPU-pooling on Campus

ABSTRACT. With the rapid advancement of Artificial Intelligence (AI) and High-Performance Computing (HPC), GPU-centric clusters have become pivotal in driving research across diverse disciplines. Consequently, campus data centers are increasingly aggregating substantial computing resources. However, akin to commercial GPU and AI clusters, campus GPU clusters frequently encounter challenges related to GPU underutilization.

While pooling technologies offer a viable solution to this issue, existing GPU pooling approaches fall short in effectively supporting environments where multiple applications coexist within campus data centers.

This paper presents gPooling, a novel pooling scheme that leverages device driver hijacking to optimize GPU resource allocation. We designed a benchmark based on real-world traces from a campus data center and deployed gPooling within a GPU cluster environment. Experimental results from both benchmarking and actual deployment demonstrate that gPooling significantly enhances GPU utilization and reduces user waiting times, thereby improving the overall efficiency of campus GPU clusters.

17:00
Mingxuan Liu (Northwestern Polytechnical University, China)
Jianhua Gu (Northwestern Polytechnical University, China)
Tianhai Zhao (Northwestern Polytechnical University, China)
ServerlessRec: Fast Serverless Inference for Embedding-based Recommender Systems with Disaggregated Memory

ABSTRACT. Embedding-based recommender systems (RecSys) face critical bottlenecks in storing massive embedding tables (EMTs) and accelerating latency-sensitive lookup operations. While serverless computing and disaggregated memory architectures offer resource efficiency, existing solutions struggle with cold-start penalties, state transfer overheads, and rigid resource scaling for EMT-bound workloads. We present ServerlessRec, a serverless framework that integrates memory disaggregation with kernel-space RDMA-based remote memory mapping (rmap) and remote fork (rfork) to enable elastic EMT lookups. ServerlessRec decouples compute nodes (CNs) from memory nodes (MNs), dynamically autoscaling serverless functions on CNs for bursty lookups while storing EMTs on MNs. ServerlessRec improves latency-bounded throughput by 3.8× over DisaggRec and reduces resource waste by 62% compared to ElasticRec, demonstrating how kernel-level memory disaggregation unlocks serverless advantages for EMT-bound applications.

16:00-17:30 Session 13B: Track 6.3: Stream, Image and Sequence Processing
16:00
Apurv Deepak Kulkarni (Center for Scalable Data Analytics and Artificial Intelligence Dresden/Leipzig, Germany)
Siavash Ghiasvand (Center for Scalable Data Analytics and Artificial Intelligence Dresden/Leipzig, Germany)
SProBench: Stream Processing Benchmark for High Performance Computing Infrastructure

ABSTRACT. Despite recent advancements in data stream processing frameworks that emphasize the importance of real-time data handling, scalability remains a significant challenge, directly impacting both throughput and latency. Although prior research has investigated this issue on local machines and cloud clusters, there remains a gap in studies exploring its implications for modern high performance computing (HPC) systems, primarily due to the scarcity of scalable measurement tools. This work introduces SProBench, a novel benchmark designed to evaluate the scaling performance of data stream processing frameworks on HPC systems, incorporating a modular architecture and various workload pipelines. Notably, this benchmark suite is, to the best of our knowledge, the first to provide native support for SLURM-based clusters. The current implementation of SProBench enables out-of-the-box performance evaluation for Apache Flink, Apache Spark Streaming, and Apache Kafka Streams, while allowing for easy integration of additional frameworks with minor modifications. Building upon established best practices from previous research, SProBench is designed for effective execution on HPC systems and is highly scalable. Real-world cluster evaluations demonstrate the benchmark's linear scaling capabilities. Furthermore, this study outlines the collection and integration of metrics from external monitoring tools for post-processing analysis. These data confirm and validate the benchmark's functionality and efficient scaling performance.

16:20
Lifeng Yan (Shandong University, China)
Zekun Yin (Shandong University, China)
Qixin Chang (Shandong University, China)
Tong Zhang (Shandong University, China)
Zhisong Wang (Shandong University, China)
Xiaohui Duan (Shandong University, China)
Bertil Schmidt (Johannes Gutenberg University, Germany)
Weiguo Liu (Shandong University, China)
SWBWA: A Highly Efficient NGS Aligner on the New Sunway Architecture

ABSTRACT. Sequence alignment is a crucial step in next-generation sequencing data analysis. However, most sequence aligners, including the industry gold standard BWA-MEM, face performance challenges due to high computational complexity and extensive random memory access patterns, making them a significant bottleneck in the overall analysis pipeline. The next-generation Sunway platform, with its high computational power and unique heterogeneous architecture, presents new opportunities for enhancing the efficiency of sequence alignment. In this work, we introduce SWBWA, a high-accuracy and high-performance sequence aligner designed for the new Sunway architecture. By redesigning the parallel framework tailored for Sunway, performing software prefetching optimization, vectorizing the striped Smith-Waterman algorithm, and addressing memory access bottlenecks in bigshare mode, SWBWA achieves a 330× speedup over the single-threaded unoptimized version. Additionally, SWBWA running on a Sunway workstation achieves 1.2–1.4× speedups compared to BWA-MEM running on a dual-socket 48-core x86 server, while ensuring nearly identical output. The source code is publicly available at https://github.com/RabbitBio/SWBWA.

16:40
Marie Reinbigler (Télécom SudParis - Institut Polytechnique de Paris, France)
Rishi Sharma (EPFL, Switzerland)
Rafael Pires (EPFL, Switzerland)
Elisabeth Brunet (Télécom Sudparis - Institut Polytechnique de Paris, Inria, France)
Anne-Marie Kermarrec (EPFL, Switzerland)
Catalin Fetita (Télécom Sudparis - Institut Polytechnique de Paris, France)
Efficient Pyramidal Analysis of Gigapixel Images on a Decentralized Modest Computer Cluster

ABSTRACT. Analyzing gigapixel images is recognized as computationally demanding. In this paper, we introduce PyramidAI, a technique for analyzing gigapixel images with reduced computational cost. The proposed approach adopts a gradual analysis of the image, beginning with lower resolutions and progressively concentrating on regions of interest for detailed examination at higher resolutions. We investigated two strategies for tuning the accuracy-computation performance trade-off when implementing the adaptive resolution selection, validated against the Camelyon16 dataset of biomedical images. Our results demonstrate that PyramidAI substantially decreases the amount of data processed for analysis by up to 2.65×, while preserving the accuracy in identifying relevant sections on a single computer. To democratize gigapixel image analysis, we evaluated the potential of using mainstream computers to perform the computation by exploiting the parallelism inherent in the approach. Using a simulator, we estimated the best data distribution and load-balancing algorithm according to the number of workers. The selected algorithms were implemented and confirmed the same conclusions in a real-world setting. Analysis time is reduced from more than an hour to a few minutes using 12 modest workers, offering a practical solution for efficient large-scale image analysis.
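
A condensed sketch of the pyramidal idea under stated assumptions: coarse tiles are scored first, and only tiles passing a threshold are refined into their children at the next resolution. The score_tile classifier and the (x, y) tile indexing are placeholders, not PyramidAI's implementation.

    # Coarse-to-fine tile refinement over an image pyramid (coarsest level first).

    def analyze_pyramid(pyramid, score_tile, threshold=0.5):
        """pyramid: list of levels, coarsest first; each level maps (x, y) tile
        coordinates to image data.  Returns the full-resolution tiles kept."""
        candidates = list(pyramid[0].keys())              # every tile of the coarsest level
        for level in pyramid[:-1]:                        # refine level by level
            kept = [t for t in candidates if score_tile(level[t]) >= threshold]
            # Each kept tile expands to its 2x2 children at the next, finer level.
            candidates = [c for t in kept for c in children_of(t)]
        return [(t, pyramid[-1][t]) for t in candidates if t in pyramid[-1]]

    def children_of(tile):
        x, y = tile
        return [(2 * x, 2 * y), (2 * x + 1, 2 * y),
                (2 * x, 2 * y + 1), (2 * x + 1, 2 * y + 1)]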