EURO-PAR 2025: 31ST INTERNATIONAL EUROPEAN CONFERENCE ON PARALLEL AND DISTRIBUTED COMPUTING
PROGRAM FOR WEDNESDAY, AUGUST 27TH

10:30-11:00 Coffee Break
11:00-12:30 Session 11A: Track 2.1: Scheduling, Resource Management, Cloud, Edge Computing, and Workflows
11:00
Daniel Medeiros (KTH Royal Institute of Technology, Sweden)
Jeremy Williams (KTH Royal Institute of Technology, Sweden)
Jacob Wahlgren (KTH Royal Institute of Technology, Sweden)
Leonardo Saud Maia Leite (KTH Royal Institute of Technology, Sweden)
Ivy Peng (KTH Royal Institute of Technology, Sweden)
ARC-V: Vertical Resource Adaptivity for HPC Workloads in Containerized Environments

ABSTRACT. Existing state-of-the-art vertical autoscalers for containerized environments are traditionally built for cloud applications, which may behave differently from HPC workloads and their dynamic resource consumption. In such environments, autoscalers can produce inefficient resource allocations. This work analyzes nine representative HPC applications with different memory consumption patterns. Our results identify the limitations and inefficiencies of the Kubernetes Vertical Pod Autoscaler (VPA) for enabling memory-elastic execution of HPC applications. We propose, implement, and evaluate ARC-V, a policy that leverages both in-flight resource updates of pods in Kubernetes and knowledge of the memory consumption patterns of HPC applications to achieve elastic memory provisioning at the node level. Our results show that ARC-V effectively saves memory while eliminating out-of-memory errors compared to the standard Kubernetes VPA.
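
The abstract only names the policy; as a rough illustration of node-level vertical memory scaling, the Python sketch below picks a new container memory limit from recent usage samples. Every name, threshold, and the headroom rule is invented for illustration and is not ARC-V's actual algorithm.

    # Hypothetical sketch of a node-level vertical memory-scaling decision,
    # inspired by the idea of in-flight pod resource updates; not ARC-V itself.

    def next_memory_limit(usage_samples_mib, node_free_mib,
                          headroom=1.15, min_limit_mib=256):
        """Pick a new memory limit from recent usage samples of one container.

        usage_samples_mib: recent working-set sizes (MiB), newest last.
        node_free_mib:     memory still unallocated on the node (MiB).
        headroom:          safety factor above the observed peak to avoid OOM.
        """
        peak = max(usage_samples_mib)
        # Grow proactively if the trend is upward (e.g. an expanding phase of
        # the application), otherwise shrink back toward the observed peak.
        trend = usage_samples_mib[-1] - usage_samples_mib[0]
        target = peak * headroom + max(trend, 0)
        # Never request more than what the node can still hand out.
        target = min(target, peak + node_free_mib)
        return max(int(target), min_limit_mib)

    # Example: a job whose footprint oscillates between roughly 3.2 and 3.9 GiB.
    print(next_memory_limit([3300, 3900, 3400, 4000], node_free_mib=2048))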

11:20
Thomas Jakobsche (University of Basel, Switzerland)
Osman S. Simsek (University of Basel, Switzerland)
Jim Brandt (Sandia National Laboratories, United States)
Ann Gentile (Sandia National Laboratories, United States)
Florina M. Ciorba (University of Basel, Switzerland)
An Autonomy Loop for Dynamic HPC Job Time Limit Adjustment

ABSTRACT. High Performance Computing (HPC) systems rely on fixed user-provided estimates of job time limits. These estimates are often inaccurate, resulting in inefficient resource use and the loss of unsaved work if a job times out shortly before reaching its next checkpoint. This work proposes a novel feedback-driven autonomy loop that dynamically adjusts HPC job time limits based on checkpoint progress reported by applications. Our approach monitors checkpoint intervals and queued jobs, enabling informed decisions to either early cancel a job after its last completed checkpoint or extend the time limit sufficiently to accommodate the next checkpoint. The objective is to minimize tail waste, that is, the computation that occurs between the last checkpoint and the termination of a job, which is not saved and hence wasted. Through experiments conducted on a subset of a production workload trace, we show a 95% reduction of tail waste, which equates to saving approximately 1.3% of the total CPU time that would otherwise be wasted. We propose various policies that combine early cancellation and time limit extension, achieving tail waste reduction while improving scheduling metrics such as weighted average job wait time. This work contributes an autonomy loop for improved scheduling in HPC environments, where system job schedulers and applications collaborate to significantly reduce resource waste and improve scheduling performance.
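
A hedged sketch of the kind of decision such an autonomy loop makes: given the observed checkpoint interval and the job's remaining time limit, either extend the limit enough to reach the next checkpoint or cancel right after the last completed one. Function and parameter names are illustrative, not the paper's interface.

    # Illustrative tail-waste policy: extend the limit to cover the next
    # checkpoint, or cancel after the last one.  All times are in seconds.

    def adjust_time_limit(elapsed, time_limit, last_ckpt, ckpt_interval,
                          extension_cap=1800):
        """Return ('keep', limit), ('extend', new_limit) or ('cancel', cancel_time)."""
        next_ckpt = last_ckpt + ckpt_interval
        if next_ckpt <= time_limit:
            return ("keep", time_limit)           # next checkpoint fits anyway
        needed = next_ckpt - time_limit
        if needed <= extension_cap:
            return ("extend", next_ckpt)          # a small extension saves the work
        # Extension too costly: everything after the last checkpoint would be
        # tail waste, so cancel the job right after that checkpoint completes.
        return ("cancel", max(elapsed, last_ckpt))

    print(adjust_time_limit(elapsed=6800, time_limit=7200,
                            last_ckpt=6600, ckpt_interval=900))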

11:40
Rajat Bhattarai (Tennessee Tech University, United States)
Howard Pritchard (Los Alamos National Laboratory, United States)
Sheikh Ghafoor (Tennessee Tech University, United States)
Enabling Elasticity in Scientific Workflows for High Performance Computing Systems

ABSTRACT. Modern research relies on scientific workflows to manage complex computational tasks across high-performance computing (HPC) systems. Most HPC systems still use a largely static resource allocation strategy, which means resources are fixed during runtime. However, modern workflows are increasingly data and event-driven, especially with the integration of artificial intelligence tasks. Static workflows lack the flexibility to adapt at runtime to evolving workflow and system demands, often impacting resource utilization and overall efficiency.

To address these concerns, scientific workflows require compute elasticity, the capability to dynamically adjust resource usage based on computational load and system availability. Elastic workflows can enhance computational efficiency by leveraging adaptive scheduling algorithms and smart resource management strategies. However, implementing elastic workflows in HPC environments is challenging, as it requires modifications across the entire HPC software stack, including workflow managers, resource managers, and middleware layers. This paper presents the design and implementation of an elastic workflow framework for HPC, highlighting key challenges and design considerations. Our framework is based on an elastic Parsl workflow manager and a custom elastic resource manager, both leveraging capabilities of an implementation of the Process Management Interface for Exascale (PMIx) API as middleware. Through real-world and synthetic workflow case studies, we demonstrate improved resource utilization and reduced execution times, showcasing the benefits of elasticity in HPC workflows.
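
As a toy companion to the elasticity idea (not the Parsl/PMIx framework described above), the sketch below shows one possible resize rule a workflow manager could apply: grow the allocation while tasks queue up, shrink it when workers sit idle. All names and thresholds are invented.

    # Illustrative elasticity rule: scale out while tasks queue faster than
    # they drain, scale in when a large share of workers is idle.

    def resize_allocation(queued_tasks, idle_workers, current_nodes,
                          min_nodes=1, max_nodes=64, tasks_per_node=4):
        if queued_tasks > current_nodes * tasks_per_node and idle_workers == 0:
            return min(max_nodes, current_nodes * 2)     # scale out
        if idle_workers > current_nodes * tasks_per_node // 2:
            return max(min_nodes, current_nodes // 2)    # scale in
        return current_nodes                             # steady state

    print(resize_allocation(queued_tasks=37, idle_workers=0, current_nodes=4))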

12:00
Marta Navarro (Universitat Politècnica de València, Spain)
Vicent Pallardó-Julià (Universitat de València, Spain)
Salvador Petit (Universitat Politècnica de València, Spain)
Maria Gomez (Universitat Politècnica de València, Spain)
Julio Sahuquillo (Universitat Politècnica de València, Spain)
WAPA: A Workload-Agnostic CPI-Based Thread-to-Core Allocation Policy

ABSTRACT. Simultaneous multithreading (SMT) processors improve system throughput by sharing core resources among the threads running on the same core. However, intra-core interference can cause co-running applications to degrade each other's performance significantly.

To address this issue, some approaches have focused on balancing contention at the core's shared resources (e.g., the shared L1 data cache). A key advantage of these approaches is that they are workload-agnostic. Other approaches improve on the previous ones by modeling the inter-application interference across the intra-core shared resources. Unfortunately, these approaches require offline model training for specific workloads.

This paper presents WAPA, a CPI-based thread-to-core allocation approach that incorporates the best of both worlds. WAPA is a workload-agnostic policy that implicitly accounts for inter-thread interference across all the shared resources by leveraging the CPI. The proposed approach relies on optimal transport (OT) theory, a mathematical framework, to dynamically select symbiotic pairs of applications.

Experimental results on an Intel Xeon show that WAPA outperforms the default Linux scheduler by 8.4% in IPC on average for workloads dominated by cache and main-memory latencies, while the performance gains of existing approaches stay below 3.2%.
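
The optimal-transport formulation is the paper's contribution; the toy heuristic below only captures the underlying intuition of pairing complementary CPI profiles (a memory-bound thread with a compute-bound one) on the same SMT core. It is a greedy stand-in, not WAPA.

    # Toy thread-to-core pairing driven only by measured CPI: pair the most
    # memory-bound thread (highest CPI) with the most compute-bound one
    # (lowest CPI) on the same SMT core.

    def pair_by_cpi(cpi_by_thread):
        """cpi_by_thread: {thread_id: measured CPI}.  Returns a list of core pairs."""
        order = sorted(cpi_by_thread, key=cpi_by_thread.get)
        pairs = []
        while len(order) >= 2:
            compute_bound = order.pop(0)   # lowest CPI
            memory_bound = order.pop(-1)   # highest CPI
            pairs.append((compute_bound, memory_bound))
        if order:                          # odd thread count: run it alone
            pairs.append((order.pop(),))
        return pairs

    print(pair_by_cpi({"lbm": 2.9, "namd": 0.6, "mcf": 3.4, "povray": 0.5}))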

11:00-12:30 Session 11B: Track 3.1: Neural Network Acceleration and Optimization
11:00
Yudong Mu (Institute of Computing Technology, Chinese Academy of Sciences, China)
Zhihua Fan (Institute of Computing Technology, Chinese Academy of Sciences, China)
Xiaoxia Yao (China Mobile Research Institute, China)
Wenming Li (Institute of Computing Technology, Chinese Academy of Sciences, China)
Zhiyuan Zhang (Institute of Computing Technology, Chinese Academy of Sciences, China)
Honglie Wang (Institute of Automation, Chinese Academy of Sciences, China)
Xuejun An (Institute of Computing Technology, Chinese Academy of Sciences, China)
Xiaochun Ye (Institute of Computing Technology, Chinese Academy of Sciences, China)
FDHA: Fusion-Driven Heterogeneous Accelerator for Efficient Diffusion Model Inference

ABSTRACT. Diffusion models have emerged as powerful tools for generative AI tasks. While prior research primarily focuses on eliminating redundancy across timesteps, models like Stable Diffusion introduce a ResNet-Transformer Alternating Execution (RTAE) Pattern, where convolution and attention operators execute sequentially within each timestep. This execution pattern creates two major challenges: (1) excessive on-chip memory access due to frequent data movement between convolution and Transformer operations, and (2) underutilization of computational resources caused by the distinct processing characteristics of these two operators. To tackle these challenges, we propose FDHA, an accelerator designed for efficient diffusion model inference. First, to mitigate redundant on-chip memory access, FDHA introduces an inter-operator dataflow fusion mechanism that strategically aligns ResNet's convolution and Transformer's matrix multiplication dimensions, enabling efficient kernel reuse. Second, to maximize computational resource utilization, FDHA employs a heterogeneous architecture with dedicated Processing Elements for convolutions and Tensor Processing Elements for matrix multiplications, allowing for pipelined execution. Experimental results demonstrate that FDHA achieves 3.28x speedup over an NVIDIA A100 GPU and 2.62x speedup over a SoTA diffusion accelerator.

11:20
Jiale Dong (University of Science and Technology of China, China)
Hao Wu (University of Science and Technology of China, China)
Zihao Wang (University of Science and Technology of China, China)
Wenqi Lou (University of Science and Technology of China, China)
Zhendong Zheng (University of Science and Technology of China, China)
Lei Gong (University of Science and Technology of China, China)
Chao Wang (University of Science and Technology of China, China)
Xuehai Zhou (University of Science and Technology of China, China)
CoQMoE: Co-Designed Quantization and Computation Orchestration for Mixture-of-Experts Vision Transformer on FPGA

ABSTRACT. Vision Transformers (ViTs) exhibit superior performance in computer vision tasks but face deployment challenges on resource-constrained devices due to high computational/memory demands. While Mixture-of-Experts Vision Transformers (MoE-ViTs) mitigate this through scalable architecture with sub-linear computational growth, their hardware implementation on FPGAs remains constrained by resource limitations. This paper proposes a novel accelerator for efficiently implementing quantized MoE models on FPGAs through two key innovations: (1) A dual-stage quantization scheme combining precision-preserving complex quantizers with hardware-friendly simplified quantizers via scale reparameterization, with only 0.28% accuracy loss compared to full precision; (2) A resource-aware accelerator architecture featuring latency-optimized streaming attention kernels and reusable linear operators, effectively balancing performance and resource consumption. Experimental results demonstrate that our accelerator achieves nearly 155 frames per second, a 5.35x improvement in throughput, and over 80% energy reduction compared to state-of-the-art (SOTA) FPGA MoE accelerators, while maintaining <1% accuracy loss across vision benchmarks. The code will be open-sourced.

11:40
Joonyup Kwon (Korea University, South Korea)
Jinhyeok Choi (Korea University, South Korea)
Ngoc-Son Pham (Korea University, South Korea)
Sangwon Shin (Korea University, South Korea)
Taeweon Suh (Korea University, South Korea)
SkipNZ: Non-Zero Value Skipping for Efficient CNN Acceleration

ABSTRACT. This paper proposes SkipNZ, a novel approach to reduce computational demands with negligible accuracy loss, improving the speed of sparse tensor convolutional neural networks. SkipNZ extends existing zero-value skipping techniques and enables skipping MAC operations for non-zero values. The evaluation results demonstrate that the proposed technique reduces computational load and improves overall performance with negligible accuracy loss. Compared to the baseline, SkipNZ reduces execution time to 0.71× for AlexNet with Gap9 at a 0.1% accuracy loss, and VGG16 with Gap8 achieves a reduction to 0.78× with no accuracy loss. The reduction in execution time comes from the filtering of unnecessary MAC operations introduced by SkipNZ. Synthesis results show that the additional hardware required by SkipNZ amounts to only a 0.005% increase over the baseline.

12:00
Piyumal Ranawaka (Chalmers University of Technology, Sweden)
Per Stenstrom (Chalmers University of Technology, Sweden)
BATCH-DNN: Adaptive and Dynamic Batching for Multi-DNN Accelerators

ABSTRACT. Multi-DNN accelerators enable the simultaneous execution of multiple DNN workloads, enhancing performance by overlapping their computations and memory accesses. However, this increases the demand on on-chip memory, which must handle the footprints of several workloads. Batching allows multiple consecutive DNN inferences from the same model to share weights, improving weight reuse and reducing off-chip access costs over a batch. However, for batching to be effective, the input feature maps of all inferences in the batch must fit in on-chip memory. Thus, the offline-determined batch size of each workload can exhaust the available on-chip memory when multiple workloads share it, resulting in reduced performance.

This paper introduces BATCH-DNN, a dynamic method for adapting batch size on a layer-by-layer basis to the available on-chip memory, enhancing multi-DNN accelerator performance. It employs two techniques: adaptive cascaded sub-batching and adaptive sub-batch merging. Offline profiling assesses the footprint, while run-time adjustment establishes the maximum batch size for each layer based on the available on-chip memory.
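
As a back-of-the-envelope illustration of layer-wise batch adaptation (not BATCH-DNN's algorithm), the batch size used for a layer can be capped by how many input feature maps fit in the on-chip memory left over for that workload. Names and numbers below are invented.

    # Toy per-layer batch cap: weights are shared across the batch, so the
    # leftover SRAM determines how many input feature maps can be batched.

    def layer_batch_size(ifmap_bytes, weight_bytes, sram_free_bytes,
                         requested_batch):
        usable = sram_free_bytes - weight_bytes        # weights stored once per batch
        if usable <= 0:
            return 1                                   # fall back to unbatched execution
        return max(1, min(requested_batch, usable // ifmap_bytes))

    # 512 KiB free, 128 KiB of weights, 56x56x64 int8 feature maps (~196 KiB each)
    print(layer_batch_size(ifmap_bytes=56 * 56 * 64, weight_bytes=128 * 1024,
                           sram_free_bytes=512 * 1024, requested_batch=8))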

11:00-12:30 Session 11C: Track 6.1: Memory and I/O Systems
11:00
Yisu Wang (HKUST (GZ), China)
Xinjiao Li (HKUST (GZ), China)
Ruilong Wu (HKUST (GZ), China)
Huangxun Chen (HKUST (GZ), China)
Dirk Kutscher (HKUST (GZ), Germany)
NetSenseML: Network-Adaptive Compression for Efficient Distributed Machine Learning

ABSTRACT. Training large-scale distributed machine learning models imposes considerable demands on network infrastructure, often resulting in sudden traffic spikes that lead to congestion, increased latency, and reduced throughput—ultimately affecting convergence times and overall training performance. While gradient compression techniques are commonly employed to alleviate network load, they frequently compromise model accuracy due to the loss of gradient information.

This paper introduces NetSenseML, a novel network-adaptive distributed deep learning framework that dynamically adjusts quantization, pruning, and compression strategies in response to real-time network conditions. By actively monitoring network performance, NetSenseML optimizes the balance between reducing data payloads and preserving model accuracy, minimizing the trade-offs between transmission efficiency, convergence speed, and performance.

Our approach ensures efficient resource usage by adapting reduction techniques to current network conditions, leading to shorter convergence times and improved training efficiency. We present the design of the NetSenseML adaptive data reduction function, and experimental evaluations show that NetSenseML can improve training throughput by a factor of 1.55× to 9.84× compared to state-of-the-art compression-enabled systems for representative DDL training jobs under bandwidth-constrained conditions.
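
A hypothetical sketch of a network-adaptive reduction policy in the spirit described above: the scarcer the measured bandwidth relative to what uncompressed gradients would need, the more aggressively they are sparsified and quantized. The ladder of (density, bits) settings is invented and is not NetSenseML's actual policy.

    # Invented adaptive-compression ladder keyed on available vs. required bandwidth.

    def pick_compression(available_gbps, required_gbps):
        headroom = available_gbps / required_gbps   # >= 1 means the link can take raw gradients
        if headroom >= 1.0:
            return {"density": 1.0, "bits": 32}     # healthy network: send FP32 as-is
        if headroom >= 0.25:
            return {"density": 0.1, "bits": 16}     # mild congestion: top-10% + FP16
        return {"density": 0.01, "bits": 8}         # heavy congestion: top-1% + INT8

    # A job needing ~80 Gb/s of gradient traffic on a link currently giving ~15 Gb/s:
    print(pick_compression(available_gbps=15, required_gbps=80))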

11:20
John W. Romein (Stichting ASTRON (Netherlands Institute for Radio Astronomy), Netherlands)
Breaking the I/O Barrier: 1.2 Tb/s Ethernet Packet Processing on a GPU

ABSTRACT. Radio telescopes produce enormous amounts of data. Many of them use GPU clusters to combine the digitized antenna signals, usually in real time. Achieving high data rates is challenging: the PCIe bandwidth of discrete GPUs is limited, and without RDMA, handling 200 or 400 Gb/s Ethernet packets with telescope data is difficult.

The NVIDIA Grace Hopper is a novel, innovative system that eliminates the I/O bottleneck of traditional, discrete GPUs by using NVLink instead of PCIe. This opens the door to higher data rates, but faster hardware alone is not enough. In this paper, we combine hardware and software innovations to process Ethernet packets at no less than 1.2 Tb/s, a huge improvement over what was previously possible. We use the Data Plane Development Kit to minimize the receive overhead, and use a new feature that allows packet processing directly by the GPU. We demonstrate the data handling in a correlator application, analyze the performance, and show how to reduce the energy use.

The presented innovations enable the use of GPUs for more powerful telescopes with much higher data rates. The results are also of interest to (GPU) applications from other application domains with high I/O demands, especially if RDMA is not available.

11:40
Tianyu Wan (Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan, Hubei, China)
Shijia Gong (Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan, Hubei, China)
Yangyang Hu (Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan, Hubei, China)
Jianxi Chen (Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan, Hubei, China)
GECKO: A Write-optimized Hybrid Index based on Disaggregated Memory

ABSTRACT. Disaggregated memory separates compute and memory nodes into independent resource pools connected via RDMA or CXL links. Disaggregated memory improves resource utilization, reduces cost overhead, and ensures elastic scalability of memory and compute resources. Tree indexes are essential components of storage systems such as databases and KV stores. Existing disaggregated memory systems suffer from poor write performance, mainly due to concurrency conflicts, frequent structure modification operations (SMOs), and high lock overhead on the tree index.

To solve this problem, we propose GECKO, a write-optimized Adaptive Radix Tree (ART) index structure for disaggregated memory. We leverage 1) a write-optimized buffer node to handle concurrent writes and improve write performance, 2) a threshold-based splitting strategy to reduce splits and optimize SMOs, and 3) a post-insertion lock design to reduce lock overhead and insertion tail latency. We compare GECKO with state-of-the-art solutions; experiments show that GECKO improves throughput by 1.43×-3.21× under write workloads, while SMO time is reduced by 88.5% and lock time by 87.9%.
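
To make the buffering idea concrete (only the idea; not GECKO's ART layout, RDMA path, or locking protocol), the toy below stages inserts in a small buffer and merges them into the tree in bulk, so one merge amortizes many would-be structural modifications.

    # Toy "buffer node": writes land in a staging buffer and are merged into
    # the (stand-in) subtree in one bulk step once a threshold is reached.

    class BufferedNode:
        def __init__(self, flush_threshold=64):
            self.buffer = {}                      # staged key -> value writes
            self.children = {}                    # stands in for the real subtree
            self.flush_threshold = flush_threshold

        def insert(self, key, value):
            self.buffer[key] = value              # cheap, append-style write
            if len(self.buffer) >= self.flush_threshold:
                self._flush()                     # one bulk structure update

        def _flush(self):
            self.children.update(self.buffer)     # stands in for a batched SMO
            self.buffer.clear()

        def lookup(self, key):
            return self.buffer.get(key, self.children.get(key))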

12:00
Jhonatan Cléto (Universidade Estadual de Campinas (UNICAMP), Brazil)
Guilherme Valarini (Universidade Estadual de Campinas (UNICAMP), Brazil)
Marcio Pereira (Universidade Estadual de Campinas (UNICAMP), Brazil)
Guido Araujo (Universidade Estadual de Campinas (UNICAMP), Brazil)
Hervé Yviquel (Universidade Estadual de Campinas (UNICAMP), Brazil)
Scalable OpenMP Remote Offloading via Asynchronous MPI and Coroutine-Driven Communication

ABSTRACT. Heterogeneous multi-node clusters with accelerators, such as GPUs, are increasingly the standard for HPC, where the MPI+OpenMP approach is commonly used for application development. However, this approach poses significant challenges for developers, especially in managing communication, synchronization, and load balancing across distributed nodes and accelerators. To address these challenges, this paper proposes the MPI Proxy Plugin (MPP), an extension of the LLVM OpenMP Offloading runtime that transparently offloads OpenMP target regions to remote accelerators via MPI. By abstracting communication and using the asynchronous mechanisms of MPI with C++20 coroutines, MPP provides a scalable alternative to MPI+OpenMP, enabling simpler development of heterogeneous HPC applications through the familiar OpenMP programming model. Experimental results show that MPP achieves excellent scalability. For a compute-intensive proxy application, it scales nearly linearly, reaching a 63x speedup from 1 to 64 GPUs. While naive data transfers can degrade performance, this research reveals that extending OpenMP Target with collective operations (e.g., multi-device broadcast) simplifies development and improves performance, achieving up to 7x speedup in communication-bound benchmarks.

12:30-14:00 Lunch Break
14:00-15:00 Session 12A: Track 1.1: Performance Analysis and Simulation
14:00
Anna-Lena Roth (Hochschule Fulda, University of Applied Sciences, Germany)
David James (Hochschule Fulda, University of Applied Sciences, Germany)
Michael Kuhn (Otto von Guericke University Magdeburg, Germany)
Dustin Frisch (Hochschule Fulda, University of Applied Sciences, Germany)
Making MPI Collective Operations Visible: Understanding Their Utility and Algorithmic Insights

ABSTRACT. In-depth understanding of collective MPI communication is a major challenge for both beginners and experienced developers. It is challenging to grasp the operations of collective algorithms, as many performance analysis tools cannot break down collective operations into their underlying point-to-point communication. However, collective communication is a key factor in optimizing the performance of parallel programs. EduMPI is a novel tool for parallel programming education, providing near-real-time visualization of collective MPI algorithms. It displays per-process data exchange and highlights performance issues like Late Sender and Late Receiver. This paper introduces the visualization of collective communication in EduMPI, demonstrates how this representation aids in understanding the concept of collective communication and its relevance for performance optimization, and evaluates its effectiveness in an educational context. EduMPI bridges the gap in understanding complex collective MPI operations by providing a transparent view of the underlying processes, enabling students to visualize and better understand the communication flow.

14:20
Jaewoo Son (Seoul National University, South Korea)
Youngchul Yoon (Seoul National University, South Korea)
Soonhoi Ha (Seoul National University, South Korea)
TSim4CXL: Trace-driven Simulation Framework for CXL-based High-Performance Computing Systems

ABSTRACT. Compute Express Link (CXL) is recognized as a revolutionary technology in high-performance computing (HPC) system design, driven by the growing demand for efficient and scalable memory solutions tailored to memory-centric workloads. However, despite its potential, evaluating CXL performance in real-world scenarios is challenging due to the lack of CXL hardware and the high costs of building a large-scale distributed system. To address this, we propose TSim4CXL, a novel trace-driven simulation framework for CXL-based HPC systems that provides accurate timing simulations within a practical timeframe. TSim4CXL separates computing resources from the CXL memory system, generating traces and simulating the memory system using SystemC’s discrete-event modeling. By modeling the CXL interconnect at the protocol level with various configuration parameters, TSim4CXL allows us to explore the design space of HPC architecture. The accuracy of our CXL simulation model is validated using CXL hardware provided by Samsung Electronics. First, we compare load latencies using a custom microbenchmark on CXL hardware with simulation results and adjust the CXL parameters in our simulator accordingly. Second, we assess communication latency by running LAMMPS applications, ensuring the simulation results align with real-world performance. In addition, we perform design space exploration with two memory-centric applications, up to 25 CPU nodes for LAMMPS and 4 GPU nodes for LLM training. Furthermore, we compare the performance of target applications by executing multiple DRAM simulators, demonstrating how the memory bandwidth affects simulated time. These experiments prove the viability of the proposed simulation framework.

14:40
Solomon Bekele (Argonne National Laboratory, United States)
Aurelio Vivas (University De Los Andes, Colombia)
Thomas Applencourt (Argonne National Laboratory, United States)
Kazutomo Yoshii (Argonne National Laboratory, United States)
Swann Perarnau (Argonne National Laboratory, United States)
Servesh Muralidharan (Argonne National Laboratory, United States)
Bryce Allen (Argonne National Laboratory, United States)
Brice Videau (Argonne National Laboratory, United States)
THAPI: Tracing Heterogeneous APIs

ABSTRACT. The computational power of high-performance computing systems leaped a big stride over the last decade. However, this remarkable performance enhancement comes entangled with increased system complexity. These systems now integrate heterogeneous computing components and diverse programming models. The programming models are usually implemented on top of each other, making performance analysis and debugging increasingly challenging.

This paper proposes THAPI (Tracing Heterogeneous APIs), a portable, programming model-centric tracing framework that provides detailed programming model insights for debugging and performance optimization in heterogeneous HPC environments. THAPI traces comprehensive API call details across major programming models, ensuring fine-grained introspection and programming model context capture. In addition to this, THAPI integrates device sampling that enables real-time monitoring of critical metrics such as power, frequency, and utilization. The tracing framework is built on the Linux Trace Toolkit Next Generation (LTTng) for efficiency and scalability. THAPI also provides an analysis tool that leverages Babeltrace-based analysis to deliver a holistic view of execution, making it an essential tool for diagnosing performance bottlenecks and optimizing system behavior.

14:00-15:00 Session 12B: Track 6.2: Learning systems
14:00
Xinrui Yang (Harbin Institute of Technology, Shenzhen, China)
Shaohuai Shi (Harbin Institute of Technology, Shenzhen, China)
SQ-DeAR: Sparsified and Quantized Gradient Compression for Distributed Training

ABSTRACT. The data-parallel distributed training technique is the de facto approach to training large-scale deep neural networks (DNNs) using synchronous stochastic gradient descent (S-SGD). However, S-SGD requires iteratively aggregating the distributed gradients through an AllReduce collective, which easily results in significant data communication across distributed GPUs and thus limits the scaling efficiency of the training system. In this paper, we propose an efficient and practical gradient sparsification and quantization algorithm, named SQ-DeAR, which not only significantly reduces the communication traffic, but also allows overlapping communications with both feed-forward and backpropagation computations to further improve the training performance. In addition, to improve the computation efficiency of gradient sparsification, we design a batched gradient sparsification to reduce the number of GPU launches. Performance evaluation on a 32-GPU cluster shows that SQ-DeAR outperforms state-of-the-art solutions by 1.17x-7.0x.
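
A minimal stand-in for the kind of compression the abstract combines: top-k sparsification followed by 8-bit quantization of the surviving values. This is illustrative only and is not the SQ-DeAR codepath.

    # Top-k sparsification + symmetric int8 quantization of one gradient tensor.
    import numpy as np

    def sparsify_quantize(grad, density=0.01):
        flat = grad.ravel()
        k = max(1, int(density * flat.size))
        idx = np.argpartition(np.abs(flat), -k)[-k:]        # indices of top-k magnitudes
        vals = flat[idx]
        scale = np.abs(vals).max() / 127.0 or 1.0           # symmetric int8 scale
        q = np.round(vals / scale).astype(np.int8)
        return idx.astype(np.int32), q, scale               # what actually goes on the wire

    def decompress(idx, q, scale, shape):
        out = np.zeros(np.prod(shape), dtype=np.float32)
        out[idx] = q.astype(np.float32) * scale
        return out.reshape(shape)

    g = np.random.randn(1024, 1024).astype(np.float32)
    idx, q, s = sparsify_quantize(g)
    print(idx.nbytes + q.nbytes, "bytes sent instead of", g.nbytes)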

14:20
Samuel Wiggins (University of Southern California, United States)
Nikunj Gupta (University of Southern California, United States)
Grace Zgheib (Altera, United States)
Mahesh Iyer (Altera, United States)
Viktor Prasanna (University of Southern California, United States)
Accelerating Independent Multi-Agent Reinforcement Learning on Multi-GPU Platforms

ABSTRACT. Multi-Agent Reinforcement Learning (MARL) enables multiple autonomous agents to simultaneously learn and make decisions in complex, interactive environments. Among various MARL paradigms, Independent Learning (IL) remains a dominant approach due to its simplicity and scalability, where each agent optimizes its policy independently, treating others as part of the environment. While IL eliminates the need for explicit inter-agent communication, existing MARL implementations fail to exploit its inherent parallelism. Current implementations train agents sequentially on a single accelerator, leading to severe underutilization of modern compute resources, particularly on multi-GPU platforms. In this work, we propose a multi-GPU training scheme that efficiently distributes independent agent policies across compute devices without altering the original IL semantics. To further enhance scalability, we design a dynamic load-balancing strategy that adaptively assigns training workloads based on computational demands and the varying capabilities of different GPUs, ensuring efficient utilization of hardware resources. Our approach achieves up to 15.5x higher throughput than state-of-the-art MARL implementations, demonstrating that fully leveraging the parallelism of IL can significantly accelerate MARL training, opening new possibilities for large-scale multi-agent learning in high-dimensional environments. We open-source our work with optimized implementations of widely used independent learning algorithms, enabling scalable MARL training on diverse accelerator platforms.
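
As a toy picture of load-aware placement (not the paper's scheduler), the sketch below greedily assigns each agent's training workload to the GPU that would finish it soonest, weighting accumulated load by each GPU's relative speed. All inputs are invented.

    # Greedy load-aware placement of independent agents across heterogeneous GPUs.

    def place_agents(agent_costs, gpu_speeds):
        """agent_costs: {agent: estimated step cost}; gpu_speeds: {gpu: relative speed}."""
        load = {g: 0.0 for g in gpu_speeds}
        placement = {}
        for agent, cost in sorted(agent_costs.items(), key=lambda kv: -kv[1]):
            gpu = min(load, key=lambda g: (load[g] + cost) / gpu_speeds[g])
            placement[agent] = gpu
            load[gpu] += cost
        return placement

    print(place_agents({"a0": 4.0, "a1": 1.0, "a2": 3.0, "a3": 2.0},
                       {"gpu0": 1.0, "gpu1": 2.0}))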

14:40
Wenxiang Lin (Harbin Institute of Technology, Shenzhen, China)
Xinglin Pan (The Hong Kong University of Science and Technology, GuangZhou, China)
Shaohuai Shi (Harbin Institute of Technology, Shenzhen, China)
Xuan Wang (Harbin Institute of Technology, Shenzhen, China)
Xiaowen Chu (The Hong Kong University of Science and Technology, Guangzhou, China)
ScheInfer: Efficient Inference of Large Language Models with Task Scheduling on Moderate GPUs

ABSTRACT. Large language models (LLMs) are known for their high demand on computing resources and memory due to their substantial model size, which leads to inefficient inference on moderate GPU systems. Techniques like quantization or pruning can shrink model sizes but often impair accuracy, making them unsuitable for practical applications. In this work, we introduce ScheInfer, a high-performance inference engine designed to speed up LLM inference without compromising model accuracy. ScheInfer incorporates three innovative methods to increase inference efficiency: 1) model partitioning to allow asynchronous processing of tasks across CPU computation, GPU computation, and CPU-GPU communication, 2) an adaptive partition algorithm to optimize the use of CPU, GPU, and PCIe communication capabilities, and 3) a token assignment strategy to handle diverse prompt and generation tasks during LLM inference. Comprehensive experiments were conducted with various LLMs such as Mixtral, LLaMA-2, Qwen, and PhiMoE across three test environments featuring different CPUs and GPUs. The experimental findings demonstrate that ScheInfer achieves speeds between 1.11× and 1.80× faster in decoding and 1.69× to 6.33× faster in pre-filling, leading to an overall speedup ranging from 1.25× to 2.04× compared to state-of-the-art solutions, llama.cpp and Fiddler.

15:00-16:00 Coffee Break and Poster Session
16:00-17:30 Session 13A: Track 2.2: Scheduling, Resource Management, Cloud, Edge Computing, and Workflows
16:00
Jianfeng Gu (Technical University of Munich, Germany)
Puxuan Wang (Technical University of Munich, Germany)
Isaac David Núñez Araya (Technical University of Munich, Germany)
Kai Huang (Sun Yat-sen University, China)
Michael Gerndt (Technical University of Munich, Germany)
HAS-GPU: Efficient Hybrid Auto-scaling with Fine-grained GPU Allocation for SLO-aware Serverless Inferences

ABSTRACT. Serverless Computing (FaaS) has become a popular paradigm for deep learning inference due to the ease of deployment and pay-per-use benefits. However, current serverless inference platforms suffer from coarse-grained and static GPU resource allocation during scaling, which leads to high costs and Service Level Objective (SLO) violations under fluctuating workloads. Meanwhile, current platforms only support horizontal scaling for GPU inference, so the cold-start problem further exacerbates these issues. In this paper, we propose HAS-GPU, an efficient Hybrid Auto-scaling Serverless architecture with fine-grained GPU allocation for deep learning inference. HAS-GPU proposes an agile scheduler capable of allocating GPU Streaming Multiprocessor (SM) partitions and time quotas with arbitrary granularity and enables significant vertical quota scalability at runtime. To resolve the performance uncertainty introduced by massive fine-grained resource configuration spaces, we propose the Resource-aware Performance Predictor (RaPP). Furthermore, we present an adaptive hybrid auto-scaling algorithm with both horizontal and vertical scaling to ensure inference SLOs and minimize GPU costs. Experiments demonstrate that, compared to the mainstream serverless inference platform, HAS-GPU reduces function costs by an average of 10.8x with better SLO guarantees. Compared to a state-of-the-art spatio-temporal GPU sharing serverless framework, HAS-GPU reduces function SLO violations by 4.8x and cost by 1.72x on average.

16:20
Yiming Sun (Institute of Computing Technology, Chinese Academy of Sciences, China)
Jiaqi Zhang (Institute of Computing Technology, Chinese Academy of Sciences, China)
Jie Zhang (Institute of Computing Technology, Chinese Academy of Sciences, China)
Huawei Cao (Institute of Computing Technology, Chinese Academy of Sciences, China)
Xuejun An (Institute of Computing Technology, Chinese Academy of Sciences, China)
Xiaochun Ye (Institute of Computing Technology, Chinese Academy of Sciences, China)
CGP-Graphless: Towards Efficient Serverless Graph Processing via CPU-GPU Pipelined Collaboration

ABSTRACT. The serverless computing model offers users flexible, pay-as-you-go services. However, existing frameworks face challenges such as resource over-subscription and workload over-scaling when deploying graph processing jobs in serverless environments. To address these limitations, we introduce CGP-Graphless, which utilizes CPU-GPU heterogeneous computing to enable efficient vertical scaling. This approach divides graph processing into two phases: querying on a core proxy graph and correction on the full graph. GPU containers are allocated for the querying phase, while CPU containers handle the correction phase. Furthermore, we propose an adaptive pipelined scheduling strategy between these phases, which leverages pressure-aware intra-pipeline scaling to transform excessive horizontal scaling into vertical scaling, thereby achieving efficient serverless graph computation. Experiments show that CGP-Graphless improves end-to-end performance by up to 2.00x over FaaSGraph under concurrent stress evaluations, while halving allocated CPU cores through additional GPU containers. In short-interval query scenarios, CGP-Graphless further reduces average request latency by 3.30x compared to FaaSGraph.

16:40
Kaicheng Guo (Shanghai Jiao Tong University, China)
Jingyi Chen (Shanghai Jiao Tong University, China)
Yun Wang (Shanghai Jiao Tong University, China)
Semakin Anton (Huawei Technologies Co., Ltd., Russia)
Tovmachenko Dmitry (Huawei Technologies Co., Ltd., Russia)
Jiajie Sheng (Shanghai Jiao Tong University, China)
Jianwen Wei (Shanghai Jiao Tong University, China)
James Lin (Shanghai Jiao Tong University, China)
Zhengwei Qi (Shanghai Jiao Tong University, China)
Haibing Guan (Shanghai Jiao Tong University, China)
Design and Operation of Elastic GPU-pooling on Campus

ABSTRACT. With the rapid advancement of Artificial Intelligence (AI) and High-Performance Computing (HPC), GPU-centric clusters have become pivotal in driving research across diverse disciplines. Consequently, campus data centers are increasingly aggregating substantial computing resources. However, akin to commercial GPU and AI clusters, campus GPU clusters frequently encounter challenges related to GPU underutilization.

While pooling technologies offer a viable solution to this issue, existing GPU pooling approaches fall short in effectively supporting environments where multiple applications coexist within campus data centers.

This paper presents gPooling, a novel pooling scheme that leverages device driver hijacking to optimize GPU resource allocation. We designed a benchmark based on real-world traces from a campus data center and deployed gPooling within a GPU cluster environment. Experimental results from both benchmarking and actual deployment demonstrate that gPooling significantly enhances GPU utilization and reduces user waiting times, thereby improving the overall efficiency of campus GPU clusters.

17:00
Mingxuan Liu (Northwestern Polytechnical University, China)
Jianhua Gu (Northwestern Polytechnical University, China)
Tianhai Zhao (Northwestern Polytechnical University, China)
ServerlessRec: Fast Serverless Inference for Embedding-based Recommender Systems with Disaggregated Memory

ABSTRACT. Embedding-based recommender systems (RecSys) face critical bottlenecks in storing massive embedding tables (EMTs) and accelerating latency-sensitive lookup operations. While serverless computing and disaggregated memory architectures offer resource efficiency, existing solutions struggle with cold-start penalties, state transfer overheads, and rigid resource scaling for EMT-bound workloads. We present ServerlessRec, a serverless framework that integrates memory disaggregation with kernel-space RDMA-based remote memory mapping (rmap) and remote fork (rfork) to enable elastic EMT lookups. ServerlessRec decouples compute nodes (CNs) from memory nodes (MNs), dynamically autoscaling serverless functions on CNs for bursty lookups while storing EMTs on MNs. ServerlessRec improves latency-bounded throughput by 3.8× over DisaggRec and reduces resource waste by 62% compared to ElasticRec, demonstrating how kernel-level memory disaggregation unlocks serverless advantages for EMT-bound applications.

16:00-17:30 Session 13B: Track 6.3: Stream, Image and Sequence Processing
16:00
Apurv Deepak Kulkarni (Center for Scalable Data Analytics and Artificial Intelligence Dresden/Leipzig, Germany)
Siavash Ghiasvand (Center for Scalable Data Analytics and Artificial Intelligence Dresden/Leipzig, Germany)
SProBench: Stream Processing Benchmark for High Performance Computing Infrastructure

ABSTRACT. Despite recent advancements in data stream processing frameworks that emphasize the importance of real-time data handling, scalability remains a significant challenge, directly impacting both throughput and latency. Although prior research has investigated this issue on local machines and cloud clusters, there remains a gap in studies exploring its implications for modern high performance computing (HPC) systems, primarily due to the scarcity of scalable measurement tools. This work introduces SProBench, a novel benchmark designed to evaluate the scaling performance of data stream processing frameworks on HPC systems, incorporating a modular architecture and various workload pipelines. Notably, this benchmark suite is, to the best of our knowledge, the first to provide native support for SLURM-based clusters. The current implementation of SProBench enables out-of-the-box performance evaluation for Apache Flink, Apache Spark Streaming, and Apache Kafka Streams, while allowing for easy integration of additional frameworks with minor modifications. Building upon established best practices from previous research, SProBench is designed for effective execution on HPC systems and is highly scalable. Real-world cluster evaluations demonstrate the benchmark's linear scaling capabilities. Furthermore, this study outlines the collection and integration of metrics from external monitoring tools for post-processing analysis. These data confirm and validate the benchmark's functionality and efficient scaling performance.

16:20
Lifeng Yan (Shandong University, China)
Zekun Yin (Shandong University, China)
Qixin Chang (Shandong University, China)
Tong Zhang (Shandong University, China)
Zhisong Wang (Shandong University, China)
Xiaohui Duan (Shandong University, China)
Bertil Schmidt (Johannes Gutenberg University, Germany)
Weiguo Liu (Shandong University, China)
SWBWA: A Highly Efficient NGS Aligner on the New Sunway Architecture

ABSTRACT. Sequence alignment is a crucial step in next-generation sequencing data analysis. However, most sequence aligners, including the industry gold standard BWA-MEM, face performance challenges due to high computational complexity and extensive random memory access patterns, making them a significant bottleneck in the overall analysis pipeline. The next-generation Sunway platform, with its high computational power and unique heterogeneous architecture, presents new opportunities for enhancing the efficiency of sequence alignment. In this work, we introduce SWBWA, a high-accuracy and high-performance sequence aligner designed for the new Sunway architecture. By redesigning the parallel framework tailored for Sunway, performing software prefetching optimization, vectorizing the striped Smith-Waterman algorithm, and addressing memory access bottlenecks in bigshare mode, SWBWA achieves a 330× speedup over the single-threaded unoptimized version. Additionally, SWBWA running on a Sunway workstation achieves 1.2–1.4× speedups compared to BWA-MEM running on a dual-socket 48-core x86 server, while ensuring nearly identical output. The source code is publicly available at https://github.com/RabbitBio/SWBWA.

16:40
Marie Reinbigler (Télécom SudParis - Institut Polytechnique de Paris, France)
Rishi Sharma (EPFL, Switzerland)
Rafael Pires (EPFL, Switzerland)
Elisabeth Brunet (Télécom Sudparis - Institut Polytechnique de Paris, Inria, France)
Anne-Marie Kermarrec (EPFL, Switzerland)
Catalin Fetita (Télécom Sudparis - Institut Polytechnique de Paris, France)
Efficient Pyramidal Analysis of Gigapixel Images on a Decentralized Modest Computer Cluster

ABSTRACT. Analyzing gigapixel images is recognized as computationally demanding. In this paper, we introduce PyramidAI, a technique for analyzing gigapixel images with reduced computational cost. The proposed approach adopts a gradual analysis of the image, beginning with lower resolutions and progressively concentrating on regions of interest for detailed examination at higher resolutions. We investigated two strategies for tuning the accuracy-computation performance trade-off when implementing the adaptive resolution selection, validated against the Camelyon16 dataset of biomedical images. Our results demonstrate that PyramidAI substantially decreases the amount of data processed for analysis by up to 2.65×, while preserving the accuracy in identifying relevant sections on a single computer. To democratize gigapixel image analysis, we evaluated the potential of using mainstream computers to perform the computation by exploiting the parallelism inherent in the approach. Using a simulator, we estimated the best data distribution and load-balancing algorithm according to the number of workers. The selected algorithms were implemented and confirmed the same conclusions in a real-world setting. Analysis time is reduced from more than an hour to a few minutes using 12 modest workers, offering a practical solution for efficient large-scale image analysis.
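
A condensed sketch of the pyramidal idea under stated assumptions: coarse tiles are scored first, and only tiles passing a threshold are refined into their children at the next resolution. The score_tile classifier and the (x, y) tile indexing are placeholders, not PyramidAI's implementation.

    # Coarse-to-fine tile refinement over an image pyramid (coarsest level first).

    def analyze_pyramid(pyramid, score_tile, threshold=0.5):
        """pyramid: list of levels, coarsest first; each level maps (x, y) tile
        coordinates to image data.  Returns the full-resolution tiles kept."""
        candidates = list(pyramid[0].keys())              # every tile of the coarsest level
        for level in pyramid[:-1]:                        # refine level by level
            kept = [t for t in candidates if score_tile(level[t]) >= threshold]
            # Each kept tile expands to its 2x2 children at the next, finer level.
            candidates = [c for t in kept for c in children_of(t)]
        return [(t, pyramid[-1][t]) for t in candidates if t in pyramid[-1]]

    def children_of(tile):
        x, y = tile
        return [(2 * x, 2 * y), (2 * x + 1, 2 * y),
                (2 * x, 2 * y + 1), (2 * x + 1, 2 * y + 1)]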