EURO-PAR 2025: 31ST INTERNATIONAL EUROPEAN CONFERENCE ON PARALLEL AND DISTRIBUTED COMPUTING
PROGRAM FOR THURSDAY, AUGUST 28TH


10:00-10:30 Coffee Break
10:30-12:30 Session 15: Best Paper Session
10:30
Aurélien Delval (SiPearl - Université Paris-Saclay, UVSQ, Li-PaRAD, France)
Pablo de Oliveira Castro (Université Paris-Saclay, UVSQ, Li-PaRAD, France)
William Jalby (Université Paris-Saclay, UVSQ, Li-PaRAD, France)
Etienne Renault (SiPearl, France)
Noise injection for performance bottleneck analysis

ABSTRACT. Bottleneck evaluation is a crucial part of performance tuning of HPC applications, as it directly influences the search for optimizations and the selection of the best hardware for a given code. In this paper, we introduce a new model-agnostic, instruction-accurate framework for bottleneck analysis based on performance noise injection. This method provides a precise analysis that complements existing techniques, particularly in quantifying unused resource slack. Specifically, we classify programs based on whether they are limited by computation, data access bandwidth, or latency by injecting additional noise instructions that target specific bottleneck sources. Our approach is built on the LLVM compiler toolchain, ensuring easy portability across different architectures and microarchitectures, which constitutes an improvement over many state-of-the-art tools. We validate our framework on a range of hardware benchmarks and kernels, including a detailed study of a sparse-matrix–vector product (SPMXV) kernel, where we successfully detect distinct performance regimes. These insights further inform hardware selection, as demonstrated by our comparative evaluation between HBM and DDR memory systems.

10:50
Louis-Claude Canon (Univ. Marie et Louis Pasteur, CNRS, institut FEMTO-ST, F-25000 Besançon, France)
Anthony Dugois (Univ. Marie et Louis Pasteur, CNRS, institut FEMTO-ST, F-25000 Besançon, France)
Ismaël Jecker (Univ. Marie et Louis Pasteur, CNRS, institut FEMTO-ST, F-25000 Besançon, France)
Pierre-Cyrille Heam (Univ. Marie et Louis Pasteur, CNRS, institut FEMTO-ST, F-25000 Besançon, France)
Approximation Bounds for SLACK on Identical Parallel Machines

ABSTRACT. We consider the problem of scheduling tasks on homogeneous machines with SLACK. This heuristic works by sorting tasks in non-increasing order of costs, dividing them into sets of size m, the number of processors, and then scheduling them on processors in non-increasing order of slack with a list heuristic. Similarly to LPT, SLACK also has a small time complexity, O(n log n) where n is the number of tasks, and shows favorable empirical performance in some settings. However, no approximation guarantee has been provided for this heuristic. We provide a 4/3-approximation ratio that is slightly worse than with LPT, and this ratio is tight. We also derive better bounds in the case where task costs do not exceed a fraction of the optimal makespan. In particular, we show that SLACK is a (k + 2)/(k + 1) − 1/((k + 1)m)-approximation algorithm when the cost of any task is below OPT/k for k ≥ 2.
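
The heuristic as described above admits a compact sketch. The following Python is an illustration only, assuming arbitrary tie-breaking, zero-cost padding so the task count is a multiple of m, and a greedy least-loaded list schedule in the final step; the function name slack_schedule is a convenience of this sketch, not the paper's implementation.

    # Illustrative sketch of the SLACK heuristic on m identical machines.
    import heapq

    def slack_schedule(costs, m):
        tasks = sorted(costs, reverse=True)          # 1. non-increasing costs
        while len(tasks) % m != 0:
            tasks.append(0)                          # pad to a multiple of m
        # 2. consecutive groups of size m; a group's slack = max cost - min cost
        groups = [tasks[i:i + m] for i in range(0, len(tasks), m)]
        groups.sort(key=lambda g: g[0] - g[-1], reverse=True)
        # 3. list-schedule the tasks, group by group, onto the least-loaded machine
        heap = [(0.0, i) for i in range(m)]
        for group in groups:
            for cost in group:
                load, i = heapq.heappop(heap)
                heapq.heappush(heap, (load + cost, i))
        return max(load for load, _ in heap)         # makespan

    print(slack_schedule([7, 6, 5, 4, 3, 2], m=2))   # 14 (trivial lower bound: 13.5)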

11:10
Jiangying Xue (University of Electronic Science and Technology of China, China)
Tianyu Xiong (University of Electronic Science and Technology of China, China)
Lingwei Chao (University of Electronic Science and Technology of China, China)
Ruini Xue (University of Electronic Science and Technology of China, China)
SimPoint+: More Stable, Accurate and Efficient Program Analysis

ABSTRACT. This paper introduces SimPoint+, an enhanced sampled simulation methodology that addresses key limitations of the widely used SimPoint approach. SimPoint+ achieves greater stability, accuracy, and efficiency in program analysis through three major improvements: (1) UMAP-based dimensionality reduction for Basic Block Vectors, (2) a two-stage clustering approach utilizing HDBSCAN, and (3) a lightweight cycle calibration method. Furthermore, an automated hyperparameter tuning strategy accommodates diverse program characteristics for the first two components. Evaluation on SPEC CPU 2006 benchmarks demonstrates that SimPoint+ significantly outperforms SimPoint by yielding more consistent results across runs, reducing cycle error rates by 3-5 orders of magnitude, and decreasing required simulation time by 25%-55% overall. SimPoint+ facilitates more reliable and efficient sampled simulation for computer architecture research, providing a robust foundation for rapid design space exploration and performance analysis of complex processors.
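
As a rough illustration of the pipeline shape described in the abstract (dimensionality reduction of basic-block vectors followed by density-based clustering), the sketch below uses the open-source umap-learn and hdbscan packages; SimPoint+'s two-stage clustering, cycle calibration, and automated hyperparameter tuning are not reproduced, and all parameter values are placeholders.

    # Minimal sketch: reduce basic-block vectors (BBVs) with UMAP, cluster the
    # intervals with HDBSCAN, and keep one representative interval per cluster.
    import numpy as np
    import umap
    import hdbscan

    def pick_simulation_points(bbvs, n_components=10, min_cluster_size=5):
        reduced = umap.UMAP(n_components=n_components).fit_transform(bbvs)
        labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(reduced)
        points = {}
        for cluster in set(labels):
            if cluster == -1:                     # HDBSCAN noise points
                continue
            members = np.where(labels == cluster)[0]
            centroid = reduced[members].mean(axis=0)
            # Representative interval: the member closest to the cluster centroid.
            rep = members[np.argmin(np.linalg.norm(reduced[members] - centroid, axis=1))]
            points[cluster] = (rep, len(members) / len(bbvs))   # (interval, weight)
        return points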

11:30
Xuanzheng Wang (Tsinghua University, China)
Shuo Miao (Tsinghua University, China)
Zihan Zhu (Tsinghua University, China)
Peng Qu (Tsinghua University, China)
Youhui Zhang (Tsinghua University, China)
AlphaSparseTensor: Discovering Faster Sparse Matrix Multiplication Algorithms on GPUs for LLM Inference

ABSTRACT. As Large Language Models (LLMs) progressively scale in both model size and complexity, numerous pruning techniques have been proposed to mitigate the exponential growth of parameters. Bridging the gap between theoretical computational savings and practical performance gains faces fundamental challenges: (1) limited GPU support for unstructured sparsity, (2) mismatch between existing sparse kernels and LLM-specific sparsity requirements, and (3) heterogeneous sparsity patterns across model components.

We introduce AlphaSparseTensor, an automated algorithm discovery framework for Sparse Matrix-Matrix Multiplication (SpMM) optimization. Inspired by AlphaTensor's matrix multiplication paradigm, AlphaSparseTensor extends the concept to sparse domains by systematically minimizing block multiplication operations through dynamic-programming-based optimization. The framework automates workflow generation for efficient Multiply-Accumulate (MAC) plans via two key innovations: (1) adaptive sparsity pattern analysis to identify zero-blocks, and (2) hierarchical tiling strategies tailored to variable sparsity distributions and tensor dimensions. Furthermore, we optimize the GPU implementation of our matrix MAC paradigm, enhancing performance through computation-memory overlap and optimized memory layout restructuring.

Comprehensive evaluations demonstrate AlphaSparseTensor's performance across multiple benchmarks. On the Sparse Transformer dataset, it achieves speedup factors of 1.59x and 1.91x over cuBLAS and cuSPARSE respectively. For 70%-pruned LLaMA matrices (7B/13B/65B), our solution delivers average acceleration ratios of 4.05x (vs cuBLAS), 3.77x (vs cuSPARSE), 3.37x (vs PyTorch), and 2.39x (vs Sputnik). End-to-end inference tests on LLaMA (7B/13B/65B) show system-level improvements of 8.4x, 2.1x, 1.3x, and 1.2x respectively compared to cuBLAS, cuSPARSE, PyTorch, and Sputnik. The discovered algorithms have been open-sourced at: https://anonymous.4open.science/r/AlphaSparseTensor-0E9E
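
The basic saving AlphaSparseTensor builds on, skipping multiply-accumulate work for all-zero blocks, can be illustrated in a few lines of NumPy; the MAC-plan search, hierarchical tiling, and GPU kernel optimizations described above are well beyond this sketch, and the block size below is arbitrary.

    # Illustrative block-level SpMM: skip the multiply-accumulate step for any
    # all-zero block of A. The zero-block test and block size are placeholders.
    import numpy as np

    def block_spmm(A, B, bs=32):
        M, K = A.shape
        K2, N = B.shape
        assert K == K2 and M % bs == 0 and K % bs == 0
        C = np.zeros((M, N), dtype=A.dtype)
        for i in range(0, M, bs):
            for k in range(0, K, bs):
                a_blk = A[i:i + bs, k:k + bs]
                if not a_blk.any():              # zero block: skip the MAC entirely
                    continue
                C[i:i + bs, :] += a_blk @ B[k:k + bs, :]
        return C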

11:50
Jeffrey Spaan (University of Twente, Netherlands)
Kuan-Hsun Chen (University of Twente, Netherlands)
David A. Bader (NJIT, United States)
Ana-Lucia Varbanescu (University of Twente, Netherlands)
Wedge-Parallel Triangle Counting for GPUs

ABSTRACT. For fast processing of increasingly large graphs, triangle counting, a common building block of graph processing algorithms, is often performed on GPUs. However, applying massive parallelism to triangle counting is challenging due to the algorithm's inherent irregular access patterns and workload imbalance. In this work, we propose WeTriC, a novel wedge-parallel triangle counting algorithm for GPUs, which improves load balancing and efficiency by using fine(r)-grained parallelism through a lightweight static mapping of wedges to threads. Our theoretical analysis compares different parallelization granularities, while optimizations enhance caching, reduce work per intersection, and minimize overhead. Performance experiments indicate that WeTriC yields 5.63x and 4.69x speedup over optimized vertex-parallel and edge-parallel binary search triangle counting algorithms, respectively. Furthermore, we show that WeTriC consistently outperforms the state-of-the-art (i.e., on avg. 2.86x faster than Trust and 2.32x faster than GroupTC).
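
For readers unfamiliar with the wedge formulation, the sequential sketch below illustrates what a wedge is and how one closes a triangle; WeTriC's contribution, the static mapping of wedges to GPU threads and the associated optimizations, is not shown, and the adjacency-dictionary representation is just a convenience of this sketch.

    # Sequential illustration of wedge-based triangle counting on an undirected
    # graph given as a dict of sorted neighbor lists. A wedge is a path u-v-w
    # centered at v; it closes a triangle iff the edge (u, w) exists.
    from bisect import bisect_left

    def count_triangles(adj):
        count = 0
        for v, nbrs in adj.items():
            for i in range(len(nbrs)):           # enumerate wedges centered at v
                for j in range(i + 1, len(nbrs)):
                    u, w = nbrs[i], nbrs[j]
                    nu = adj[u]
                    k = bisect_left(nu, w)       # binary-search the closing edge
                    if k < len(nu) and nu[k] == w:
                        count += 1
        return count // 3                        # each triangle has three centers

    adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
    print(count_triangles(adj))                  # 2: triangles (0,1,2) and (0,2,3)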

12:10
Abhijeet Sahu (Indian Institute of Technology Tirupati, India)
Andaluri S P V M Aditya (Indian Institute of Technology Tirupati, India)
Gadhamsetty Ramakrishna (Indian Institute of Technology Tirupati, India)
Malleti Sai Nikhil (Indian Institute of Technology Tirupati, India)
Kishore Kothapalli (International Institute of Information Technology Hyderabad, India)
Dip Sankar Banerjee (Indian Institute of Technology Jodhpur, India)
External GPU Biconnected Components

ABSTRACT. As the scale of graph analytics continues to grow, many applications require identifying biconnected components (BCCs) and cut vertices in graphs that exceed the memory capacity of a single GPU. This paper presents an out-of-core, GPU-based batch processing algorithm designed to efficiently compute BCCs and cut vertices in massive graphs that do not fit entirely into device memory. We propose a novel batch technique to process the graph incrementally, and maintain a Biconnectivity Compressed Graph to compute BCCs and cut vertices. Experimental results on a range of large-scale benchmark graphs demonstrate that our technique achieves competitive performance compared to state-of-the-art CPU solutions, enabling the handling of graph instances previously considered intractable on GPU platforms.

12:30-14:00 Lunch Break
14:00-15:30 Session 16A: Track 1.2: Compilers, Optimizations, and Scheduling
14:00
Wei Li (Chongqing University, China)
Ao Ren (Chongqing University, China)
Qingqiu Lan (Chongqing University, China)
Haining Fang (Chongqing University, China)
Zhenyu Wang (Chongqing University, China)
Yujuan Tan (Chongqing University, China)
Kan Zhong (Chongqing University, China)
Duo Liu (Chongqing University, China)
CoSF: A Co-Optimization Framework for Operator Splitting and Fusion

ABSTRACT. Compound operators, such as Log_softmax and RMSNorm, have been widely studied to enhance performance in deep neural networks (DNNs). Nonetheless, these operators often suffer from high hardware adaptation costs and limited optimization effects. AI compilers optimize them through operator splitting and successive operator fusion strategies. However, prior studies indiscriminately split all compound operators and failed to fully explore the fusion search space, incurring inefficient fusion schemes. To overcome these limitations, we propose a Co-Optimization Framework for Operator Splitting and Fusion (CoSF). In the operator splitting phase, we analyze memory reuse levels among operators and classify compound operators into three types according to their data locality. Then, we propose a fusion-aware splitting strategy: for each type of compound operator, it evaluates the successive fusion benefits after splitting the compound operator and automatically generates operator splitting strategies. In the operator fusion phase, to reduce the massive computation graph resulting from operator splitting, we propose a dominator-tree-based graph partitioning algorithm to efficiently partition the computation graph. We then employ dynamic programming for each partitioned subgraph to generate an optimized fusion strategy. Finally, we propose an evaluation model to select the most effective fusion solution from multiple candidates. Experimental results demonstrate that CoSF achieves a 1.3–3.4× speedup on GPU and a 1.59–3.93× speedup on CPU compared to AI compilers.

14:20
Jie Tong (University of Wisconsin-Madison, United States)
Wan-Luan Lee (University of Wisconsin–Madison, United States)
Umit Yusuf Ogras (University of Wisconsin-Madison, United States)
Tsung-Wei Huang (University of Wisconsin-Madison, United States)
Scalable Code Generation for RTL Simulation of Deep Learning Accelerators with MLIR

ABSTRACT. As deep learning accelerators scale in complexity, efficient Register Transfer Level (RTL) simulation becomes crucial for reducing the long runtime of hardware design and verification. However, existing RTL simulators struggle with high compilation overhead and slow simulation performance, particularly for large deep learning accelerator designs, where components are heavily reused and hierarchically structured. This inefficiency arises because existing simulators repeatedly regenerate and recompile redundant code, failing to leverage the structural parallelism inherent in deep learning accelerators. To address this challenge, we propose ScaleRTL, a scalable and unified code generation flow that automatically produces optimized parallel RTL simulation code for deep learning accelerators. Built on the MLIR infrastructure, ScaleRTL identifies repetitive design patterns, reduces code size and compilation time, and generates efficient simulation executables that exploit both CPU and GPU parallelism. Compared to state-of-the-art RTL simulators, ScaleRTL achieves a compilation speedup of three to five orders of magnitude and up to 15x and 300x simulation speedup on CPU and GPU, respectively.

14:40
Ivo Gabe de Wolff (Utrecht University, Netherlands)
David van Balen (Utrecht University, Netherlands)
Gabriele Keller (Utrecht University, Netherlands)
Scheduling Task and Data Parallelism in Array Languages with Work Assisting

ABSTRACT. High level languages for parallelism need to be performant on a wide range of workloads: they may be data-parallel and/or task-parallel, as well as regular or irregular. Scheduling, which is implemented via an interaction between the runtime system and the generated code, has a significant impact on the performance and scalability of these languages. In this paper, we demonstrate the integration of Work Assisting, our dynamic scheduler combining task-parallel and data-parallel schedulers, in combinator-based parallel array languages. These languages require fusion for high performance, and often feature scans to support irregular computations. Chained scans, the fastest parallel scans in our experiments, require a data-parallel scheduler as provided by Work Assisting. We show how code can be generated with support for fusion and chained scans, which can also fuse better than classic three-phase scans. We present the integration of Work Assisting into an actual compiler and runtime system of such a language, Accelerate, and evaluate its performance in this context for a range of applications.

15:00
Andre Rauber Du Bois (Universidade Federal de Pelotas, Brazil)
Gerson Cavalheiro (Universidade Federal de Pelotas, Brazil)
Polymorphic Higher-Order GPU Kernels

ABSTRACT. Graphics Processing Units (GPUs) are now widely used in computing systems, not only for graphics processing but also for general-purpose computing. Programming GPUs is challenging, as it is primarily done using low-level languages such as CUDA and OpenCL. Many approaches to simplifying GPU programming rely on algorithmic skeletons, i.e., higher-order functions that encapsulate common patterns of parallel computing. In these frameworks, programmers are provided with a set of skeletons, which must be combined to solve problems using the GPU. However, new skeletons can be implemented if they can be expressed as a combination of the available skeletons. Otherwise, extending skeleton libraries may require good knowledge of the underlying compiler/runtime system that supports them. This paper presents PolyHok, a low-level imperative domain-specific language (DSL) for GPU computing embedded in the Elixir functional language. PolyHok enables the implementation of polymorphic higher-order GPU kernels, i.e., GPU kernels that can accept device functions, including anonymous functions, as arguments and that are dynamically typed at runtime based on the arguments they receive. With such kernels, programmers can implement high-level abstractions typically associated with higher-order functions, such as algorithmic skeletons and array comprehensions. This paper details the design and current implementation of PolyHok and compares its performance with pure CUDA through experiments with six benchmarks.

14:00-15:30 Session 16B: Track 4.1: Scalable AI Optimization and Parallel Training
14:00
Xinjue Zheng (Huazhong University of Science and Technology, China)
Zhangqiang Ming (Huazhong University of Science and Technology, China)
Yuchong Hu (Huazhong University of Science and Technology, China)
Chenxuan Yao (Huazhong University of Science and Technology, China)
Wenxiang Zhou (Huazhong University of Science and Technology, China)
Rui Wang (Huazhong University of Science and Technology, China)
Xun Chen (Huazhong University of Science and Technology, China)
Dan Feng (Huazhong University of Science and Technology, China)
Saving Memory via Residual Reduction for DNN Training with Compressed Communication

ABSTRACT. Deep neural network (DNN) training systems suffer from communication bottlenecks among workers for gradient synchronization. Gradient compression reduces this overhead but impacts model accuracy, prompting the use of residuals to compensate for the loss. However, we observe that these residuals consume significant GPU memory but fortunately can be reduced with tiny accuracy impact. We propose ResiReduce, a memory-saving mechanism that reuses residuals across similar layers and applies strategic compression within specific layers. Experiments on local and cloud clusters show that ResiReduce can reduce the memory footprint of the model states by up to 15.7% while preserving the model accuracy and training throughput.
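
For context, the residuals referred to above come from error-feedback gradient compression; a minimal top-k variant is sketched below (using PyTorch, with an arbitrary compression ratio) so the per-layer residual buffers whose memory ResiReduce targets are concrete. ResiReduce's cross-layer residual reuse and in-layer compression are not reproduced here.

    # Minimal error-feedback (residual) top-k gradient compression for one tensor.
    # The persistent per-layer residual buffer is what consumes GPU memory at scale.
    import torch

    class TopKCompressor:
        def __init__(self, ratio=0.01):
            self.ratio = ratio
            self.residual = None                       # persistent residual buffer

        def compress(self, grad):
            if self.residual is None:
                self.residual = torch.zeros_like(grad)
            corrected = grad + self.residual           # add back last round's error
            k = max(1, int(corrected.numel() * self.ratio))
            flat = corrected.view(-1)
            _, idx = torch.topk(flat.abs(), k)
            values = flat[idx]
            residual = flat.clone()
            residual[idx] = 0                          # residual = what was not sent
            self.residual = residual.view_as(grad)
            return values, idx                         # sparse payload to all-reduce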

14:20
Jacob Garby (Chalmers University of Technology, Sweden)
Philippas Tsigas (Chalmers University of Technology, Sweden)
Interval-Asynchrony: Delimited Intervals of Localised Asynchrony for Fast Parallel SGD

ABSTRACT. Stochastic gradient descent (SGD) is a crucial optimisation algorithm due to its ubiquity in machine learning applications. As the quantity of data used for training becomes ever greater, and the relevance of artificial intelligence to both industry and science increases, so too does the importance of SGD scalability. Parallelism is a popular approach, but the standard synchronous formulation struggles due to significant synchronisation overhead. For this reason, asynchronous implementations are increasingly common. These improve throughput at the expense of introducing stale gradients, which reduce model accuracy. Previous approaches to mitigating the downsides of asynchronous processing include adaptively adjusting the number of worker threads or the learning rate, but at their core these methods remain fully asynchronous and therefore still suffer from lower accuracy due to staleness.

We propose Interval-Asynchrony, a semi-asynchronous method which retains high throughput while reducing gradient staleness, both on average as well as with a hard upper bound. Our method achieves this by introducing periodic asynchronous intervals, within which SGD is executed asynchronously, but between which gradient computations may not cross. The size of these intervals determines the degree of asynchrony, providing us with an adjustable scale. Since we observe that the optimal interval size varies over time, we additionally provide two online strategies for dynamic adjustment thereof. We evaluate our method against several baselines training deep neural networks on the CIFAR-10 and CIFAR-100 datasets, and demonstrate a 32% increase in training time as well as improved scalability with up to 128 threads.
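
The control structure described above, asynchronous updates inside an interval and a barrier at each interval boundary, can be sketched as follows. This is a schematic only: the shared model, the gradient step, and the fixed interval size are placeholders, and the paper's online interval-size adjustment strategies are not shown.

    # Control-flow sketch of interval-asynchrony: workers update a shared model
    # without coordination inside an interval, and a barrier at each interval
    # boundary prevents gradients from crossing intervals, bounding staleness.
    import threading

    def worker(model, data_shard, interval_size, n_intervals, barrier, grad_step):
        for _ in range(n_intervals):
            for _ in range(interval_size):
                grad_step(model, data_shard)   # lock-free (hogwild-style) update
            barrier.wait()                     # no gradient crosses this boundary

    def run(model, shards, interval_size, n_intervals, grad_step):
        barrier = threading.Barrier(len(shards))
        threads = [threading.Thread(target=worker,
                                    args=(model, s, interval_size, n_intervals,
                                          barrier, grad_step))
                   for s in shards]
        for t in threads:
            t.start()
        for t in threads:
            t.join()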

14:40
Sanjif Shanmugavelu (Groq Inc, UK)
Mathieu Taillefumier (CSCS Swiss National Supercomputing Centre, Switzerland)
Christopher Culver (Groq Inc, United States)
Oscar Hernandez (Oak Ridge National Laboratory, United States)
Vijay Ganesh (Georgia Tech, United States)
Ada Sedova (Oak Ridge National Laboratory, United States)
Robustness of deep learning classification to adversarial input on GPUs: asynchronous parallel accumulation is a source of vulnerability

ABSTRACT. The ability of machine learning (ML) classification models to resist small, targeted input perturbations—known as adversarial attacks—is a key measure of their safety and reliability. We show that floating-point non-associativity (FPNA) coupled with asynchronous parallel programming on GPUs is sufficient to result in misclassification, without any perturbation to the input. Additionally, we show this misclassification is particularly significant for inputs close to the decision boundary and that standard adversarial robustness results may be overestimated by up to 4.6% when not considering machine-level details. We first study a linear classifier, before focusing on standard Graph Neural Network (GNN) architectures and datasets used in robustness assessments. We present a novel black-box attack using Bayesian optimization to determine external workloads that bias the output of reductions on GPUs and reliably lead to misclassification. Motivated by these results, we present a new learnable permutation (LP) gradient-based approach to learn floating-point operation orderings that lead to misclassifications, making the assumption that any reduction or permutation ordering is possible. This LP approach provides a worst-case estimate in a computationally efficient manner, avoiding the need to run identical experiments tens of thousands of times over a potentially large set of possible GPU states or architectures. Finally, we investigate parallel reduction ordering across different GPU architectures for a reduction under three conditions: (1) executing external background workloads, (2) utilizing multi-GPU virtualization, and (3) applying power capping. Our results demonstrate that parallel reduction ordering varies significantly across architectures under the first two conditions. These results and the methods developed here can help to include machine-level considerations into adversarial robustness assessments, which can make a difference in safety- and mission-critical applications.
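
The underlying floating-point effect is easy to reproduce on any machine; the snippet below shows two summation orders of the same values producing different results, which is the non-associativity that parallel reductions on GPUs expose. The GPU-specific attacks and the learnable-permutation estimator are, of course, not captured by this toy example.

    # Floating-point addition is not associative: the same values summed in two
    # orders give different results. Parallel reductions effectively pick an
    # ordering at runtime, which is the machine-level effect exploited above.
    vals = [1e16, 1.0, -1e16] * 1000
    print(sum(vals))                      # 0.0    (each 1.0 is absorbed by 1e16)
    print(sum(sorted(vals, key=abs)))     # 1000.0 (the 1.0s are accumulated first)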

15:00
Matyáš Brabec (Charles University, Czechia)
Jiří Klepl (Charles University, Czechia)
Michal Töpfer (Charles University, Czechia)
Martin Kruliš (Charles University, Czechia)
Tutoring LLM into a Better CUDA Optimizer

ABSTRACT. Recent leaps in large language models (LLMs) caused a revolution in programming tools (like GitHub Copilot) that can help with code generation, debugging, and even performance optimization. In this paper, we focus on the capabilities of the most recent reasoning models to generate optimized CUDA code for predefined, well-known tasks. Our objective is to determine which types of code optimizations and parallel patterns the LLMs can perform by themselves and whether they can be improved by tutoring (providing more detailed hints and guidelines in the prompt). The generated solutions were evaluated both automatically (for correctness and speedup) and manually (code reviews) to provide a more detailed perspective. We also tried an interactive approach where the LLM can fix its previous mistakes within a session. The results indicate that LLMs are quite skilled coders; however, they require tutoring to reach optimized solutions provided by parallel computing experts.

14:00-15:30 Session 16C: Track 3.2: Architecture
14:00
Hao Lan (Institute of Computing Technology, Chinese Academy of Sciences; ZGC Laboratory; University of Chinese Academy of Sciences, China)
Ziang Zhou (Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China)
Qi Zhu (Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China)
Wei Yan (Institute of Computing Technology, Chinese Academy of Sciences; ZGC Laboratory; University of Chinese Academy of Sciences, China)
Qinfen Hao (Institute of Computing Technology, Chinese Academy of Sciences; ZGC Laboratory; University of Chinese Academy of Sciences, China)
Xiaochun Ye (Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China)
Yong Liu (ZGC Laboratory; Qi-AnXin Technology Group, QAX Security Center, Xicheng District, Beijing, China)
Ninghui Sun (Institute of Computing Technology, Chinese Academy of Sciences; ZGC Laboratory; University of Chinese Academy of Sciences, China)
ParTEE: A Framework for Secure Parallel Computing of RISC-V TEEs

ABSTRACT. As RISC-V multi-core platforms advance into the domains of high-performance computing and cloud, safeguarding code and sensitive data through Trusted Execution Environments (TEEs) has become critical. Current RISC-V TEEs struggle to support parallel computing due to limitations in memory protection mechanisms. To address these limitations, we present ParTEE, a novel TEE framework designed to enable multi-threaded execution within RISC-V enclaves. ParTEE allows multiple threads to access shared memory regions within the enclave, thereby supporting parallel computing in RISC-V TEEs. To protect the security of multi-threaded programs, we incorporate two security mechanisms: (1) a secure thread detector that identifies potentially malicious threads, ensuring that secure threads can access shared memory regions while preventing unauthorized access; and (2) a secure monitor (SM) operating at the highest privilege level, responsible for managing shared memory access permissions for secure threads. ParTEE is compatible with various open-source RISC-V architectures. We conduct functional validation using QEMU emulators, and deploy ParTEE on Xilinx KC705 FPGAs featuring a four-core RISC-V system. ParTEE demonstrates negligible performance overhead and achieves a 3.59× speedup compared to conventional RISC-V TEEs. Finally, we illustrate its capability with a machine learning application.

14:20
Ruimin Shi (KTH Royal Institute of Technology, Sweden)
Gabin Schieffer (KTH Royal Institute of Technology, Sweden)
Maya Gokhale (Lawrence Livermore National Laboratory, United States)
Pei-Hung Lin (Lawrence Livermore National Laboratory, United States)
Hiren Patel (University of Waterloo, Canada)
Ivy Peng (KTH Royal Institute of Technology, Sweden)
ARM SVE Unleashed: Performance and Insights Across HPC Applications on Nvidia Grace

ABSTRACT. Vector architectures are essential for boosting computing throughput. ARM provides SVE as the next-generation length-agnostic vector extension beyond traditional fixed-length SIMD. This work provides a first study of the maturity and readiness of exploiting ARM and SVE in HPC. Using selected hardware performance events on the ARM Grace processor and analytical models, we derive new metrics to quantify the effectiveness of exploiting SVE vectorization to reduce executed instructions and improve performance speedup. We further propose an adapted roofline model that combines vector length and data elements to identify potential performance bottlenecks. Finally, we propose a decision tree for classifying the SVE-boosted performance in applications.

14:40
Jin Pu (Shanghai Jiao Tong University, China)
Shengan Zheng (Shanghai Jiao Tong University, China)
Penghao Sun (Shanghai Jiao Tong University, China)
Guifeng Wang (Shanghai Jiao Tong University, China)
Xin Xie (Shanghai Jiao Tong University, China)
Linpeng Huang (Shanghai Jiao Tong University, China)
CSGC: Collaborative File System Garbage Collection with Computational Storage

ABSTRACT. Garbage collection (GC) in log-structured file systems (LFS) is known to cause performance degradation, particularly in write-intensive scenarios. Existing approaches, such as in-storage migration and hotness-based grouping, aim to enhance GC efficiency. However, these approaches lack effective host-device collaboration, leading to either excessive communication overhead from inefficient task offloading or severe write amplification due to the log-on-log issue. We present CSGC, a host-device collaborative GC approach that utilizes computational storage device (CSD) to optimize GC efficiency. CSGC uses a pipelined CSD-offloaded migration framework with metadata piggybacking to reduce host-device communication overhead, along with a separate flash translation layer (sFTL) to preserve data hotness and mitigate write amplification. Our evaluations using F2FS and Daisy+ OpenSSD show that CSGC significantly improves GC performance, contributing to up to 3.6× and 1.9× speedup in I/O throughput over vanilla F2FS and IPLFS, respectively.

15:00
Zhenxuan Xiong (National University of Defense Technology, China)
Libo Huang (National University of Defense Technology, China)
Ling Yang (National University of Defense Technology, China)
Hui Guo (National University of Defense Technology, China)
Junhui Wang (National University of Defense Technology, China)
Zheng Zhong (National University of Defense Technology, China)
Songwen Pei (University of Shanghai for Science and Technology, China)
Gang Chen (Sun Yat-sen University, China)
Yongwen Wang (National University of Defense Technology, China)
SONet: Towards Practical Online Neural Network for Enhancing Hard-To-Predict Branches

ABSTRACT. When handling a large number of hard-to-predict (H2P) branches, even the state-of-the-art branch predictor, TAGE-SC-L, suffers from severe table entry allocation pressure, hindering its predictive performance. Because TAGE cannot easily extract correlation from relevant history, it needs to allocate plenty of entries to memorize these branches. Using neural networks to predict these H2P branches is an effective approach. However, most existing studies are based on offline methods: their models are only effective for data similar to the training data, and the expensive training and inference process makes them difficult to apply in practical processors. To explore more practical solutions, we propose SONet, a shallow online neural network for H2P branches, with a practical training and inference architecture. At runtime, H2P branches are identified and selected, with suitable ones offloaded to SONet for specialized prediction, while TAGE-SC-L predicts the remaining branches. Experiments show that it improves prediction performance for programs where mispredictions are concentrated in a few branches. Over a set of workloads including CBP-5 and SPEC2017, a 16KB SONet backing a 64KB TAGE-SC-L reduces MPKI by 1.8%. Compared to a TAGE-SC-L of equal capacity, our method decreases MPKI by 0.7% within acceptable prediction latency.

15:30-16:00 Coffee Break
16:00-17:30 Session 17A: Track 3.3: Caching and Memory for ML
16:00
Mengyue Xi (Sun Yat-sen University, China)
Jingyi He (Sun Yat-sen University, China)
Xianwei Zhang (Sun Yat-sen University, China)
CacheC: LLM-based GPU Cache Management to Enhance Kernel Concurrency

ABSTRACT. Each new generation of GPUs significantly enhances the resources available for diverse applications, with kernel concurrency playing a crucial role in maximizing utilization and boosting performance. However, existing kernel concurrency strategies tend to neglect cache contention, where concurrent kernels may target the same cache levels. Traditional cache management methods are inadequate for addressing this issue, as they focus on individual kernels without adequately considering inter-kernel interactions. To overcome these challenges, we propose CacheC, a method that utilizes large language models (LLMs) with in-context learning (ICL) to analyze cache affinity at the granularity of individual load instructions. For each kernel pair, CacheC extracts detailed features of all loads, evaluates their cache affinity across levels, and scores their suitability for concurrency. Based on these scores, CacheC not only selects kernel pairs with appropriate cache compatibility but also formulates load-specific cache bypassing strategies to enhance utilization. By iteratively scheduling kernel pairs and adjusting their cache policies, CacheC dynamically optimizes cache utilization and reduces cache contention during concurrent kernel execution. Experiments on off-the-shelf GPUs demonstrate that CacheC achieves a 19.67% reduction in turnaround time and a 24.48% improvement in throughput. It also delivers an average speedup of 1.337× across scheduled kernel pairs, showcasing its effectiveness in alleviating cache contention and enhancing kernel concurrency performance.

16:20
Zhaoyang Zeng (Chongqing University, China)
Yujuan Tan (Chongqing University, China)
Jiali Li (Tsinghua University, China)
Zhuoxin Bai (Chongqing University, China)
Kan Zhong (Chongqing University, China)
Duo Liu (Chongqing University, China)
Ao Ren (Chongqing University, China)
Cocache: An Accurate And Low-overhead Dynamic Caching Method for GNNs

ABSTRACT. Graph Neural Network (GNN) training often faces a critical bottleneck in feature extraction and CPU-to-GPU transfers. Caching frequently accessed nodes' features in GPU memory can mitigate this, but existing caching strategies fail in uniform graphs where nodes share similar edge connectivity. In such graphs, nodes share similar neighbor counts, so any node is sampled with similar probability during neighbor sampling, leading to two access traits: (1) no persistent hotspot nodes, and (2) highly dynamic node access. These traits challenge existing caching approaches: (1) static caching strategies keep cached nodes fixed during training, which cannot cope with the absence of persistent hotspot nodes; (2) existing dynamic caching strategies rely solely on recent node access order, failing to capture true access patterns and adapt to rapid changes in node hotness. As a result, existing strategies suffer from frequent cache misses and degraded performance on uniform graphs. To address this, we propose Cocache, a novel dynamic caching method to improve GNN training efficiency. It has two innovations: (1) it accurately determines hot nodes by tracking global node access patterns during an entire training epoch, and (2) it updates the cache with low overhead through a lightweight update decision strategy and an efficient CPU-GPU collaborative architecture. These dual optimizations enable accurate and low-overhead cache updates during training, accelerating GNN training by 1.2x-1.48x compared to existing methods.

16:40
Yi Luo (Southwest University of Science and Technology, China)
Yaobin Wang (Southwest University of Science and Technology, China)
Qi Wang (Southwest University of Science and Technology, China)
Yingchen Song (Southwest University of Science and Technology, China)
Huan Wu (Southwest University of Science and Technology, China)
Qingfeng Wang (Southwest University of Science and Technology, China)
Jun Huang (Southwest University of Science and Technology, China)
DCI: An Efficient Workload-Aware Dual-Cache Allocation GNN Inference Acceleration System

ABSTRACT. Graph Neural Networks (GNNs) commonly employ sampling-based methods for inference on large-scale real-world graphs. However, the inherent characteristics of sampling lead to redundant data loading during GNN inference, while slow data transfer between the host and GPU exacerbates the issues of slow inference and low resource utilization. Current methods to accelerate GNN inference face several challenges: (1) low GPU resource utilization; (2) neglect of adjacency matrix locality; and (3) long preprocessing times. To address these issues, we propose DCI, a system designed to accelerate GNN inference. The system provides a simple and effective cache capacity allocation and filling strategy that can adapt flexibly to different workload demands. During the pre-sampling phase, DCI allocates and fills cache capacities for node features and adjacency matrices based on workload patterns. Experimental results show that DCI accelerates sampling and node feature loading, achieving end-to-end inference speedups of 1.18× to 11.26× compared to DGL, and 1.14× to 13.68× compared to RAIN, while reducing preprocessing times by 52.8% to 98.7%. Additionally, DCI outperforms existing single-cache inference systems with speedups ranging from 1.08× to 1.32×. We also compared DCI with DUCATI's dual-cache population strategy, and DCI achieves nearly identical inference speeds while reducing preprocessing time to less than 20% of DUCATI's time.

17:00
Kazi Asifuzzaman (Oak Ridge National Laboratory, United States)
Aaron Young (Oak Ridge National Laboratory, United States)
Prasanna Date (Oak Ridge National Laboratory, United States)
Shruti Kulkarni (Oak Ridge National Laboratory, United States)
Narasinga Rao Miniskar (Oak Ridge National Laboratory, United States)
Matthew Marinella (Arizona State University, United States)
Jeffrey Vetter (Oak Ridge National Laboratory, United States)
ReSpike: A Co-Design Framework for Evaluating SNNs on ReRAM-based Neuromorphic Processors

ABSTRACT. With Moore's law approaching its end, traditional von Neumann architectures are struggling to keep up with the exceeding performance and memory requirements of artificial intelligence and machine learning algorithms. Unconventional computing approaches such as neuromorphic computing that leverage spiking neural networks (SNNs) to perform computation are gaining traction and seek to provide the paradigm shift necessary to sustain the increasing demands of modern applications. Novel memory technologies, such as resistive RAM (ReRAM), employ a crossbar architecture that possesses the inherent capability of efficiently computing vector-matrix multiplication, a dominant operation in SNNs. The prospect of naturally mapping SNNs to the crossbar structures provides a unique opportunity for achieving a high-performance, power-efficient neuromorphic system. In this work, we present ReSpike, a new framework, behavioral simulator, and architectural design based on ReRAM crossbar architectures, enabling modeling and co-design to achieve efficient execution of SNNs. We drive this co-design forward by quantifying the impact that ReRAM cell nonidealities have on the corresponding accuracy of an SNN application.

16:00-17:30 Session 17B: Track 2.3: Scheduling, Resource Management, Cloud, Edge Computing, and Workflows
16:00
Yang Xu (University of Science and Technology of China, China)
Zhiwei Yao (University of Science and Technology of China, China)
Hongli Xu (University of Science and Technology of China, China)
Yunming Liao (University of Science and Technology of China, China)
Zuan Xie (University of Science and Technology of China, China)
MPLS: Stacking Diverse Layers into One Model for Decentralized Federated Learning

ABSTRACT. Traditional Federated Learning (FL) enables collaborative training of deep neural networks (DNNs) across massive edge devices while preserving data privacy. However, its reliance on a centralized parameter server (PS) introduces communication bottlenecks and security risks. To address these issues, Decentralized Federated Learning (DFL) has emerged, which adopts peer-to-peer (P2P) communication to eliminate the PS. Despite its promise, DFL faces critical challenges: (1) limited bandwidth resources, (2) dynamic network conditions, and (3) data heterogeneity among devices. To conquer these challenges, we design and implement a communication-efficient DFL framework with peer and layer selection, namely MPLS, which has the following advantages. 1) Different from exchanging an entire model between two workers in previous works, each worker just collects multiple sub-models (i.e., some critical layers) from the chosen peers and stacks them into one model for aggregation. 2) MPLS adopts asynchronous training among workers without any coordinator and enables each worker to develop the peer and layer selection strategy adaptively via the proposed list scheduling algorithm. We implement MPLS on a physical platform, and extensive experiments on real-world DNNs and datasets demonstrate that MPLS achieves 2.1-4.2× speedup compared to the baselines.

16:20
Roopkatha Banerjee (Indian Institute of Science, India)
Tejus Chandrashekar (Indian Institute of Science, India)
Ananth Eswar (Indian Institute of Science, India)
Yogesh Simmhan (Indian Institute of Science, India)
Federated Learning within Global Energy Budget over Heterogeneous Edge Accelerators

ABSTRACT. Federated Learning (FL) enables collaborative model training across distributed clients while preserving data privacy. However, optimizing both energy efficiency and model accuracy remains a challenge, given device and data heterogeneity. Further, sustainable AI through a global energy budget for FL has not been explored. We propose a novel optimization problem for client selection in FL that maximizes the model accuracy within an overall energy limit, and reduces training time. We solve this efficiently with a unique bi-level ILP formulation that leverages approximate Shapley values and energy–time prediction models. Our FedJoule framework achieves superior training accuracies compared to SOTA and simple baselines for diverse energy budgets, non-IID distributions, and realistic experiment configurations, performing 15% and 48% better on accuracy and time, respectively. The results highlight the effectiveness of our method in achieving a viable trade-off between energy usage and performance in FL environments.

16:40
Volodia Parol-Guarino (Centre INRIA de l'Université de Rennes, France)
Nikos Parlavantzas (INSA Rennes, France)
Auction-based Placement of Functions in the Fog at Scale

ABSTRACT. Function-as-a-Service (FaaS) is a programming model in which applications are formed by chaining ephemeral computation units referred to as functions. FaaS is particularly suitable for developing fog-native applications by enabling flexible, on-demand placement of functions across the cloud-to-thing continuum. This continuum encompasses diverse fog nodes ranging from cloud servers to myriads of resource-constrained and geo-distributed devices. Although many recent studies have focused on efficiently placing functions on fog resources, limited attention has been given to application latency requirements. Moreover, few studies have considered the multiple entities that own fog nodes and explored mechanisms to incentivize fog node owners to share resources within the same fog network to improve quality of service for clients. This paper addresses the FaaS function placement problem in the fog through a market-based approach. Clients submit function placement requests with expected guarantees over network latency and allocated resources, encapsulated within a Service-Level Agreement (SLA). A marketplace then organizes an auction where fog nodes bid on the SLA to determine the node that will host the function and the revenue of the fog node owner. Our approach is evaluated by emulating networks of fog nodes, utilizing our reproducible and open-source testbed running on the Grid'5000 infrastructure. We evaluate various cooperative baselines on the same testbed and demonstrate that our approach reduces client spending by 2.9 to 3.3 times while maintaining the expected latency across fog networks with up to 663 nodes, under realistic loads from FaaS function chains.

17:00
Giuseppe Coviello (NEC Laboratories America, Inc., United States)
Kunal Rao (NEC Laboratories America, Inc., United States)
Mohammad Khojastepour (NEC Laboratories America, United States)
Srimat T. Chakradhar (NEC Labs, United States)
Bifröst: Peer-to-peer Load-balancing for Function Execution in Agentic AI Systems

ABSTRACT. Agentic AI systems rely on Large Language Models (LLMs) to execute complex tasks by invoking external functions. The efficiency of these systems depends on how well function execution is managed, especially under heterogeneous and high-variance workloads, where function execution times can range from milliseconds to several seconds. Traditional load-balancing techniques, such as round-robin, least-loaded, and Peak-EWMA (used in Linkerd), struggle in such settings: round-robin ignores load imbalance, least-loaded reacts slowly to rapid workload shifts, and Peak-EWMA relies on latency tracking, which is ineffective for workloads with high execution time variability. In this paper, we introduce Bifröst, a peer-to-peer load-balancing mechanism that distributes function requests based on real-time active request count rather than latency estimates. Instead of relying on centralized load-balancers or client-side decisions, Bifröst enables function-serving pods to dynamically distribute load by comparing queue lengths and offloading requests accordingly. This avoids unnecessary overhead while ensuring better responsiveness under high-variance workloads. Our evaluation on open-vocabulary object detection, multi-modal understanding, and code generation workloads shows that Bifröst improves function completion time by up to 20% when processing 13,700 requests from 137 AI agents on a 32-node Kubernetes cluster, outperforming both OpenFaaS and OpenFaaS with Linkerd. In an AI-driven insurance claims processing workflow, Bifröst achieves up to 25% faster execution.
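
The core decision rule, comparing active-request counts and forwarding to a shorter queue, can be sketched in a few lines; the function and pod names below are hypothetical, and Bifröst's actual peer discovery, request counters, and Kubernetes integration are not shown.

    # Sketch of queue-length-based offloading: a pod compares its in-flight
    # request count with its peers' and forwards a new request to the least
    # loaded peer only if that peer's queue is strictly shorter.
    def choose_executor(self_id, active_requests):
        """active_requests: dict pod_id -> current in-flight request count."""
        least_loaded = min(active_requests, key=active_requests.get)
        if active_requests[least_loaded] < active_requests[self_id]:
            return least_loaded      # offload to the shorter queue
        return self_id               # otherwise keep the request local

    print(choose_executor("pod-a", {"pod-a": 7, "pod-b": 2, "pod-c": 5}))  # pod-b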

16:00-17:30 Session 17C: Track 4.2: Efficient AI Inference and Model Serving at Scale
16:00
Ao Chen (Institute of Computing Technology, Chinese Academy of Sciences, China)
Guangli Li (Institute of Computing Technology, Chinese Academy of Sciences, China)
Feng Yu (Institute of Computing Technology, Chinese Academy of Sciences, China)
Xueying Wang (Beijing University of Posts and Telecommunications, China)
Jiacheng Zhao (Institute of Computing Technology at Chinese Academy of Sciences, China)
Huimin Cui (Institute of Computing Technology at Chinese Academy of Sciences, China)
Xiaobing Feng (Institute of Computing Technology at Chinese Academy of Sciences, China)
Jingling Xue (The University of New South Wales, Australia)
TopServe: Task-Operator Co-Scheduling for Efficient Multi-DNN Inference Serving on GPUs

ABSTRACT. Emerging intelligent applications often require collaborative inference from multiple deep neural networks (multi-DNNs) to support complex tasks like augmented and virtual reality. However, efficiently serving multi-DNNs is challenging due to heterogeneous model structures, parallelism strategies, and dynamic batching behaviors. Existing methods either use online task-level scheduling for batched inference or offline operator-level scheduling to optimize concurrency. These approaches, limited to a single perspective, may lead to sub-optimal performance in evolving multi-DNN serving scenarios. In this paper, we present TopServe, an efficient multi-DNN serving system that integrates dynamic batching with adaptive inter-operator parallelization strategies. During the offline phase, TopServe partitions the multi-DNN model into balanced subgraphs and generates candidate operator scheduling strategies. In the online phase, TopServe performs task-operator co-scheduling, combining effective batching with optimized operator parallelization. Our extensive evaluation shows that TopServe can significantly reduce the average latency and improve the throughput compared to state-of-the-art solutions.

16:20
Tianyu Guo (Sun Yat-Sen University, China)
Hande Dong (Tencent, China)
Yichong Leng (University of Science and Technology of China, China)
Feng Liu (Tencent, China)
Cheater Lin (Tencent, China)
Nong Xiao (Sun Yat-sen University, China)
Xianwei Zhang (Sun Yat-sen University, China)
EFIM: Efficient Serving of LLMs for Infilling Tasks with Improved KV Cache Reuse

ABSTRACT. Large language models (LLMs) are often used for infilling tasks, which involve predicting or generating missing information in a given text. These tasks typically require multiple interactions with similar context. To reduce the computation of repeated historical tokens, cross-request key-value (KV) cache reuse, a technique that stores and reuses intermediate computations, has become a crucial method in multi-round interactive services. However, in infilling tasks, KV cache reuse is often hindered by the structure of the prompt format, which typically consists of a prefix and suffix relative to the insertion point. Specifically, the KV cache of the prefix or suffix part is frequently invalidated as the other part (suffix or prefix) is incrementally generated. To address the issue, we propose EFIM, a transformed version of the fill-in-the-middle (FIM) prompt format that unleashes the performance potential of KV cache reuse. Although the transformed prompt can solve the inefficiency, it exposes subtoken generation problems in current LLMs, where they have difficulty generating partial words accurately. Therefore, we introduce a fragment tokenization training method which splits text into multiple fragments before tokenization during data processing. Experiments on two representative LLMs show that LLM serving with EFIM can lower the latency by 52% and improve the throughput by 98% while maintaining the original infilling capability.
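
For orientation, the snippet below sketches a conventional fill-in-the-middle prompt layout (the sentinel token names are placeholders, as each model family defines its own) and notes in comments why folding generated text back into the prefix shifts the suffix tokens and invalidates their KV cache entries, the inefficiency EFIM targets; EFIM's own transformed format is defined in the paper and not reproduced here.

    # Conventional FIM prompt layout with placeholder sentinel tokens.
    PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

    def fim_prompt(prefix, suffix):
        return f"{PRE}{prefix}{SUF}{suffix}{MID}"

    round1 = fim_prompt("def add(a, b):\n    ", "\n    return result")
    # Next round: the model's previous output is folded into the prefix...
    round2 = fim_prompt("def add(a, b):\n    result = a + b", "\n    return result")
    # ...so every suffix token shifts position relative to round1, and its
    # cached KV entries can no longer be reused for this request.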

16:40
Nicolás Hernández González (Universidad de La Laguna, Spain)
Pedro Antonio Toledo Delgado (Universidad de La Laguna, Spain)
Vicente José Blanco Pérez (Universidad de La Laguna, Spain)
Francisco Carmelo Almeida Rodríguez (Universidad de La Laguna, Spain)
2:4 Pruning on Edge Devices: Performance, Energy Efficiency and Accuracy

ABSTRACT. Efficient deployment of deep learning models on edge devices is critical for real-time applications. While 2:4 structured pruning has recently been studied on high-performance GPUs, its viability for edge devices has received less attention, despite its potential benefits in resource-constrained environments. This paper investigates its impact on performance, energy efficiency, and accuracy on the Nvidia Jetson Orin, leveraging the sparse tensor cores on this architecture to assess its practicality for edge computing. We conduct comprehensive experiments on several deep learning architectures, including convolutional neural networks and a transformer-based system, focusing on key metrics such as inference latency, power consumption, and predictive accuracy. The results indicate that 2:4 pruning has a limited effect on performance and energy efficiency, except for residual networks and the transformer. However, this pruning technique shows promising results in terms of size reduction and accuracy recovery, demonstrating the ability to regain accuracy efficiently by adjusting the pruning criterion. These findings provide valuable insights into the trade-offs associated with sparsity-driven optimization and offer guidelines for deploying high-performance models in resource-constrained environments.
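
As a reference point, the snippet below applies plain magnitude-based 2:4 structured pruning to a weight matrix, keeping the two largest-magnitude values in every group of four, which is the sparsity pattern the Jetson Orin's sparse tensor cores accelerate; the paper's adjusted pruning criteria and accuracy-recovery procedure are not shown.

    # Magnitude-based 2:4 structured pruning: in every contiguous group of four
    # weights along the last dimension, keep the two largest in magnitude and
    # zero the other two.
    import numpy as np

    def prune_2_4(w):
        assert w.shape[-1] % 4 == 0
        groups = w.reshape(-1, 4)
        drop = np.argsort(np.abs(groups), axis=1)[:, :2]   # two smallest per group
        pruned = groups.copy()
        np.put_along_axis(pruned, drop, 0.0, axis=1)
        return pruned.reshape(w.shape)

    w = np.array([[0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]])
    print(prune_2_4(w))   # keeps 0.9/-0.7 in the first group, 0.3/-0.4 in the second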

17:00
Cheng Gu (Shanghai Jiao Tong University, China)
Gang Li (Institute of Automation, Chinese Academy of Sciences, China)
Xuan Zhang (Shanghai Jiao Tong University, China)
Jiayao Ling (Shanghai Jiao Tong University, China)
Xiaolong Lin (Shanghai Jiao Tong University, China)
Zhuoran Song (Shanghai Jiao Tong University, China)
Jian Cheng (Institute of Automation, Chinese Academy of Sciences, China)
Xiaoyao Liang (Shanghai Jiao Tong University, China)
Light-DiT: An Importance-Aware Dynamic Compression Framework for Diffusion Transformers

ABSTRACT. Diffusion Transformers (DiTs) demonstrate remarkable generative abilities in AI. However, the iterative denoising process inherent in diffusion incurs substantial computational and memory access costs, impeding its fast and energy-efficient edge inference. To mitigate the overhead of denoising, in this paper we propose a post-training framework that jointly utilizes pruning and quantization for hardware-efficient DiT inference, which is based on the observation that not all denoising blocks within a model are equally important during image generation. To achieve importance-aware dynamic compression, we introduce metrics to assess the importance of DiTs' blocks and layers, and then unify mixed-sparsity pruning and mixed-precision quantization based on the importance metrics. Experiments show that our approach achieves a 1.41× inference speedup through pruning with a mixed precision of W3.2A4.9, while incurring minimal accuracy loss. Furthermore, evaluation on bit-flexible DNN accelerators demonstrates up to 2.78× performance improvement and 1.99× better energy efficiency can be achieved compared to W8A8 quantization without pruning.