EURO-PAR 2025: 31ST INTERNATIONAL EUROPEAN CONFERENCE ON PARALLEL AND DISTRIBUTED COMPUTING
PROGRAM FOR FRIDAY, AUGUST 29TH

10:00-10:30 Coffee Break
10:30-12:00 Session 19A: Track 2.4: Scheduling, Resource Management, Cloud, Edge Computing, and Workflows
10:30
Yunling Chen (School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China)
Qingyin Lin (School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China)
Zhitao Chen (School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China)
Yang Ou (College of Computer Science and Technology, National University of Defense Technology, Changsha, China)
Zhiguang Chen (School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China)
DynoInfer: Adaptive Resource Orchestration for LLM Inference on Resource-Constrained PCs

ABSTRACT. As LLMs become popular, there is a growing trend toward deploying LLMs on personal computers (PCs) to save inference costs, reduce network latency, and enhance privacy protection. However, the limited hardware resources of PCs and the inefficient resource allocation of existing frameworks slow down LLM inference. While recent works focus on reducing the resources consumed by LLM inference, they overlook the fact that available resources change dynamically due to the multitasking nature of PCs. This paper presents DynoInfer, an LLM inference framework that dynamically allocates resources, specifically designed for resource-constrained personal computing environments. In essence, DynoInfer captures and utilizes CPU and memory resources that are not consumed by other workloads running on the PC, to improve overall resource utilization and reduce resource contention. For CPU resources, a locality-aware dynamic thread allocation strategy is adopted to adjust the threads allocated to LLM inference. For memory resources, a parameter preloading strategy with dynamic residing is employed to carefully load parameters from storage into memory. Furthermore, DynoInfer capitalizes on idle periods to compress historical conversations, extending the memorized context length under resource-constrained scenarios. Evaluations show that DynoInfer can achieve an inference speedup of 5.76x compared to llama.cpp, while it hardly interferes with other user workloads.
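
As a rough illustration of the dynamic thread-allocation idea, the following Python sketch sizes an inference thread pool from whatever CPU capacity other workloads currently leave idle. The function name, the reserve headroom, and the use of psutil are assumptions for illustration, not DynoInfer's implementation.

import os
import psutil  # third-party; assumed available (pip install psutil)

def inference_threads(min_threads=1, reserve=0.1):
    """Estimate how many hardware threads LLM inference may use right now."""
    idle_frac = 1.0 - psutil.cpu_percent(interval=0.5) / 100.0   # system-wide idle share
    usable = max(0.0, idle_frac - reserve)                       # keep headroom for the user
    return max(min_threads, int(usable * os.cpu_count()))

# An inference loop would re-query inference_threads() every few tokens and
# resize its worker pool; the paper's locality-aware placement of those threads
# is not modeled here.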

10:50
Yunlan Wang (Northwestern Polytechnical University, China)
Yutong Liu (Northwestern Polytechnical University, China)
Jianhua Gu (Northwestern Polytechnical University, China)
Tianhai Zhao (Northwestern Polytechnical University, China)
Zhengxiong Hou (Northwestern Polytechnical University, China)
Chengwen Zhong (Northwestern Polytechnical University, China)
Container Workload Prediction Using Deep Domain Adaptation in Transfer Learning

ABSTRACT. Containers have become the primary deployment method for cloud applications. Accurate prediction of container workload is crucial for efficient resource allocation and management in cloud platforms, energy optimization in cloud data centers, and load balancing in cloud environments. Statistical models and neural network models are commonly used for container workload prediction. However, statistical models have difficulty capturing complex workload variations, and neural network models require extensive historical data, which is a challenge due to the short lifespan of containers. In this paper, we propose a container workload prediction model based on deep domain adaptation in transfer learning. The proposed model comprises a feature extractor, a domain adversarial adapter, and a workload predictor. The feature extractor utilizes self-attention and cross-attention mechanisms to extract private and shared features from the source and target domains. The domain adversarial adapter reduces distribution differences between domains through adversarial learning. The workload predictor directly applies a prediction model trained on source-target container data to predict the workload of target containers while avoiding performance degradation caused by domain drift. To validate the prediction accuracy of the proposed model, this study utilized containers from the Alibaba cluster-trace-v2017 dataset that met the small-sample condition as the target domain container set, and containers from the Google cluster-usage traces v3 dataset as the source domain container set. Experimental results demonstrated that the proposed system achieved a substantial improvement in prediction accuracy compared to other approaches, with an average decrease in MAPE of 56.82%, 47.39%, 31.77%, 50.13%, and 56.63% compared to the Autoregressive Integrated Moving Average (ARIMA), Long Short-Term Memory (LSTM), Autoregressive Recurrent Neural Network (DeepAR), Deep Renewal Processes (DRP), and Multivariate Quantile Function Forecaster (MQF2) models, respectively.
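
The adversarial-adaptation component can be pictured with a small PyTorch sketch built around a gradient-reversal layer. The GRU extractor below is a simple stand-in for the paper's attention-based feature extractor, and all layer sizes and names are arbitrary assumptions.

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, reversed (scaled) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class DomainAdaptiveForecaster(nn.Module):
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.extractor = nn.GRU(in_dim, hidden, batch_first=True)  # stand-in extractor
        self.predictor = nn.Linear(hidden, 1)    # workload forecast head
        self.domain_head = nn.Linear(hidden, 2)  # source-vs-target discriminator

    def forward(self, x, lam=1.0):
        _, h = self.extractor(x)                 # h: (num_layers, batch, hidden)
        h = h[-1]
        # The predictor is trained on the forecasting loss; the domain head is
        # trained adversarially through the reversed gradient, pushing the
        # extractor toward domain-invariant features.
        return self.predictor(h), self.domain_head(GradReverse.apply(h, lam))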

11:10
Sunil Kumar (IIIT-Delhi, India)
Vivek Kumar (IIIT-Delhi, India)
KarmaPM: Reward-Driven Power Manager

ABSTRACT. Hardware overprovisioning is a widely used technique to improve the average power utilization of computing systems by capping the processor's power consumption. However, applying a uniform power cap across the sockets of a multiprocessor system can significantly impact co-running applications due to workload variations. This paper introduces KarmaPM, a novel power management library for co-running applications on multiprocessor systems, independent of the parallel programming model and based on application power donation karmas. KarmaPM dynamically redistributes power bidirectionally across the sockets to improve overall system throughput for co-running applications while maintaining fairness between them. KarmaPM periodically profiles the CPU utilization of each socket. When it detects a socket underutilizing its CPU resources, it donates the surplus power from this donor socket to the other sockets (receivers) exhibiting high CPU utilization. When donor sockets enter a high CPU utilization phase, KarmaPM employs a reward power scheme that rewards the donor sockets by returning a portion of the transferred power. We evaluated KarmaPM across various exascale proxy application mixes and power caps on a four-socket, 72-core Intel Cooper Lake processor. Our results show that KarmaPM improved the system throughput (geometric mean) by 13.2% at a lower power cap and 6.6% at a higher power cap. Additionally, KarmaPM delivered improvements of 12.5% and 4.4% in system throughput (geomean) compared to the state of the art at these respective power caps.
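
The donate-and-reward cycle can be sketched as one rebalancing step over per-socket power caps. The thresholds, fractions, and bookkeeping below are invented for illustration and are not KarmaPM's actual policy.

def rebalance(cap, util, karma, low=0.3, high=0.8, donate=0.5, reward=0.25):
    """One profiling interval. cap: per-socket power caps (W); util: CPU
    utilization in [0, 1]; karma: power each socket has donated so far (W)."""
    n = len(cap)
    receivers = [i for i in range(n) if util[i] > high]
    # Phase 1: idle sockets donate part of their cap to busy sockets.
    for d in range(n):
        if util[d] < low and receivers:
            surplus = cap[d] * donate
            cap[d] -= surplus
            karma[d] += surplus
            for r in receivers:
                cap[r] += surplus / len(receivers)
    # Phase 2: a former donor that became busy is rewarded with part of its
    # karma, taken back evenly from sockets that are not busy themselves.
    for d in range(n):
        if util[d] > high and karma[d] > 0:
            back = karma[d] * reward
            payers = [j for j in range(n) if j != d and util[j] <= high]
            if payers:
                for j in payers:
                    cap[j] -= back / len(payers)
                cap[d] += back
                karma[d] -= back
    return cap, karma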

11:30
Nobel Dhar (Kennesaw State University, United States)
Bobin Deng (Kennesaw State University, United States)
Md Romyull Islam (Kennesaw State University, United States)
Xinyue Zhang (Kennesaw State University, United States)
Kazi Fahim Ahmad Nasif (Kennesaw State University, United States)
Kun Suo (Kennesaw State University, United States)
A Sparsity Predicting Approach for General Large Language Models via Activation Pattern Clustering

ABSTRACT. Large Language Models (LLMs) exhibit significant activation sparsity, where only a subset of neurons are active for a given input. While this sparsity presents opportunities for reducing computational cost, efficiently utilizing it requires predicting activation patterns in a scalable manner. However, direct prediction at the neuron level is computationally expensive due to the vast number of neurons in modern LLMs. To enable efficient prediction and utilization of activation sparsity, we propose a clustering-based activation pattern compression framework. Instead of treating each neuron independently, we group similar activation patterns into a small set of representative clusters. By predicting cluster assignments rather than individual neuron states, future models can efficiently infer activation patterns from precomputed centroids. Our approach significantly reduces the complexity of activation modeling while preserving model accuracy, ensuring minimal degradation in perplexity (PPL) scores. We detail the clustering algorithm, analyze its effectiveness in capturing meaningful activation structures, and demonstrate its potential for improving sparse computation efficiency. This clustering-based formulation serves as a foundation for future work on activation pattern prediction, paving the way for efficient inference in large-scale language models.
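
A hedged sketch of the core idea: compress observed per-token activation masks into a small codebook of representative patterns, so a predictor only has to pick a cluster id rather than per-neuron states. The clustering method (k-means) and parameters are assumptions for illustration.

import numpy as np
from sklearn.cluster import KMeans

def build_pattern_codebook(masks, k=64):
    """masks: (num_tokens, num_neurons) 0/1 array of observed activation patterns."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(masks.astype(np.float32))
    centroids = (km.cluster_centers_ > 0.5).astype(np.uint8)   # binarize back to masks
    return km, centroids

def predict_mask(km, centroids, mask):
    """Replace an exact neuron-level mask by its nearest representative pattern."""
    cluster_id = km.predict(mask.reshape(1, -1).astype(np.float32))[0]
    return centroids[cluster_id]

# Example with random data; in practice the masks would come from profiling an
# LLM's FFN activations on calibration inputs.
masks = (np.random.rand(1000, 4096) < 0.1).astype(np.uint8)
km, codebook = build_pattern_codebook(masks, k=16)
approx = predict_mask(km, codebook, masks[0])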

10:30-12:00 Session 19B: Track 4.3: Distributed systems, Compression, and Federated Applications
10:30
Zhichen Feng (Computer Network Information Center, Chinese Academy of Sciences & University of Chinese Academy of Sciences, China)
Xin Zhang (Computer Network Information Center, Chinese Academy of Sciences & University of Chinese Academy of Sciences, China)
DiffNO: Neural Operator Learning using Physically Structured Constrained Diffusion Model

ABSTRACT. We propose DiffNO, a novel framework that synergizes diffusion models with kernel-integrated neural operators to solve nonlinear partial differential equations (PDEs). Compared to deterministic architectures, diffusion models have a better chance of learning the complex mappings in the evolution of nonlinear PDEs due to the introduction of stochastic variables. However, it is challenging for diffusion models to directly learn complex nonlinear mappings on their own. To address these limitations, DiffNO constructs a kernel-integrated diffusion operator, incorporating the structure of Green's functions as prior knowledge into the diffusion drift term, thereby establishing a physically constrained stochastic evolution process. On a dataset containing various types of PDEs, especially including two highly nonlinear PDEs, the 2D Cahn-Hilliard and reaction-diffusion systems, our method significantly outperforms current state-of-the-art methods.

10:50
Loris Belcastro (University of Calabria, Italy)
Paolo Ferragina (Sant’Anna School of Advanced Studies, Italy)
Giovanni Manzini (University of Pisa, Italy)
Fabrizio Marozzo (DIMES, University of Calabria, Italy)
Domenico Talia (University of Calabria, Italy)
Paolo Trunfio (DEIS, University of Calabria, Italy)
Scalable Compression of Massive Data Collections on HPC Systems

ABSTRACT. The exponential growth of digital data poses a significant storage challenge, straining current storage systems in terms of cost, efficiency, maintainability, and available resources. For large-scale data archiving, highly efficient data compression techniques are vital for minimizing storage overhead, improving communication efficiency, and optimizing data retrieval performance. This paper presents a scalable parallel workflow designed to compress vast collections of files on high-performance computing systems. Leveraging the Permute-Partition-Compress (PPC) paradigm, the proposed workflow optimizes both compression ratio and processing speed. By integrating a data clustering technique, our solution effectively addresses the challenges posed by large-scale data collections in terms of compression efficiency and scalability. Experiments were conducted on the Leonardo petascale supercomputer of CINECA (leonardo-supercomputer.cineca.eu), processing a subset of the Software Heritage archive consisting of about 49 million files of C++ code and totaling 1.1 TB of space. Experimental results show significant gains in both compression speedup and scalability.
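
A minimal sketch of the Permute-Partition-Compress idea (not the paper's MPI workflow): order files so similar content is adjacent, cut the stream into chunks, and compress the chunks in parallel. The sort key below is a crude stand-in for the clustering technique described in the paper, and chunk size and worker count are illustrative.

import zlib
from concurrent.futures import ProcessPoolExecutor

def ppc_compress(files, chunk_size=1 << 20, workers=4):
    """files: list of byte strings; returns a list of independently compressed chunks."""
    # Permute: sort by a cheap content key so near-duplicates end up together.
    ordered = sorted(files, key=lambda blob: (len(blob), blob[:64]))
    # Partition: concatenate and split into fixed-size chunks.
    stream = b"".join(ordered)
    chunks = [stream[i:i + chunk_size] for i in range(0, len(stream), chunk_size)]
    # Compress: each chunk independently, so chunks can also decompress in parallel.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(zlib.compress, chunks))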

11:10
Sabtain Ahmad (Vienna University of Technology, Austria)
Thomas Schneidergruber (Paris-Lodron University of Salzburg, Austria)
Ivona Brandic (Vienna University of Technology, Austria)
Johannes Scholz (Paris-Lodron University of Salzburg, Austria)
On-Device Federated Learning for Remote Alpine Livestock Monitoring

ABSTRACT. Alpine livestock monitoring is critical for ecological preservation and agricultural efficiency. However, existing solutions struggle with energy constraints, limited network availability, and intermittent connectivity in remote environments. To address this, we propose an on-device federated learning framework tailored for PV-powered IoT sensors to optimize energy-communication tradeoffs. Our approach introduces staleness-aware aggregation and solar-aware training scheduling to address intermittent connectivity and PV variability in remote alpine environments. Deployed on a real-world testbed with collar sensors, the framework achieves 92% accuracy in time-series location prediction and 89% F1-score in anomaly detection while using 68% less energy than centralized baselines.
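
One plausible form of staleness-aware aggregation is shown below: updates from collars that report late are still used, but with a weight that decays with how many rounds old they are. The decay rule and learning rate are assumptions, not the paper's exact scheme.

import numpy as np

def aggregate(global_w, client_updates, alpha=0.5):
    """client_updates: list of (delta_weights, staleness_in_rounds) tuples."""
    agg = np.zeros_like(global_w)
    total = 0.0
    for delta, staleness in client_updates:
        w = 1.0 / (1.0 + staleness)        # older updates count less
        agg += w * delta
        total += w
    if total > 0:
        global_w = global_w + alpha * (agg / total)
    return global_w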

11:30
Rubayet Rahman Rongon (Washington State University, United States)
Xuechen Zhang (Washington State University, United States)
IAUG: Accelerating Augmentation with Importance Sampling in Deep Neural Network Training

ABSTRACT. Data augmentation is a widely used technique in deep learning to enhance model accuracy. It often becomes a performance bottleneck of deep neural network (DNN) training because augmentation operations are CPU-intensive. In this paper, we propose iAug, an importance-informed augmentation framework to reduce sample augmentation time in DNN training by selectively applying different numbers of augmentation layers to data samples based on sample importance. First, iAug uses the loss distribution of samples to classify data samples during training to maximize its performance potential. Second, it constantly monitors performance loss due to importance-aware augmentation and uses an error compensation algorithm to adjust augmentation strategies for achieving the targeted accuracy accepted by users. Third, it opportunistically promotes low-importance samples to high-importance samples to improve data diversity and model accuracy. Experiments on standard datasets and DNN models show that iAug reduces preprocessing time by up to 26.2% and total training time by up to 14%, while maintaining or improving accuracy over state-of-the-art methods, proving its effectiveness for efficient DNN training.
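
The selective-augmentation step can be sketched as follows: samples whose recent loss is high receive the full augmentation pipeline, low-loss samples receive a cheap one, and a few low-importance samples are occasionally promoted for diversity. Threshold, promotion probability, and the concrete transforms are placeholders, not iAug's.

import random

def augment(sample, loss, loss_threshold, heavy_ops, light_ops, promote_prob=0.05):
    """Apply many or few augmentation layers depending on sample importance."""
    important = loss >= loss_threshold
    # Opportunistic promotion: occasionally treat a low-importance sample as
    # important to preserve data diversity.
    if not important and random.random() < promote_prob:
        important = True
    ops = heavy_ops if important else light_ops
    for op in ops:
        sample = op(sample)
    return sample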

10:30-12:00 Session 19C: Track 5.1: Theory and Algorithms
10:30
Spyros Angelopoulos (CNRS, France)
Loris Marchal (CNRS, France)
Adrien Obrecht (ENS-Lyon, France)
Bertrand Simon (CNRS, France)
Cache Management for Mixture-of-Experts LLMs

ABSTRACT. Large language models (LLMs) have demonstrated remarkable capabilities across a variety of tasks. One of the main challenges towards the successful deployment of LLMs is memory management, since they typically involve billions of parameters. To this end, architectures based on Mixture-of-Experts have been proposed, which aim to reduce the size of the parameters that are activated when producing a token. This raises the equally critical issue of efficiently managing the limited cache of the system, in that frequently used experts should be stored in the fast cache rather than in the slower secondary memory.

In this work, we introduce and study a new paging problem that models expert management optimization. Our formulation captures both the layered architecture of LLMs and the requirement that experts are cached efficiently. We first present lower bounds on the competitive ratio of both deterministic and randomized algorithms, which show that under mild assumptions, LRU-like policies have good theoretical competitive performance. We then propose a layer-based extension of LRU that is tailored to the problem at hand. Extensive simulations on both synthetic datasets and actual traces of MoE usage show that our algorithm outperforms policies for the classic paging problem, such as the standard LRU.
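
A minimal sketch of an expert cache for MoE inference follows. It is standard LRU over (layer, expert) pairs; the layer-aware tweak shown (prefer evicting experts that do not belong to the layer currently executing) is one plausible reading of a layer-based LRU extension, not necessarily the paper's exact policy.

from collections import OrderedDict

class ExpertCache:
    """Fast-memory cache of MoE experts, keyed by (layer, expert)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()                    # iteration order = recency

    def access(self, layer, expert, load_fn, current_layer):
        key = (layer, expert)
        if key in self.cache:                         # hit: refresh recency
            self.cache.move_to_end(key)
            return self.cache[key]
        if len(self.cache) >= self.capacity:          # miss: evict one entry
            # Prefer the least recently used entry that is not in the layer
            # being executed right now; fall back to plain LRU.
            victim = next((k for k in self.cache if k[0] != current_layer),
                          next(iter(self.cache)))
            del self.cache[victim]
        self.cache[key] = load_fn(layer, expert)      # fetch from slow memory
        return self.cache[key]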

10:50
Atte Torri (Université Paris-Saclay, LISN, CNRS, France)
Przemyslaw Dominikowski (Université Paris-Saclay, Inria, France)
Brice Pointal (Université Paris-Saclay, LISN, CNRS, France)
Oguz Kaya (Université Paris-Saclay, LISN, CNRS, France)
Laercio Lima Pilla (Université de Bordeaux, CNRS, Bordeaux INP, Inria, LaBRI, France)
Olivier Coulaud (Inria, France)
Near-optimal contraction strategies for the scalar product in the tensor-train format

ABSTRACT. Tensor-train (TT) decomposition has garnered tremendous popularity for its efficiency in handling high-dimensional data arising in scientific and quantum computing as well as machine learning applications. It provides a compact representation for matrices and vectors with a Kronecker product-like low-rank structure and enables efficient matrix-vector operations in this compressed form. The vector scalar product is among such key operations, comprising a series of tensor contractions in a specific tensor network topology whose order significantly impacts the computational cost. In this work, we propose efficient algorithms for finding near-optimal contraction orderings for tensor networks representing scalar products in the TT format. We show that our algorithms outperform all existing contraction ordering methods for general tensor networks, where the best existing method incurs up to 15% higher cost for $x^Ty$, twice the cost for $x^TAy$, and ten times higher cost for $x^TABy$ scalar products, where $x, y$ and $A, B$ are vectors and matrices expressed in the TT format, respectively.

11:10
John Augustine (IIT Madras, India)
Christian Scheideler (Paderborn University, Germany)
Julian Werthmann (Paderborn University, Germany)
Supervised Distributed Computing

ABSTRACT. We introduce a new framework for distributed computing that extends and refines the standard master-worker approach of scheduling multi-threaded computations. In this framework, there are different roles: a supervisor, a source, a target, and a collection of workers. Initially, the source stores some instance $I$ of a computational problem, and at the end, the target is supposed to store a correct solution $S(I)$ for that instance. We assume that the computation required for $S(I)$ can be modeled as a directed acyclic graph $G=(V,E)$, where $V$ is a set of tasks and $(v,w) \in E$ if and only if task $w$ needs information from task $v$ in order to be executed. Given $G$, the role of the supervisor is to schedule the execution of the tasks in $G$ by assigning them to the workers. If all workers are honest, information can be exchanged between the workers, and the workers have access to the source and target, then the supervisor only needs to know $G$ to successfully schedule the computations. That is, the supervisor does not have to handle any data itself, unlike in standard master-worker approaches, which has the tremendous benefit that tasks can be run massively in parallel in large distributed environments without the supervisor becoming a bottleneck. But what if a constant fraction of the workers is adversarial? Interestingly, we show that under certain assumptions a data-agnostic scheduling approach would even work in such an adversarial setting without (asymptotically) increasing the work required for communication and computations. We demonstrate the validity of these assumptions by presenting concrete solutions for supervised matrix multiplication and sorting.
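
A data-agnostic supervisor loop for the honest-worker case can be sketched as follows: the supervisor only tracks which DAG tasks are ready and which have finished; task inputs and outputs move directly between workers, source, and target. The dispatch/on_done callbacks are placeholders for the actual communication layer.

from collections import deque

def supervise(tasks, deps, dispatch, on_done):
    """tasks: iterable of task ids; deps: dict task -> set of predecessor tasks.
    dispatch(task) hands a ready task to some worker; on_done() blocks until a
    worker reports a finished task id and returns it."""
    remaining = {t: set(deps.get(t, ())) for t in tasks}
    succ = {t: [] for t in tasks}
    for t, preds in remaining.items():
        for p in preds:
            succ[p].append(t)
    ready = deque(t for t, preds in remaining.items() if not preds)
    in_flight = 0
    while ready or in_flight:
        while ready:                       # launch everything that is ready
            dispatch(ready.popleft())
            in_flight += 1
        finished = on_done()               # wait for any task to complete
        in_flight -= 1
        for s in succ[finished]:           # release successors whose deps are met
            remaining[s].discard(finished)
            if not remaining[s]:
                ready.append(s)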

11:30
Anne Benoit (ENS Lyon - LIP, France)
Thomas Herault (INRIA, France)
Yves Robert (ENS Lyon, France)
Alix Tremodeux (ENS Lyon, France)
Partial Detectors Versus Replication To Cope With Silent Errors

ABSTRACT. This work considers an iterative algorithm that executes on an error-prone platform. Silent errors strike at each iteration with some probability. A detector is applied as a verification mechanism before taking a checkpoint. However, this mechanism is likely not perfect, and fails to detect some errors. More precisely, an error striking at iteration $I$ will be detected only after iteration $(I-1)+X$, where $X$ is a random variable obeying a probability distribution with bounded support $[1,D]$ such as a truncated geometric distribution. Intuitively, the error silently amplifies during some iterations before it can be detected at distance $X$ or higher. As a consequence, when taking a verification before a checkpoint, there is the risk of missing an error that has struck recently but cannot be detected yet. The simplest solution is to keep two checkpoints in memory, and to use two segments of $D-1$ iterations, each followed by a verification and a checkpoint: in steady state, perform $D-1$ iterations after the last checkpoint and take a verification; (i) if successful, we can safely erase the oldest checkpoint, because the most recent checkpoint is necessarily verified. We take a new checkpoint to replace the old one, which is not verified yet; (ii) otherwise, we need to rollback to the oldest checkpoint, not to the last one which may not be verified. Can this simple scheme perform better than replication? What is the optimal number of segments (hence of checkpoints) to keep in memory, and what is the length of these segments? This work answers these questions, both theoretically and through Monte Carlo simulations.
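
For concreteness, the detection latency in this error model can be sampled from a geometric distribution truncated to the support [1, D], as in the helper below (parameter names are ours). A Monte Carlo driver would draw such a latency for every struck error and replay the segment/verification/checkpoint protocol described above.

import random

def detection_latency(q, D):
    """Sample X with P(X = k) proportional to q**(k-1) for k = 1..D."""
    weights = [q ** (k - 1) for k in range(1, D + 1)]
    return random.choices(range(1, D + 1), weights=weights)[0]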

10:30-12:00 Session 19D: Track 6.4: Graph Algorithms and Linear Algebra
10:30
Chao Wang (University of Science and Technology of China, China)
Haijie Hou (University of Science and Technology of China, China)
Longsheng Song (University of Science and Technology of China, China)
Junshi Chen (University of Science and Technology of China, China)
Hong An (University of Science and Technology of China, China)
Dongdong Tan (University of Science and Technology of China, China)
Yueqiang He (University of Science and Technology of China, China)
Sihan Lu (University of Science and Technology of China, China)
Uniform Dense Blocking for Efficient Sparse LU Factorization in First-principles Materials Simulation

ABSTRACT. Sparse matrix LU factorization is a critical method in direct solvers, playing a significant role in the field of first-principles materials simulation. Matrices in quantum chemistry problems often exhibit locally dense properties, yet their spatial structural characteristics have been overlooked in previous efforts. This paper proposes a novel LU decomposition algorithm that leverages application-specific locally dense structures by partitioning sparse matrices into uniform dense blocks. Through systematic integration of level-3 BLAS kernels, the method transforms traditionally memory-bound LU operations into compute-intensive tasks, achieving significant improvements in both computational efficiency and CPU utilization. We conducted performance tests on three vendor CPUs, including the x86-based Intel Xeon Platinum 8375C and AMD EPYC 7543, and the ARM-based Kunpeng 920 New. Experimental results demonstrate significant performance improvements compared to the state-of-the-art sparse direct solvers.
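
The payoff of uniform dense blocking comes from routing most of the factorization through level-3 BLAS. The sketch below is the standard right-looking blocked LU on a dense matrix (without pivoting), i.e. the textbook pattern the per-block computations resemble; it is not the paper's sparse algorithm.

import numpy as np
from scipy.linalg import solve_triangular

def blocked_lu(A, b=64):
    """In-place LU of a dense float matrix A (no pivoting), block size b.
    The trailing updates below map to matrix-matrix (level-3 BLAS) kernels."""
    n = A.shape[0]
    for k in range(0, n, b):
        e = min(k + b, n)
        # Unblocked factorization of the current panel (level-1/2 work).
        for j in range(k, e):
            A[j + 1:n, j] /= A[j, j]
            A[j + 1:n, j + 1:e] -= np.outer(A[j + 1:n, j], A[j, j + 1:e])
        if e < n:
            # U12 = L11^{-1} A12: triangular solve with the unit-lower panel.
            A[k:e, e:n] = solve_triangular(A[k:e, k:e], A[k:e, e:n],
                                           lower=True, unit_diagonal=True)
            # Trailing update A22 -= L21 * U12: the GEMM (level-3) call.
            A[e:n, e:n] -= A[e:n, k:e] @ A[k:e, e:n]
    return A  # L (unit lower) and U stored in place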

10:50
Soumyajit Chatterjee (Indian Institute of Technology, Hyderabad, India)
Rahul Utkoor (Qualcomm Innovation Center, Hyderabad, India)
Eshwar Uppu (Indian Institute of Technology, Hyderabad, India)
Sathya Peri (Indian Institute of Technology, Hyderabad, India)
Venkata Krishna Nandivada (Indian Institute of Technology, Madras, India)
Efficient Task Graph Scheduling for Parallel QR Factorization in SLSQP

ABSTRACT. Efficient task scheduling is paramount in parallel programming on multi-core architectures, where tasks are fundamental computational units. QR factorization is a critical sub-routine in Sequential Least Squares Quadratic Programming (SLSQP) for solving non-linear programming (NLP) problems. QR factorization decomposes a matrix into an orthogonal matrix Q and an upper triangular matrix R, which are essential for solving systems of linear equations arising from optimization problems. SLSQP uses an in-place version of QR factorization, which requires storing intermediate results for the next steps of the algorithm. Although DAG-based approaches for QR factorization are prevalent in the literature, they often lack control over the intermediate kernel results, providing only the final output matrices Q and R. This limitation is particularly challenging in SLSQP, where intermediate results of QR factorization are crucial for the back-substitution logic at each iteration. Our work introduces novel scheduling techniques using a two-queue approach to execute the QR factorization kernel effectively. This approach, implemented in the high-level C++ programming language, facilitates compiler optimizations and allows storing the intermediate results required by the back-substitution logic. Empirical evaluations demonstrate substantial performance gains, including a 10x improvement over the sequential QR version of the SLSQP algorithm.

11:10
Florian Willich (Humboldt-Universität zu Berlin, Germany)
Henning Meyerhenke (Humboldt-Universität zu Berlin, Germany)
ScaleRunner: A Fast MPI-based Random Walk Engine for Multi-CPU Systems

ABSTRACT. Random walks (RWs) on graphs have a plethora of applications, both in theory and practice. One of the currently most important applications is representation learning (RL) -- finding a suitable embedding of a graph into some low-dimensional geometric space. The demand for fast RW algorithms has led to a variety of RW engines targeting different computing architectures. In this paper, we address multi-CPU systems and aim at improving upon existing random walk engines such as KnightKing when running first- and second-order RW algorithms. To this end, we introduce ScaleRunner, a C++ library with full CMake integration that executes random walks in parallel.

Our main acceleration techniques for ScaleRunner are: (i) each random walk is modeled as a task deployed to a thread-pool, balancing the workload on each CPU separately; (ii) integration of the dynamic graph data structure DHB to speed up graph data caching operations; (iii) collective MPI I/O routines to speed up graph input, path output, and postprocessing operations. Our experiments use a variety of popular benchmark graphs to execute RW algorithms commonly used in RL applications. On average, ScaleRunner speeds up first-order RWs by one order of magnitude and second-order RWs by two orders compared to the only other MPI-based tool KnightKing.
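
The "one walk = one task" decomposition in point (i) can be illustrated as follows: each random walk is an independent unit of work submitted to a pool that balances load across cores. This Python sketch only models the task decomposition (in Python the GIL limits actual CPU parallelism); ScaleRunner's DHB structure and MPI I/O are not modeled.

import random
from concurrent.futures import ThreadPoolExecutor

def first_order_walk(adj, start, length):
    """adj: dict vertex -> list of neighbors; returns a walk of at most `length` vertices."""
    path = [start]
    for _ in range(length - 1):
        nbrs = adj[path[-1]]
        if not nbrs:
            break
        path.append(random.choice(nbrs))
    return path

def run_walks(adj, starts, length, workers=8):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda s: first_order_walk(adj, s, length), starts))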

12:00-13:30 Lunch Break
13:30-14:30 Session 20A: Track 2.5: Scheduling, Resource Management, Cloud, Edge Computing, and Workflows
13:30
Olivier Beaumont (Inria, France)
Raphael Bourgouin (Inria, France)
Maxime Darrin (Mistral AI, France)
Loris Marchal (IRL ILLS (CNRS, McGill, ETS Montreal), Canada)
Pablo Piantanida (IRL ILLS (CNRS, McGill, ETS Montreal), Canada)
Leveraging Expert Usage to Speed up LLM Inference with Expert Parallelism

ABSTRACT. Large language models have become indispensable for many text-processing applications. Their inference, i.e. their use to generate text, is a time-consuming task, since tokens have to be generated one after the other, even if the computational load has been reduced by model sparsification, e.g. by using Mixture-of-Experts (MoE) models. In the MoE context, a subset of experts is selected at each stage. Note that not all subsets of experts (pairs of experts in most cases) in a given layer have the same probability of being selected. When experts are mapped to different GPUs, there is a risk of load imbalance if the selected experts end up on a small number of GPUs. This paper proposes to leverage this heterogeneity in expert usage to map experts of popular subsets onto distinct GPUs, allowing them to be processed in parallel and thus reducing the time needed for inference. Even though this mapping problem is NP-complete, it is possible to design simple greedy strategies that significantly reduce the need for sequential expert processing. Our proof-of-concept confirms that our mapping strategies effectively reduce inference time on the Mixtral model.
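
One simple greedy strategy in this spirit (not necessarily the authors' exact heuristic) walks expert pairs from most to least frequently co-selected and tries to place the two experts of a popular pair on different GPUs while balancing per-GPU expert counts.

def place_experts(num_experts, num_gpus, pair_freq):
    """pair_freq: dict {(e1, e2): count} of how often each expert pair is co-selected."""
    gpu_of, load = {}, [0] * num_gpus
    for (a, b), _ in sorted(pair_freq.items(), key=lambda kv: -kv[1]):
        for e, partner in ((a, b), (b, a)):
            if e not in gpu_of:
                forbidden = gpu_of.get(partner)          # avoid the partner's GPU if known
                candidates = [g for g in range(num_gpus) if g != forbidden]
                g = min(candidates or range(num_gpus), key=lambda g: load[g])
                gpu_of[e] = g
                load[g] += 1
    # Experts that never appear in a recorded pair go to the least-loaded GPU.
    for e in range(num_experts):
        if e not in gpu_of:
            g = min(range(num_gpus), key=lambda g: load[g])
            gpu_of[e] = g
            load[g] += 1
    return gpu_of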

13:50
Ana Gainaru (Oak Ridge National Laboratory, United States)
Scott Klasky (ORNL, United States)
Guillaume Pallez (INRIA, France)
Priority-BF: a Task Manager for Priority-Based Scheduling

ABSTRACT. The increasing demand for computational resources, particularly in HPC environments, necessitates efficient job scheduling strategies that account for varying job priorities. This work addresses the challenge of managing concurrent jobs with differing priorities on overloaded parallel systems, where strict QoS constraints are often difficult for users to define. Our solution draws from two key approaches: the Easy-BF algorithm and the Conservative Backfilling algorithm. It incorporates job priorities, providing a transparent and adaptable solution. By allowing users to assign priority levels, our approach enables flexible resource allocation, reverting to standard Easy-BF behavior when all jobs have equal importance, or to Conservative Backfilling when all jobs have different importance. This method supports multiple priority categories, facilitating a more nuanced and equitable distribution of computational resources, and enhancing the response time for high-priority jobs by 50% without affecting overall system utilization.

14:10
Joachim Cendrier (ENS Lyon, France)
Rajini Wijayawardana (University of Chicago, United States)
Anne Benoit (ENS Lyon, France)
Yves Robert (ENS Lyon, France)
Frédéric Vivien (INRIA, France)
Andrew Chien (University of Chicago, United States)
Green Scheduling on the Edge

ABSTRACT. This work aims at designing and evaluating scheduling algorithms that minimize carbon cost on edge platforms. When a job is released to some edge server, difficult scheduling questions arise: should the job be executed on that server? If yes, when? If no, which other edge server should the job be transferred to? Typically, jobs are submitted online, and have a deadline to enforce. Online scheduling problems are already difficult without accounting for different energy sources, so one should not expect any optimal solution. Still, an important research goal is to revisit standard algorithms such as Earliest Completion Time (ECT) and Earliest Deadline First (EDF) in order to design and evaluate carbon-aware variants. This paper introduces several new algorithms that use sophisticated scheduling policies to efficiently decrease carbon cost; these algorithms maximize the use of green energy both on local and remote edge servers, by re-evaluating previous decisions whenever needed to accommodate newly released jobs. We provide a comprehensive simulation campaign based on actual platform/job data and carbon traces and report an average gain of 48% over standard approaches.
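
A carbon-aware variant of Earliest Completion Time could look like the toy placement routine below: for a newly released job, enumerate the candidate edge servers and feasible start slots before the deadline and pick the cheapest placement in carbon terms. The data layout and the exhaustive slot scan are illustrative assumptions, not the paper's algorithms.

def carbon_aware_place(job_len, release, deadline, servers):
    """servers: dict name -> (first_free_slot, list of carbon intensity per slot).
    Returns (carbon_cost, server, start_slot) or None if no feasible placement."""
    best = None
    for name, (free_from, carbon) in servers.items():
        start = max(release, free_from)
        for s in range(start, deadline - job_len + 1):
            cost = sum(carbon[s:s + job_len])
            if best is None or cost < best[0]:
                best = (cost, name, s)
    return best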

13:30-14:30 Session 20B: Track 5.2: Theory and Algorithms
13:30
Chryssis Georgiou (University of Cyprus, Cyprus)
Piduguralla Manaswini (IIT Hyderabad, India)
Sathya Peri (Indian Institute of Technology Hyderabad, India)
Byzantine-Tolerant Consensus in GPU-Inspired Shared Memory

ABSTRACT. In this work, we formalize a novel shared memory model inspired by the popular GPU architecture. Within this model, we develop algorithmic solutions to the Byzantine Consensus problem.

13:50
Thomas Koopman (Radboud University, Netherlands)
Sven-Bodo Scholz (Radboud University, Netherlands)
Bernard van Gastel (Radboud University, Netherlands)
Partitioning In-Place on Massively Parallel Systems

ABSTRACT. Data partitioning, i.e. rearranging data according to a Boolean predicate, is needed for many performance-critical operations on large data. This includes algorithms such as sorting and convex hull computation, as well as load balancing across a cluster and graph algorithms.

Several efficient implementations for data partitioning have been proposed in the literature. Some of these focus on the ability to compute the result with only a constant amount of additional memory, known as in-place implementations. Others focus on efficient parallel execution. Our strategy for implementing an in-place algorithm on GPU-accelerated systems keeps memory requirements and data movement as low as possible while maintaining enough parallelism and coalesced access patterns.

We present an in-place partitioning algorithm that can be executed on massively parallel systems. Our implementation maps well to GPU architectures while moving only a negligible amount of data more than necessary, for non-adversarial input. We quantify 'negligible' by providing a probabilistic bound for random input and derive a worst-case bound.

Our performance evaluation demonstrates that our algorithm achieves between 94% and 100% of peak performance on large random arrays for two different GPU architectures. Even in the worst-case scenario of small, adversarially ordered arrays, our algorithm attains 60% of peak performance.
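
For reference, the operation being parallelized is the classic in-place partition by a predicate, shown below as a sequential two-pointer sketch (constant extra memory). The GPU version described in the abstract additionally needs block-wise, coalesced processing, which this sketch does not model.

def partition_in_place(a, pred):
    """Rearrange list a so elements satisfying pred come first; O(1) extra memory."""
    lo, hi = 0, len(a) - 1
    while lo < hi:
        while lo < hi and pred(a[lo]):
            lo += 1
        while lo < hi and not pred(a[hi]):
            hi -= 1
        if lo < hi:
            a[lo], a[hi] = a[hi], a[lo]
    return sum(1 for x in a if pred(x))   # split point = number of True elements

data = [5, 2, 8, 1, 9, 3]
split = partition_in_place(data, lambda x: x % 2 == 0)   # evens first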

13:30-14:30 Session 20C: Track 6.5: GPU and Quantum Systems
13:30
Massimiliano Meneghin (Autodesk Research, Italy)
Ahmed Mahmoud (MIT, United States)
Disaggregated Design for GPU-Based Volumetric Data Structures

ABSTRACT. Volumetric data structures typically prioritize data locality, focusing on efficient memory access patterns. This singular focus can neglect other critical performance factors, such as occupancy, communication, and kernel fusion. We introduce a novel disaggregated design that rebalances trade-offs between locality and these objectives---reducing communication overhead on distributed memory architectures, mitigating register pressure in complex boundary conditions, and enabling more kernel fusion. We provide a thorough analysis of its benefits on a single-node multi-GPU Lattice Boltzmann Method (LBM) solver. Our evaluation spans dense, block-sparse, and multi-resolution discretizations, demonstrating our design's flexibility and efficiency. Leveraging this approach, we achieve up to a 3x speedup over state-of-the-art solutions.

13:50
Jiale Zhang (Jilin University, China)
Xilong Che (Jilin University, China)
Yuzhe Fan (Jilin University, China)
Juncheng Hu (Jilin University, China)
Quantum Delta Encoding: Optimizing Data Storage on Quantum Computers with Resource Efficiency

ABSTRACT. In recent years, numerous studies have explored efficient methods for encoding classical data into quantum systems by leveraging Quantum Random Access Memory (QRAM) to facilitate subsequent data processing tasks. However, current state-of-the-art encoding techniques rely on extensive multi-qubit controlled-NOT gates and require complex quantum gate decompositions to ensure compatibility with existing hardware. Other approaches have attempted to employ quantum neural networks for state preparation—either to enable quantum data compression or to perform quantum Fourier transforms for preserving frequency-domain information—but these methods typically involve significant preprocessing and fail to accurately recover the original classical data. In this paper, inspired by classical Delta Encoding, we propose Quantum Delta Encoding (QDE), which stores the majority of data in a benchmark and encodes only the deviations via entanglement, thereby significantly reducing the need for entangled qubits and quantum gates during storage. Moreover, QDE can seamlessly integrate with QRAM to support subsequent quantum data processing tasks—such as image processing and data encryption—thus mitigating the additional errors and losses associated with repeated classical-to-quantum data exchanges. We evaluate the advantages of QDE over state-of-the-art models using real-world datasets and assess its robustness against quantum noise. Experiments conducted on both the IBM Quantum platform’s simulator and two real superconducting quantum computers confirm the validity and potential of the QDE approach. All codes and data are available at https://github.com/kennyZhangsky/Quantum-Delta-Encoding.git.
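
The classical delta encoding that inspires QDE is summarized below: keep one benchmark value and store only per-element deviations, so most of the information lives in the cheap benchmark. This is the classical analogue only; the quantum encoding of the deviations via entanglement is not modeled here.

def delta_encode(values):
    """Return a benchmark plus the deviation of each value from it."""
    benchmark = values[0]
    deltas = [v - benchmark for v in values]
    return benchmark, deltas

def delta_decode(benchmark, deltas):
    return [benchmark + d for d in deltas]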

14:10
Yi-Hua Chung (University of Wisconsin at Madison, United States)
Shui Jiang (The Chinese University of Hong Kong, Hong Kong)
Wan Luan Lee (University of Wisconsin at Madison, United States)
Yanqing Zhang (Nvidia Corporation, United States)
Haoxing Ren (Nvidia Corporation, United States)
Tsung-Yi Ho (The Chinese University of Hong Kong, Hong Kong)
Tsung-Wei Huang (University of Wisconsin at Madison, United States)
SimPart: A Simple Yet Effective Replication-aided Partitioning Algorithm for Logic Simulation on GPU

ABSTRACT. Replication-aided partitioning (RAP) has recently been introduced to facilitate the design of parallel logic simulation algorithms. By replicating overlapped work, RAP can significantly reduce the cost of inter-thread synchronization. However, the state-of-the-art RAP algorithm, RepCut, relies on time-consuming hypergraph construction and partitioning, where minimizing cut size corresponds to reducing replication. To overcome this runtime challenge, we introduce SimPart, a simple yet highly effective and efficient GPU-parallel replication-aided partitioner. SimPart tackles the partitioning problem directly without solving another proxy problem and proposes a hybrid strategy that can maximally utilize GPU threads for simulation atop our partitions. Compared to RepCut, SimPart achieves an average speedup of 23×in partitioning and 1.58×in GPU-parallel simulation, while increasing the original graph size by only 0.3%.