EURO-PAR 2024: 30TH INTERNATIONAL EUROPEAN CONFERENCE ON PARALLEL AND DISTRIBUTED COMPUTING
PROGRAM FOR WEDNESDAY, AUGUST 28TH

09:30-10:30 Session 1: Keynote: Mateo Valero

European supercomputers: buying versus building 

In 2017, Europe created the EuroHPC initiative and its associated legal funding structure, the “EuroHPC JU” Joint Undertaking, with two main objectives. The first objective is to acquire, build, and deploy world-class high-performance computing (HPC) infrastructure across Europe. The second objective is to conduct research and development to build HPC hardware manufactured in Europe, as well as the applications (software) that would run on future locally developed European supercomputers. This talk will cover both objectives in detail. On the one hand, Europe has recently committed a substantial amount of money to the first goal. For example, in the June 2024 Top-500 list, 9 of the Top-20 supercomputers are from Europe. We will go deeper and describe the two main components of the heterogeneous MareNostrum 5 supercomputer, listed separately in positions 8 and 22 of the June 2024 Top-500. Installed at our Barcelona site, MareNostrum 5 is a good illustration of the challenges of building a contemporary supercomputer; for example, space requirements dictated that BSC could no longer house it within our Church. MareNostrum 5 therefore had to be installed in a larger space, while the Church will be used to install our first Quantum Computer, thus fulfilling the prophecy made by Dan Brown in his book "Origin".

On the other hand, in the second part of my talk, I will describe the European approach to designing general Made-in-Europe processors and accelerators leveraging the RISC-V Open Instruction Set Architecture (ISA). Currently, this approach is embodied in several large-scale European research projects, namely the European Processor Initiative, EUPilot, and eProcessor, as well as some nationally funded projects. I will briefly describe these projects, including the proof-of-concept chips that successfully boot Linux. I will then hint at the future and describe the initiatives that Europe and BSC are pursuing with the main goal of developing software and hardware for the MareNostrum 6 supercomputer, which should be a reality in 2027-2028.

Location: Auditorium
10:30-11:00 Coffee Break
11:00-13:00 Session 2: Best Paper Candidates Session
Location: Auditorium
11:00
Milo Lurati (VU Amsterdam, Netherlands)
Stijn Heldens (Netherlands eScience Center, Netherlands)
Alessio Sclocco (Netherlands eScience Center, Netherlands)
Ben van Werkhoven (Leiden University, Netherlands)
Bringing auto-tuning to HIP: Analysis of tuning impact and difficulty on AMD and Nvidia GPUs (Artifact)

ABSTRACT. Many studies have focused on developing and improving auto-tuning algorithms for Nvidia Graphics Processing Units (GPUs), but the effectiveness and efficiency of these approaches on AMD devices have hardly been studied. This paper aims to address this gap by introducing an auto-tuner for AMD's Heterogeneous-computing Interface for Portability (HIP). We do so by extending Kernel Tuner, an open-source Python library for auto-tuning GPU programs. We analyze the performance impact and tuning difficulty of auto-tuning four highly tunable benchmark kernels on four different GPUs: two from Nvidia and two from AMD. Our results demonstrate that auto-tuning has a significantly higher impact on performance on AMD than on Nvidia (13x vs 2x). Additionally, we show that applications tuned for Nvidia do not perform optimally on AMD, underscoring the importance of auto-tuning specifically for AMD to achieve high performance on AMD GPUs.
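For readers unfamiliar with the workflow, the following is a minimal, hedged sketch of an auto-tuning run with Kernel Tuner, the library the paper extends. The vector-add kernel and the tuning parameters are illustrative placeholders rather than the paper's benchmark kernels, and lang="HIP" assumes the HIP backend introduced by this work is available.

    # Minimal Kernel Tuner run; the kernel and parameter values are illustrative.
    import numpy as np
    from kernel_tuner import tune_kernel

    kernel_source = """
    __global__ void vector_add(float *c, const float *a, const float *b, int n) {
        int i = blockIdx.x * block_size_x + threadIdx.x;  // block_size_x is a tunable token
        if (i < n) {
            c[i] = a[i] + b[i];
        }
    }
    """

    size = 10_000_000
    n = np.int32(size)
    a = np.random.randn(size).astype(np.float32)
    b = np.random.randn(size).astype(np.float32)
    c = np.zeros_like(a)

    # The tuner compiles and benchmarks one kernel instance per block size
    # and reports the best-performing configuration for this GPU.
    tune_params = {"block_size_x": [32, 64, 128, 256, 512, 1024]}
    results, env = tune_kernel("vector_add", kernel_source, size,
                               [c, a, b, n], tune_params, lang="HIP")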

11:20
Olivier Beaumont (Univ. Bordeaux, CNRS, Bordeaux INP, Inria, LaBRI, UMR 5800, Talence, France)
Rémi Bouzel (Qarnot Computing, Montrouge, France)
Lionel Eyraud-Dubois (Univ. Bordeaux, CNRS, Bordeaux INP, Inria, LaBRI, UMR 5800, Talence, France)
Esragul Korkmaz (Univ. Bordeaux, CNRS, Bordeaux INP, Inria, LaBRI, UMR 5800, Talence, France)
Laercio Pilla (Univ. Bordeaux, CNRS, Bordeaux INP, Inria, LaBRI, UMR 5800, Talence, France)
Alexandre Van Kempen (Qarnot Computing, Montrouge, France)
A 1.25(1+ε)-Approximation Algorithm for Scheduling with Rejection Costs Proportional to Processing Times (Artifact)
PRESENTER: Esragul Korkmaz

ABSTRACT. We address an offline job scheduling problem where jobs can either be processed on a limited supply of energy-efficient machines or offloaded to energy-inefficient machines (of which there is an unlimited supply), and the goal is to minimize the total energy consumed in processing all tasks. This scheduling problem can be formulated as a problem of scheduling with rejection, where rejecting a job corresponds to processing it on an energy-inefficient machine and has a cost directly proportional to the processing time of the job. To solve this scheduling problem, we introduce a novel 1.25(1+ε)-approximation algorithm, BEKP, by associating it with a Multiple Subset Sum problem. Our algorithm improves over the existing literature, which provides a (1.5 - 1/(2m))-approximation for scheduling with arbitrary rejection costs. We evaluate and discuss the effectiveness of our approach through a series of experiments, comparing it to existing algorithms.
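To make the setting concrete, one standard formulation of scheduling with rejection, which the energy objective above maps onto (hedged: the paper's exact objective may be stated differently), is to choose an accepted set $A \subseteq J$ for the $m$ energy-efficient machines and pay a penalty proportional to processing time for each rejected job: $\min_{A \subseteq J} \left( C_{\max}(A) + \sum_{j \in J \setminus A} \rho\, p_j \right)$, where $C_{\max}(A)$ is the makespan of the accepted jobs on the $m$ machines, $p_j$ is the processing time of job $j$, and $\rho$ is the cost factor of the inefficient machines. The (1.5 - 1/(2m)) bound cited above holds for arbitrary rejection costs $e_j$; the structure $e_j = \rho\, p_j$ is what the improved algorithm exploits.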

11:40
Hamidreza Ramezanikebrya (University of British Columbia, Canada)
Matei Ripeanu (University of British Columbia, Canada)
(re)Assessing PiM Effectiveness for Sequence Alignment
PRESENTER: Matei Ripeanu

ABSTRACT. Processing-in-Memory (PiM) technology has emerged as a promising solution for high-performance applications that hit the "memory wall". This paper (re)examines PiM effectiveness for sequence alignment, a pivotal bottleneck in genome analysis and an application context that ideally matches PiM strengths, as it offers ample parallelism and is memory-intensive. While simulation has been the methodology used by numerous past attempts to evaluate PiM effectiveness, we use commercially available PiM hardware (UPMEM) for an exploration based on direct implementation and performance analysis.

Our results show that, contrary to the existing literature, minimally optimized implementations of novel alignment algorithms, such as WFA, outperform highly optimized code targeting the UPMEM architecture by up to $4.77\times$ and $2.72\times$ in terms of throughput and power consumption, respectively. This suggests that, while the avenue offered by PiM is potentially promising, existing technologies are not yet mature enough to replace CPU platforms for bioinformatics tasks.

12:00
Thorsten Wittkopp (Technische Universität Berlin, Germany)
Philipp Wiesner (Technische Universität Berlin, Germany)
Odej Kao (Technische Universität Berlin, Germany)
LogRCA: Log-based Root Cause Analysis for Distributed Services

ABSTRACT. To assist IT service developers and operators in managing their increasingly complex service landscapes, there is a growing effort to leverage artificial intelligence in operations. To speed up troubleshooting, log anomaly detection in particular has received much attention; it deals with identifying log events that indicate the reasons for a system failure. However, faults often propagate extensively within systems, which can result in a large number of anomalies being detected by existing approaches. In this case, it can remain very challenging for users to quickly identify the actual root cause of a failure.

We propose LogRCA, a novel method for identifying a minimal set of log lines that together describe a root cause. LogRCA uses a semi-supervised learning approach to deal with rare and unknown errors and is designed to handle noisy data. We evaluated our approach on a large-scale production log data set of 44.3 million log lines, which contains 80 failures, whose root causes were labeled by experts. LogRCA consistently outperforms baselines based on deep learning and statistical analysis in terms of precision and recall to detect candidate root causes. In addition, we investigated the impact of our deployed data balancing approach, demonstrating that it considerably improves performance on rare failures.

12:20
Kåre von Geijer (Chalmers University of Technology, Sweden)
Philippas Tsigas (Chalmers University of Technology, Sweden)
How to Relax Instantly: Elastic Relaxation of Concurrent Data Structures (Artifact)
PRESENTER: Kåre von Geijer

ABSTRACT. The sequential semantics of many concurrent data structures, such as stacks and queues, inevitably lead to memory contention in parallel environments, thus limiting scalability. Semantic relaxation has the potential to address this issue, increasing the parallelism at the expense of weakened semantics. Although prior research has shown that improved performance can be attained by relaxing concurrent data structure semantics, there is no one-size-fits-all relaxation that adequately addresses the varying needs of dynamic executions.

In this paper, we first introduce the concept of elastic relaxation and subsequently present the Lateral structure, an algorithmic component that supports the design of elastically relaxed concurrent data structures. Using the Lateral, we design novel elastically relaxed, lock-free queues and stacks capable of reconfiguring their relaxation at run-time. We establish linearizability and define upper bounds for relaxation errors in our designs. Experimental evaluations show that our elastic designs hold up against state-of-the-art statically relaxed designs, while also swiftly managing trade-offs between relaxation and operational latency. We also outline how to use the Lateral to design elastically relaxed lock-free counters and deques.
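To give intuition for what relaxation buys, here is a minimal single-process Python sketch of a relaxed FIFO queue split into sub-queues. It only illustrates the semantics (per-sub-queue FIFO, with bounded global reordering possible under concurrent access); the paper's lock-free designs and the Lateral component that makes the relaxation elastic at run-time are far more involved.

    # Sketch of a relaxed queue: operations spread over `width` sub-queues,
    # so concurrent threads contend on different heads, at the cost of items
    # being dequeued up to (width - 1) positions out of order.
    import itertools
    from collections import deque

    class RelaxedQueue:
        def __init__(self, width):
            self.subqueues = [deque() for _ in range(width)]
            self._enq = itertools.cycle(range(width))
            self._deq = itertools.cycle(range(width))

        def enqueue(self, item):
            # Round-robin placement; a concurrent version would let each
            # thread pick a sub-queue instead of contending on one tail.
            self.subqueues[next(self._enq)].append(item)

        def dequeue(self):
            # Take from the next non-empty sub-queue: FIFO per sub-queue,
            # only approximately FIFO globally.
            for _ in range(len(self.subqueues)):
                q = self.subqueues[next(self._deq)]
                if q:
                    return q.popleft()
            raise IndexError("empty relaxed queue")

    q = RelaxedQueue(width=4)
    for i in range(8):
        q.enqueue(i)
    print([q.dequeue() for _ in range(8)])  # FIFO here; relaxed under concurrency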

12:40
Richard Angersbach (Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany)
Sebastian Kuckuk (Friedrich-Alexander-Universität Erlangen-Nürnberg, National High Performance Computing Center, Germany)
Harald Köstler (Friedrich-Alexander-Universität Erlangen-Nürnberg, National High Performance Computing Center, Germany)
Code Generation for Octree-Based Multigrid Solvers with Fused Higher-Order Interpolation and Communication

ABSTRACT. This paper presents a novel method designed to generate multigrid solvers optimized for octree-based software frameworks. Our approach focuses on accurately capturing local features within a domain while leveraging the efficiency inherent in multigrid techniques. We outline the essential steps involved in generating specialized kernels for local refinement and communication routines which integrate on-the-fly interpolations to seamlessly transfer information between refinement levels. The generated numerical solvers and communication routines are automatically specialized for coupling with existing implementations of complex octree data structures and algorithms that are often found in established HPC frameworks. We demonstrate the effectiveness of our method through numerical experiments with different interpolation orders as well as with large-scale benchmarks on the SuperMUC-NG cluster. A comparison against a manual reference implementation highlights the benefits of our method and code generation in general.

13:00-14:00 Lunch Break
15:00-15:30 Coffee Break
15:30-17:30 Session 4A: Architectures and Accelerators (I)
Location: -1.A.02
15:30
Marius Meyer (Paderborn University, Germany)
Tobias Kenter (Paderborn University, Germany)
Kenneth O'Brien (AMD Research, Ireland)
Lucian Petrica (AMD Research, Ireland)
Michaela Blott (AMD Research, Ireland)
Christian Plessl (Paderborn University, Germany)
Optimizing Communication for Latency Sensitive HPC Applications on up to 48 FPGAs Using ACCL
PRESENTER: Marius Meyer

ABSTRACT. Most FPGA boards in the HPC domain are well-suited for parallel scaling because of the direct integration of versatile and high-throughput network ports. However, the utilization of their network capabilities is often challenging and error-prone because the whole network stack and communication patterns have to be implemented and managed on the FPGAs. Also, this approach conceptually involves a trade-off between the performance potential of improved communication and the impact of resource consumption for communication infrastructure, since the utilized resources on the FPGAs could otherwise be used for computations. In this work, we investigate this trade-off, first by using synthetic benchmarks to evaluate the different configuration options of the communication framework ACCL and their impact on communication latency and throughput. We then use our findings to implement a shallow water simulation whose scalability heavily depends on low-latency communication. With a suitable configuration of ACCL, good scaling behavior can be shown on up to all 48 FPGAs installed in the system. Overall, the results show that the availability of inter-FPGA communication frameworks, as well as the configurability of the framework and network stack, is crucial to achieving the best application performance with low-latency communication.

15:50
Keegan Sanchez (Washington State University Vancouver, United States)
Alex Gavin (Washington State University Vancouver, United States)
Suren Byna (The Ohio State University, United States)
Kesheng Wu (Lawrence Berkeley National Laboratory, United States)
Xuechen Zhang (Washington State University Vancouver, United States)
A High-Performance Collective I/O Framework Leveraging Node-Local Persistent Memory
PRESENTER: Keegan Sanchez

ABSTRACT. Collective I/Os are widely used to transform small, non-contiguous accesses into large, contiguous accesses for parallel I/O optimization. The existing collective I/O techniques were proposed under the assumption that computer memory is volatile, and their effectiveness is limited by the size of collective I/O buffers and by communication overhead. In this paper, we propose PMIO, a novel collective I/O framework that employs node-local persistent memory on compute nodes for I/O optimization of HPC applications. First, it uses a log-structured buffer to achieve the high bandwidth of persistent memory and to enforce crash consistency. Given this design, we can safely enlarge the collective I/O buffers without losing data when failures happen. Second, because persistent memory is less space-constrained than the more expensive DRAM, PMIO can buffer data across multiple collective I/O calls before writing them back to parallel file systems, further improving I/O performance. Third, we design a two-level log merging approach to reduce the communication overhead of data shuffling among MPI processes on compute nodes. Our experimental results with representative MPI-IO benchmarks show that PMIO improves I/O throughput by up to 121x for writes and 151x for reads on the Perlmutter supercomputer.
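For context, the baseline pattern PMIO optimizes looks roughly like the following mpi4py sketch, where each rank contributes its block through one collective call. PMIO's contribution, a persistent-memory log-structured buffer interposed between such calls and the parallel file system, is not shown; the file name and sizes are illustrative.

    # Collective MPI-IO write: all ranks call Write_at_all together, so the
    # MPI library can aggregate small per-rank blocks into large requests.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    block = np.full(1024, rank, dtype=np.float64)   # this rank's data
    offset = rank * block.nbytes                    # contiguous global layout

    fh = MPI.File.Open(comm, "output.dat",
                       MPI.MODE_WRONLY | MPI.MODE_CREATE)
    fh.Write_at_all(offset, block)   # collective: ranks coordinate the I/O
    fh.Close()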

16:10
Yunkun Liao (SKLP, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences; Zhongguancun Laboratory, China)
Jingya Wu (SKLP, Institute of Computing Technology, CAS, China)
Wenyan Lu (SKLP, Institute of Computing Technology, CAS; YUSUR Tech Co., Ltd, China)
Xiaowei Li (SKLP, Institute of Computing Technology, CAS; Zhongguancun Laboratory, China)
Guihai Yan (SKLP, Institute of Computing Technology, CAS; YUSUR Tech Co., Ltd, China)
Efficient RNIC Cache Side-channel Attack Detection through DPU-driven Architecture
PRESENTER: Yunkun Liao

ABSTRACT. Remote Direct Memory Access (RDMA) is increasingly favored for its high-bandwidth and low-latency communication capabilities in cloud environments. However, traditional RDMA network interface cards (RNICs) act as passive remote memory controllers and are susceptible to cache side-channel attacks. The Data Processing Unit (DPU) represents the latest evolution of the RNIC, integrating programmable logic with the RDMA engine. In this work, we propose a paradigm shift in detecting RNIC cache side-channel attacks: from network-core programmable switches to the end-host DPU. By leveraging the programmable logic of the DPU to accelerate cache side-channel attack detection, we demonstrate significant potential for enhanced performance and scalability. Our evaluation results show that the DPU-driven architecture can reduce detection latency by up to 98.7% compared to the state-of-the-art switch-centric architecture.

16:30
Pedro Rigon (Institute of Informatics - UFRGS, Brazil)
Brenda Schussler (Institute of Informatics - UFRGS, Brazil)
Alexandre Sardinha (Petrobras, Brazil)
Pedro Mario Silva (NVIDIA, Brazil)
Fábio Alves de Oliveira (NVIDIA, Brazil)
Alexandre Carissimi (INF/UFRGS, Brazil)
Jairo Panetta (ITA, Brazil)
Arthur Lorenzon (Federal University of Rio Grande do Sul, Brazil)
Philippe Navaux (UFRGS, Brazil)
Harnessing Data Movement Strategies to Optimize Performance-Energy Efficiency of Oil & Gas Simulations in HPC
PRESENTER: Arthur Lorenzon

ABSTRACT. The computing demands of Oil & Gas exploration applications have increased over the years. As a result, graphics processing units (GPUs) are becoming an essential resource due to their high-performance capabilities. At the same time, energy consumption has emerged as a significant challenge, as the power requirements of such systems grow with their data needs. Because Oil & Gas applications have many data exchange points between CPU and GPU memories to keep data updated during execution, improving the communication task is essential to enhance the trade-off between performance and energy consumption, represented by the energy-delay product (EDP). Hence, in this paper, we employ four different data movement optimization strategies on an RTM application to improve its performance, energy, and EDP. Through extensive experiments over five NVIDIA GPU generations (from Pascal to GH200), we show that employing the right data movement strategy can improve performance by 62.2% and EDP by 78.1%. We also show that advances in the software and hardware layers of NVIDIA GPUs over generations are improving the unified memory technique in terms of performance, energy, and EDP.
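As a hedged illustration of two of the strategy families the abstract alludes to (the paper's four strategies are not named here, and the RTM kernels are replaced by a stand-in), this CuPy sketch contrasts explicit host-device copies at each exchange point with CUDA unified (managed) memory, where pages migrate on demand:

    import cupy as cp
    import numpy as np

    wavefield = np.random.randn(1024, 1024).astype(np.float32)

    # Strategy A: explicit transfers around every kernel launch.
    d_wave = cp.asarray(wavefield)        # host -> device copy
    d_wave *= 0.5                         # stand-in for an RTM kernel
    wavefield = cp.asnumpy(d_wave)        # device -> host copy

    # Strategy B: unified (managed) memory; allocations are visible to both
    # CPU and GPU, and the driver migrates pages instead of explicit copies.
    cp.cuda.set_allocator(cp.cuda.malloc_managed)
    d_wave = cp.asarray(wavefield)        # now backed by managed memory
    d_wave *= 0.5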

15:30-17:30 Session 4B: Theory and Algorithms (I)
Location: -1.A.03
15:30
Andrzej Lingas (Lund University, Sweden)
Boolean Matrix Multiplication for Highly Clustered Data on the Congested Clique

ABSTRACT. We present a protocol for the Boolean matrix product of two $n\times n$ Boolean matrices on the congested clique, designed for the situation when the rows of the first matrix or the columns of the second matrix are highly clustered in the space $\{0,1\}^n$. It uses $\tilde{O}\left(\sqrt{\frac{M}{n}+1}\right)$ rounds on the congested clique with $n$ nodes, where $M$ is the minimum of the cost of an MST of the rows of the first input matrix and the cost of an MST of the columns of the second input matrix in the Hamming space $\{0,1\}^n$. A key step in our protocol is the computation of an approximate minimum spanning tree of a set of $n$ points in the space $\{0,1\}^n$. We provide a protocol for this problem (of interest in its own right) based on a known randomized technique of dimension reduction in Hamming spaces. It constructs an $O(1)$-factor approximation of an MST of $n$ points in the Hamming space $\{0,1\}^n$ using $O(\log^2 n)$ rounds on the congested clique with $n$ nodes.

15:50
Eunji Lee (Sogang University, South Korea)
Yoonsang Han (Sogang University, South Korea)
Gordon Moon (Sogang University, South Korea)
Accelerated Block-Sparsity-Aware Matrix Reordering for Leveraging Tensor Cores in Sparse Matrix-Multivector Multiplication (Artifact)
PRESENTER: Eunji Lee

ABSTRACT. Sparse Matrix-Multivector (SpMM) multiplication is a key kernel in deep learning models and scientific computing applications. However, achieving high performance for SpMM is challenging due to the irregular distribution of non-zero elements and the resulting memory access patterns. Therefore, several sparse matrix reordering algorithms have been developed to improve data locality for SpMM. However, existing approaches for reordering sparse matrices have not considered block sparsity during the reordering process. In this paper, we present a novel algorithm for sparse matrix reordering that considers block sparsity to enhance data locality for SpMM on Tensor Cores. To alleviate the main bottleneck of reordering, the substantial computation required to measure similarity between rows, we develop an efficient GPU implementation using dynamic parallelism and synchronization schemes. Experimental results on a large number of sparse matrices demonstrate the effectiveness of our reordering algorithm and the benefits of leveraging Tensor Cores for SpMM. Our approach achieves a significant performance improvement over various state-of-the-art SpMM implementations.
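To illustrate the underlying idea in a much-simplified, hedged form: the following CPU sketch greedily chains rows by Jaccard similarity of their column patterns so that similar rows become adjacent and non-zeros pack into denser blocks. The paper's algorithm additionally accounts for block sparsity and runs on the GPU with dynamic parallelism; none of that is reflected here.

    # Greedy similarity-based row reordering (O(n^2), illustration only).
    import numpy as np
    import scipy.sparse as sp

    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if (a or b) else 0.0

    def reorder_rows(A_csr):
        # Column indices of each row, taken straight from the CSR arrays.
        cols = [A_csr.indices[A_csr.indptr[i]:A_csr.indptr[i + 1]]
                for i in range(A_csr.shape[0])]
        remaining = set(range(A_csr.shape[0]))
        order = [remaining.pop()]
        while remaining:
            last = order[-1]
            # Place next the row most similar to the previous one.
            nxt = max(remaining, key=lambda r: jaccard(cols[last], cols[r]))
            remaining.remove(nxt)
            order.append(nxt)
        return A_csr[order, :]

    A = sp.random(256, 256, density=0.05, format="csr")
    A_reordered = reorder_rows(A)   # compute (P A) B, then un-permute rows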

16:10
Roy Nissim (The Hebrew University of Jerusalem, Israel)
Oded Schwartz (The Hebrew University, Israel)
Yuval Spiizer (Tel-Aviv University, Israel)
Communication Minimizing Toom-Cook Algorithms
PRESENTER: Yuval Spiizer

ABSTRACT. Long integer multiplication is a fundamental kernel in many linear algebra and cryptography computations. The Toom-Cook-$k$ algorithms ($k \in \mathbb{N}$) are a family of fast long integer multiplication algorithms frequently used in many applications, particularly for small $k$ ($2$, $3$, and $4$). Previous studies focus on minimizing Toom-Cook's arithmetic cost, sometimes at the expense of asymptotically higher communication costs and memory footprint. For many high-performance computing applications, the bottleneck is communication rather than arithmetic. We propose new versions of the Toom-Cook-$k$ algorithms that simultaneously reduce their arithmetic cost, communication cost, and memory footprint. We obtain these results by utilizing the alternative basis and Toom-Graph techniques. The arithmetic costs of the new algorithms are only slightly higher than those of the best previous solutions, while the communication costs and memory footprint are significantly reduced and proven optimal.
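For reference, here is the smallest member of the family, Toom-Cook-2 (Karatsuba), as a minimal Python sketch on arbitrary-precision integers: it replaces one $n$-digit multiplication with three half-size ones. The paper's alternative-basis evaluation, which also reduces communication and memory footprint, is not reflected here.

    # Toom-Cook-2 (Karatsuba): x*y via three recursive half-size products.
    def toom2(x, y, bits=64):
        if x < (1 << bits) or y < (1 << bits):
            return x * y                  # base case: built-in multiplication
        half = max(x.bit_length(), y.bit_length()) // 2
        x1, x0 = x >> half, x & ((1 << half) - 1)   # x = x1*2^half + x0
        y1, y0 = y >> half, y & ((1 << half) - 1)
        lo = toom2(x0, y0, bits)                        # x0*y0
        hi = toom2(x1, y1, bits)                        # x1*y1
        mid = toom2(x0 + x1, y0 + y1, bits) - lo - hi   # x0*y1 + x1*y0
        return (hi << (2 * half)) + (mid << half) + lo

    assert toom2(123456789123456789, 987654321987654321) == \
           123456789123456789 * 987654321987654321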

16:30
Stef Graillat (Sorbonne Université, CNRS, LIP6, Paris, France)
Fabienne Jézéquel (Sorbonne Université, CNRS, LIP6, Paris, France)
Théo Mary (Sorbonne Université, CNRS, LIP6, Paris, France)
Roméo Molina (Sorbonne Université, CNRS, LIP6, Paris, France)
Daichi Mukunoki (--, Japan)
Reduced-Precision and Reduced-Exponent Formats for Adaptive-Precision Sparse Matrix-Vector Product
PRESENTER: Roméo Molina

ABSTRACT. Mixed-precision algorithms aim to take advantage of the performance of low precisions while maintaining the accuracy of high precision. In particular, adaptive-precision algorithms dynamically adapt at runtime the precisions used for different variables or operations. For example, Graillat et al. (2023) have proposed an adaptive-precision sparse matrix-vector product (SpMV) that stores the matrix elements in a precision inversely proportional to their magnitude. In theory, this algorithm can therefore make use of a large number of different precisions, but the practical results previously obtained only achieved high performance using the natively supported double and single precisions. In this work, we combine this algorithm with an efficient memory accessor for custom reduced-precision formats (Mukunoki et al., 2016). This allows us to experiment with a large set of different precision formats with fine variations of the number of bits dedicated to the significand. Moreover, we explore the possibility of reducing the number of bits dedicated to the exponent, using the fact that elements sharing the same precision format are of similar magnitude. We experimentally evaluate the performance of using four or seven different custom formats with reduced precision and possibly reduced exponent, and demonstrate their effectiveness compared with the existing version that only uses double and single precisions.
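A minimal numpy sketch of the adaptive-precision SpMV idea, hedged: elements are bucketed by relative magnitude and each bucket is stored in a lower precision, here limited to the natively supported float64/float32/float16 rather than the paper's custom significand and exponent widths. The thresholds are illustrative.

    import numpy as np
    import scipy.sparse as sp

    def adaptive_spmv(A_coo, x, thresholds=(1e-4, 1e-8)):
        # Relative magnitude of each stored element.
        a = np.abs(A_coo.data) / np.max(np.abs(A_coo.data))
        # Large elements keep full precision; smaller elements need fewer
        # bits to make the same *relative* contribution to the result.
        buckets = [(a >= thresholds[0], np.float64),
                   ((a < thresholds[0]) & (a >= thresholds[1]), np.float32),
                   (a < thresholds[1], np.float16)]
        y = np.zeros(A_coo.shape[0])
        for mask, dtype in buckets:
            # Round the bucket to its storage precision, then accumulate.
            vals = A_coo.data[mask].astype(dtype).astype(np.float64)
            np.add.at(y, A_coo.row[mask], vals * x[A_coo.col[mask]])
        return y

    A = sp.random(500, 500, density=0.02, format="coo")
    x = np.random.randn(500)
    print(np.linalg.norm(adaptive_spmv(A, x) - A @ x))  # small rounding error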

15:30-17:30 Session 4C: Multidisciplinary, Domain-Specific and Applied Parallel and Distributed Computing (I)
Location: -1.A.04
15:30
Jiajun Song (Tsinghua University, China)
Jiajun Luo (Southern University of Science and Technology, China)
Rongwei Lu (Tsinghua University, China)
Shuzhao Xie (Tsinghua University, China)
Bin Chen (Harbin Institute of Technology, Shenzhen, China)
Zhi Wang (Tsinghua University, China)
A Joint Approach to Local Updating and Gradient Compression for Efficient Asynchronous Federated Learning
PRESENTER: Jiajun Song

ABSTRACT. Asynchronous Federated Learning (AFL) confronts inherent challenges arising from the heterogeneity of devices (e.g., their computation capacities) and low-bandwidth environments, both of which can cause stale model updates (e.g., local gradients) for global aggregation. Traditional approaches to mitigating the staleness of updates typically focus on either adjusting the local updating or gradient compression, but not both. Recognizing this gap, we introduce a novel approach that synergizes local updating with gradient compression. Our research begins by examining the interplay between local updating frequency and gradient compression rate, and their collective impact on convergence speed. The theoretical upper bound shows that the local updating frequency and gradient compression rate of each device are jointly determined by its computing power, communication capabilities, and other factors. Building on this foundation, we propose an AFL framework called FedLuck that adaptively optimizes both the local update frequency and the gradient compression rate. Experiments on image classification and speech recognition show that FedLuck reduces communication consumption by 56% and training time by 55% on average, achieving competitive performance in heterogeneous and low-bandwidth scenarios compared to the baselines.
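As a concrete example of the gradient-compression side of this design space, here is a minimal top-k sparsification sketch. The compression ratio and the number of local steps per round are exactly the two knobs FedLuck couples, though its actual policy for choosing them is not reproduced here.

    # Top-k gradient compression: send only the k largest-magnitude entries.
    import numpy as np

    def topk_compress(grad, ratio):
        k = max(1, int(ratio * grad.size))
        idx = np.argpartition(np.abs(grad), -k)[-k:]   # k largest magnitudes
        return idx, grad[idx]                          # transmit k (index, value) pairs

    def topk_decompress(idx, vals, size):
        grad = np.zeros(size)
        grad[idx] = vals                               # all other entries are zero
        return grad

    g = np.random.randn(1_000_000)
    idx, vals = topk_compress(g, ratio=0.01)           # ~100x less traffic
    g_hat = topk_decompress(idx, vals, g.size)
    print(np.linalg.norm(g - g_hat) / np.linalg.norm(g))  # relative compression error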

15:50
Cristian Tatu (Barcelona Supercomputing Center, Spain)
Javier Conejero (Barcelona Supercomputing Center, Spain)
Fernando Vazquez (Barcelona Supercomputing Center, Spain)
Rosa M. Badia (Barcelona Supercomputing Center, Spain)
GPU Cache System for COMPSs: A Task-Based Distributed Computing Framework
PRESENTER: Cristian Tatu

ABSTRACT. In this paper, we propose a novel GPU cache system for COMPSs, a task-based distributed computing framework that enables the execution of parallel applications on heterogeneous clusters. GPU COMPSs tasks can exploit the computational power of NVIDIA GPUs to process large data blocks. However, the current implementation of COMPSs requires each task to write its output data to disk and the subsequent tasks to read them from disk, which introduces significant overhead. To overcome this limitation, we design and implement a GPU cache system that allows tasks to store and retrieve data from the GPU memory, avoiding unnecessary disk operations and reducing data transfer time. We conducted extensive experiments on several benchmarks and demonstrated that our GPU cache system can achieve significant speedups compared to the baseline COMPSs implementation.

16:10
Zhuoyao Huang (State Key Laboratory of Mobile Network and Mobile Multimedia Technology; ZTE Co., Ltd, China)
Nan Zhang (Southern University of Science and Technology (SUSTech), Shenzhen, China)
Jingran Shen (Southern University of Science and Technology (SUSTech), Shenzhen, China)
Georgios Diamantopoulos (Southern University of Science and Technology (SUSTech), Shenzhen, China; University of Birmingham, Birmingham, UK)
Zhengchang Hua (Southern University of Science and Technology (SUSTech), Shenzhen, China; University of Leeds, Leeds, UK)
Nikos Tziritas (University of Thessaly, Volos, Greece)
Georgios Theodoropoulos (SUSTech, China)
Distributed Simulation for Digital Twins of Large-Scale Real-World DiffServ-Based Networks
PRESENTER: Nan Zhang

ABSTRACT. Digital Twin technology facilitates the monitoring and online analysis of large-scale communication networks. Faster predictions of network performance thus become imperative, especially for analysing Quality of Service (QoS) parameters in large-scale city networks. Discrete Event Simulation (DES) is a standard network analysis technology and can be further optimised with parallel and distributed execution for speedup, referred to as Parallel Discrete Event Simulation (PDES). However, modelling detailed QoS mechanisms such as DiffServ requires complex event handling for each network router, which can involve excessive simulation events. In addition, current PDES for network analysis mostly adopts conservative scheduling, which suffers from excessive global synchronisation to avoid causality problems. The performance analysis of optimistic PDES for real-world large-scale network topologies and complex QoS mechanisms is still inadequate. To address these gaps, this paper proposes a simulation toolkit, Quaint, which leverages the optimistic PDES engine ROSS for detailed modelling of DiffServ-based networks. A novel event-handling model for each network router is also proposed to significantly reduce the number of events in complex QoS modelling. Quaint has been evaluated using a real-world metropolitan-scale network topology with 5,000 routers/switches. Results show that, compared to the conventional simulator OMNeT++/INET, even the sequential mode of Quaint can achieve a speedup of 53 times, and the distributed mode a speedup of 232 times. A scalability characterisation is conducted to portray the efficiency of distributed execution, and the results indicate the future direction for workload-aware model partitioning.

16:30
Subhajit Sahu (International Institute of Information Technology Hyderabad, India)
Kishore Kothapalli (International Institute of Information Technology Hyderabad, India)
Hemalatha Eedi (JNTUH College of Engineering Hyderabad, India)
Sathya Peri (Indian Institute of Technology Hyderabad, India)
DF* PageRank: Incrementally Expanding Approaches for Updating PageRank on Dynamic Graphs (Artifact)
PRESENTER: Subhajit Sahu

ABSTRACT. PageRank is a widely used centrality measure that assesses the significance of vertices in a graph. Efficiently updating PageRank on dynamic graphs is essential for various applications due to the increasing scale of datasets. This paper introduces our Dynamic Frontier (DF) and Dynamic Frontier with Pruning (DF-P) approaches. Given a batch update comprising edge insertions and deletions, these approaches iteratively identify vertices likely to change their ranks with minimal overhead. On a server featuring a 64-core AMD EPYC-7742 processor, our approaches outperform Static and Dynamic Traversal PageRank by 5.2x/15.2x and 1.3x/3.5x, respectively, on real-world dynamic graphs, and by 7.2x/9.6x and 4.0x/5.6x on large static graphs with random batch updates. Our approaches scale at a rate of 1.8x/1.7x for every doubling of threads.
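A much-simplified sequential sketch of the frontier principle, hedged: after a batch of edge changes, only potentially affected vertices are reprocessed, and the frontier grows as ranks actually change. The DF and DF-P approaches add pruning and careful parallelisation on top of this idea; this version assumes a graph without dangling vertices.

    # Incremental PageRank: reprocess only the frontier of affected vertices.
    def update_pagerank(out_adj, in_adj, rank, changed, d=0.85, tol=1e-10):
        n = len(out_adj)
        frontier = set(changed)   # endpoints of inserted/deleted edges
        while frontier:
            v = frontier.pop()
            new = (1 - d) / n + d * sum(rank[u] / len(out_adj[u])
                                        for u in in_adj[v])
            if abs(new - rank[v]) > tol:
                rank[v] = new
                frontier.update(out_adj[v])  # v's change may move its successors
        return rank

    out_adj = {0: [1], 1: [2], 2: [0], 3: [0]}
    in_adj = {0: [2, 3], 1: [0], 2: [1], 3: []}
    rank = {v: 0.25 for v in out_adj}            # start from pre-update ranks
    rank = update_pagerank(out_adj, in_adj, rank, changed=[0, 3])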

16:50
Tiago Carneiro (IMEC, Belgium)
Engin Kayraklioglu (HPE, United States)
Guillaume Helbecque (University of Lille, France)
Nouredine Melab (INRIA Lille, France)
Investigating Portability in Chapel for Tree-based Optimization on GPU-powered Clusters
PRESENTER: Tiago Carneiro

ABSTRACT. The Top500 list features supercomputers powered by accelerators from different vendors. This variety brings, along with the heterogeneity challenge, both code and performance portability challenges. In this context, Chapel's native GPU support comes as a solution for code portability between different vendors. In this paper, we investigate the viability of using the Chapel high-productivity language as a tool to achieve both code and performance portability in large-scale tree-based search. As a case study, we implemented a distributed backtracking algorithm for solving permutation combinatorial problems. Extensive experiments conducted on large N-Queens problem instances, using up to 512 NVIDIA GPUs and 1024 AMD GPUs on Top500 supercomputers, reveal that it is possible to scale on the two different systems using the same tree-based search written in Chapel. This portability comes at a performance decrease of less than 10% for the biggest problem instances.
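For reference, this is the kind of permutation tree search at the heart of the case study, shown as a minimal sequential N-Queens counter in Python; the paper implements this search in Chapel and distributes subtrees of the backtracking tree across GPUs.

    # Bitmask backtracking: count placements of n non-attacking queens.
    def count_nqueens(n, row=0, cols=0, diag1=0, diag2=0):
        if row == n:
            return 1
        total = 0
        free = ~(cols | diag1 | diag2) & ((1 << n) - 1)  # safe columns in this row
        while free:
            bit = free & -free            # lowest safe column
            free ^= bit
            total += count_nqueens(n, row + 1,
                                   cols | bit,
                                   (diag1 | bit) << 1,   # diagonals shift per row
                                   (diag2 | bit) >> 1)
        return total

    print(count_nqueens(8))  # 92 solutions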

15:30-17:30 Session 4D: Data analytics, AI, and Computational Science (I)
Location: -1.A.06
15:30
Krishna Teja Chitty-Venkata (Argonne National Laboratory, United States)
Sanjif Shanmugavelu (Groq Inc., UK)
Varuni Katti Sastry (Argonne National Laboratory, United States)
Murali Emani (Argonne National Laboratory, United States)
Venkatram Vishwanath (Argonne National Laboratory, United States)
Sylvia Howland (Cerebras Systems, United States)
WActiGrad: Structured Pruning for Efficient Finetuning and Inference of Large Language Models on AI Accelerators
PRESENTER: Murali Emani

ABSTRACT. Large Language Models (LLMs), such as the Generative Pre-trained Transformers (GPTs), have shown remarkable performance across various language processing applications. Nevertheless, their extensive computational requirements can hinder their deployment in real-time applications or resource-constrained environments. Pruning is a powerful technique for reducing model size and making models computationally efficient on hardware. In this paper, we propose a structured pruning algorithm, Weight Activation and Gradient (WActiGrad) pruning, to obtain smaller LLMs from large pre-trained models. We first investigate the level of granularity at which structured pruning techniques can be applied to a transformer. Next, we identify the challenges associated with applying these techniques across different parts of the transformer architecture. Finally, based on these observations, we develop a pruning methodology that is adaptable to various attention and feedforward network modules. We comprehensively assess our WActiGrad method on state-of-the-art LLMs, LLaMA (7B and 13B), LLaMA-2 (7B and 13B), and Mistral-7B, across several language benchmarks. We show that we can prune close to 20% of the original model size without compromising model validation accuracy. We evaluate the hardware performance of our structurally pruned LLMs on different AI accelerators, such as the Nvidia A100 GPU, Groq LPU, Cerebras CS-2, and Graphcore Bow Pod64, to show the effectiveness of the structured pruning technique. The findings presented in this paper offer insights into integrating structured pruning techniques and deploying them on AI accelerators.
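As a hedged sketch of the general family WActiGrad belongs to (the paper's actual scoring rule and its placement across attention and feedforward modules may well differ), the following numpy snippet scores each output channel of a weight matrix from weight and gradient magnitudes weighted by typical activation magnitudes, then keeps the top fraction:

    # Structured pruning: drop whole output channels with the lowest saliency.
    import numpy as np

    def prune_channels(W, grad_W, act_norm, keep_ratio=0.8):
        # W, grad_W: (out_channels, in_features); act_norm: (in_features,)
        # Saliency per output channel from |weight * gradient|, weighted by
        # the typical magnitude of the incoming activations (illustrative rule).
        saliency = np.abs(W * grad_W) @ act_norm
        k = int(keep_ratio * W.shape[0])
        keep = np.sort(np.argsort(saliency)[-k:])   # keep original channel order
        return W[keep], keep

    W = np.random.randn(128, 512)
    grad_W = np.random.randn(128, 512)
    act_norm = np.abs(np.random.randn(512))
    W_pruned, kept = prune_channels(W, grad_W, act_norm)  # 20% of channels removed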

15:50
Hewang Nie (School of Cyber Science and Engineering, Huazhong University of Science and Technology, China)
Songfeng Lu (School of Cyber Science and Engineering, Huazhong University of Science and Technology, China)
Mu Wang (Clinical Research Institute, Affiliated South China Hospital, University of South China, China)
Jue Xiao (School of Cyber Science and Engineering, Huazhong University of Science and Technology, China)
Zhi Lu (School of Cyber Science and Engineering, Huazhong University of Science and Technology, China)
Zepu Yi (School of Cyber and Space Security, Huazhong University of Science and Technology, China)
VeriChroma: Ownership Verification for Federated Models via RGB Filters
PRESENTER: Hewang Nie

ABSTRACT. Big data has significantly propelled the advancement of artificial intelligence (AI), notably in deep learning domains. Yet, the resource-intensive nature of training deep neural networks (DNNs) underscores the critical need for model protection and ownership assertion. Although neural network model watermarking offers a solution, its applicability is limited in federated learning scenarios. This paper introduces VeriChroma, a pioneering framework designed to safeguard DNN models and establish ownership in these contexts. VeriChroma allows individual clients to independently embed and verify private ID-based watermarks, facilitating straightforward ownership claims. It innovatively addresses client constraint conflicts through image blocking and position mapping, guaranteeing unique watermark integration for each participant. Additionally, VeriChroma employs RGB filters to create watermark triggers, enhancing both robustness and secrecy. Our experimental results validate VeriChroma's efficacy and practicality, demonstrating its superior capability in securing DNN model ownership, mitigating federated learning disputes, and providing robust, discreet watermarking. Ultimately, VeriChroma represents a significant stride toward advanced security and intellectual property protection in federated learning environments.

16:10
Yuxiang Zhang (SKLP, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, China)
Xin Liu (SKLP, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, China)
Meng Wu (SKLP, Institute of Computing Technology, Chinese Academy of Sciences, China)
Mingyu Yan (SKLP, Institute of Computing Technology, Chinese Academy of Sciences, China)
Wei Yan (SKLP, Institute of Computing Technology, Chinese Academy of Sciences; Zhongguancun Laboratory, China)
Xiaochun Ye (SKLP, Institute of Computing Technology, Chinese Academy of Sciences, China)
Dongrui Fan (SKLP, Institute of Computing Technology, Chinese Academy of Sciences, China)
Disttack: Graph Adversarial Attacks Toward Distributed GNN Training
PRESENTER: Yuxiang Zhang

ABSTRACT. Graph Neural Networks (GNNs) have emerged as potent models for graph learning. Distributing the training process across multiple computing nodes is the most promising solution to address the challenges of ever-growing real-world graphs. However, current adversarial attack methods on GNNs neglect the characteristics and applications of the distributed scenario, leading to suboptimal performance and inefficiency in attacking distributed GNN training.

In this study, we introduce Disttack, the first framework of adversarial attacks for distributed GNN training, which leverages the characteristics of frequent gradient updates in a distributed system. Specifically, Disttack corrupts distributed GNN training by injecting adversarial attacks into one single computing node. The attacked subgraphs are precisely perturbed to induce an abnormal gradient ascent in backpropagation, disrupting gradient synchronization between computing nodes and thus leading to a significant performance decline of the trained GNN. We evaluate Disttack on four large real-world graphs by attacking five widely adopted GNNs. Compared with the state-of-the-art attack method, experimental results demonstrate that Disttack amplifies model accuracy degradation by 2.75x and achieves a 17.33x speedup on average while remaining unnoticeable.

16:30
Pranjal Naman (Indian Institute of Science, India)
Yogesh Simmhan (Indian Institute of Science, India)
Optimizing Federated Learning using Remote Embeddings for Graph Neural Networks
PRESENTER: Pranjal Naman

ABSTRACT. Graph neural networks (GNNs) have experienced rapid advancements in recent years due to their ability to learn meaningful representations from graph data structures. Federated Learning (FL) has emerged as a viable machine learning approach for training a shared model on decentralized data, addressing privacy concerns while leveraging parallelism. Existing methods that address the unique requirements of federated GNN training by using remote embeddings to enhance convergence accuracy are limited by the large communication costs these embeddings incur. In this paper, we present OpES, an optimized federated GNN training framework that uses remote neighbourhood pruning and overlaps the pushing of embeddings with local training to reduce the network and training costs. The modest drop in per-round accuracy is outstripped by the reduction in per-round training time for large and dense graphs like Reddit and Products, achieving up to ∼2× faster convergence than state-of-the-art techniques using embedding servers and up to 20% better accuracy than vanilla federated GNN learning.

16:50
Guangyao Zhou (University of Electronic Science and Technology of China, China)
Haocheng Lan (University of Electronic Science and Technology of China, China)
Yuanlun Xie (University of Electronic Science and Technology of China, China)
Wenhong Tian (University of Electronic Science and Technology of China, China)
Jiahong Qian (Huawei Technologies Co. Ltd, China)
Teng Su (Huawei Technologies Co. Ltd, China)
CSIMD: Cross-Search Algorithm with Improved Multi-Dimensional Dichotomy for Micro-batch-based Pipeline Parallel Training in DNN
PRESENTER: Guangyao Zhou

ABSTRACT. Parallel training of large-scale networks has attracted the attention of both artificial intelligence and high-performance distributed systems. One of efficient parallelism is the micro-batch-based pipeline, e.g., GPipe. Based on the GPipe, we derive a time-cost model with the basic time function of layers, which considers computing time and communication time simultaneously as well as treats these time as nonlinear to batch size. Focusing on the optimal solutions of network division and data partition, we propose a Cross-Search algorithm with Improved Multi-dimensional Dichotomy (CSIMD). Through theoretical derivation, we prove IMD has appreciable theoretical optimality. Also extensive experiments on both CNN- and Transformer-based networks demonstrate our proposed CSIMD can obtain optimal network division and data partition schemes under GPipe parallelism: CSIMD achieves training speeds respectively $2.0\times$ and $2.5\times$ faster than GPipe-R and GPipe-E in CNNs; as well as $1.5\times$ and $1.6\times$ in Transformers.