EURO-PAR 2024: 30TH INTERNATIONAL EUROPEAN CONFERENCE ON PARALLEL AND DISTRIBUTED COMPUTING
PROGRAM FOR THURSDAY, AUGUST 29TH

09:00-10:00 Session 6: Keynote: Franck Cappello

AuroraGPT: Rationale, Challenges and Development of an AI Research Assistant

Innovative methods, new instruments, disruptive techniques, and groundbreaking technologies have led to significant leaps in scientific progress. The increasingly powerful Large Language Models (LLMs) released each month have already sped up research activities such as concept explanation, literature search, and summarization. The transformative potential of AI, in particular foundation models, in research activities raises important questions about their performance in science activities, their potential application in different contexts, and their ethics. In this talk, I will first explore the notion of AI research assistants and then discuss the gap between an ideal AI research assistant and the current LLMs, focusing on HPC and parallel computing research problems. The gap motivates the development of research-oriented LLMs. AuroraGPT is developed as an open foundation model trained specifically with scientific data to explore solutions toward the realization of effective AI research assistants. I will describe the activity, challenges, and progress of the different groups developing the key aspects of AuroraGPT. I will particularly focus on the critical and difficult task of evaluating LLMs' scientific skills, safety, and trustworthiness.

Location: Auditorium
10:00-10:30 Coffee Break
10:30-12:30 Session 7A: Architectures and Accelerators (II)
Location: -1.A.04
10:30
FakeGuard: A Novel Accelerator Architecture for Deepfake Detection Networks
PRESENTER: Xingbin Wang

ABSTRACT. Deepfake technologies are evolving rapidly, and the malicious or illegal uses of deepfakes pose serious threats in societal, political, and business fields. To tackle such threats, the latest and most effective detection methods combine the advantages of Vision Transformers and CNNs. However, deploying deepfake detection networks on existing CNN and transformer accelerators faces several critical issues, such as inefficiency and underutilization. We propose FakeGuard, a novel accelerator architecture for deepfake detection networks that enables cost-effective, high-performance hybrid network execution. It features a flexible computing engine with a reconfigurable adder tree to support different computing patterns within a single hardware architecture. A fine-grained, dependency-free task scheduling mechanism is designed to maximize hardware resource utilization. Extensive experiments show that FakeGuard surpasses state-of-the-art accelerators.

10:50
ImSPU: Implicit Sharing of Computation Resources between Vector and Scalar Processing Units
PRESENTER: Hongbing Tan

ABSTRACT. SIMD instruction set extensions are widely adopted in modern processors, leveraging data-level parallelism to provide significant performance speedups. However, the throughput of SIMD instructions depends on the number of vector processing units, and the scalar processing units are underutilized when performing SIMD operations. This paper presents ImSPU, a SIMD processor architecture that enables implicit sharing of computation resources between scalar and vector processing units. This architecture effectively leverages the scalar processing unit for SIMD operations, thereby boosting parallel computing capabilities. Unlike explicit sharing, which requires the issuance of scalar instructions, implicit sharing remains transparent to software developers and compilers, eliminating the need for code modifications. Furthermore, implicit sharing requires only minor modifications to the processor architecture and is straightforward to implement in hardware. In comparison to traditional SIMD processors, the proposed implicit-sharing architecture achieves substantial performance gains while incurring minor hardware overhead.

11:10
ADE-HGNN: Accelerating HGNNs through Attention Disparity Exploitation
PRESENTER: Dengke Han

ABSTRACT. Heterogeneous Graph Neural Networks (HGNNs) have recently demonstrated great power in handling heterogeneous graph data, rendering them widely applied in many critical real-world domains. Most HGNN models leverage attention mechanisms to significantly improve model accuracy, albeit at the cost of increased computational complexity and memory bandwidth requirements. Fortunately, the attention disparity from source vertices towards a common target vertex unveils an opportunity to boost the model execution performance by pruning unimportant source vertices during neighbor aggregation.

In this study, we commence with a quantitative analysis of the attention disparity in HGNN models, where the importance of different source vertices varies and exhibits a highly concentrated distribution for the same target vertex. To fully exploit this finding, we propose a runtime pruning method based on a min-heap and map it to a dedicated hardware pruner to discard unimportant vertices. Given that the pruning overhead itself is significant and cannot be amortized by the conventional staged execution paradigm, an operation-fusion execution flow of HGNNs is introduced to overlap the pruning overhead while harnessing inter-stage parallelism. Finally, we present the design of a novel HGNN accelerator, ADE-HGNN, tailored to support the proposed execution framework. Our experimental results demonstrate that ADE-HGNN achieves an average performance improvement of 28.21x over the NVIDIA GPU T4 platform and 7.98x over the advanced GPU A100, with the inference accuracy loss kept within a negligible range of 0.11%~1.47%. Furthermore, ADE-HGNN significantly reduces energy consumption to 1.97% and 5.37% of the two platforms, respectively.
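
For illustration, the min-heap pruning idea can be sketched in a few lines of Python (a generic top-k selection over attention scores with made-up values, not the dedicated hardware pruner described in the paper):

import heapq

def prune_neighbors(attention_scores, k):
    # Keep only the k source vertices with the highest attention towards
    # a target vertex, using a bounded min-heap (O(n log k)).
    heap = []  # min-heap of (score, source_vertex_id)
    for src, score in attention_scores:
        if len(heap) < k:
            heapq.heappush(heap, (score, src))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, src))
    return [src for _, src in heap]

# Hypothetical neighborhood: six source vertices, keep the three most important.
scores = [("v1", 0.42), ("v2", 0.03), ("v3", 0.31), ("v4", 0.01), ("v5", 0.18), ("v6", 0.05)]
print(prune_neighbors(scores, k=3))  # the three highest-attention neighbors (heap-internal order)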

11:30
Fault tolerance in the Expand Ad-Hoc parallel file system (Artifact)

ABSTRACT. In recent years, applications related to Artificial Intelligence and big data, among others, have evolved considerably. There is a need to improve I/O operations to avoid bottlenecks in accessing ever larger amounts of data. For this purpose, the Expand Ad-Hoc parallel file system is being designed and developed. Since these applications have very long execution times, fault tolerance mechanisms in the file system are necessary to allow them to continue running in the presence of failures. This work introduces a fault-tolerant design based on data replication for the Expand Ad-Hoc parallel file system and an initial evaluation conducted on the HPC4AI Laboratory supercomputer in Torino. The evaluation of Expand Ad-Hoc with fault tolerance shows that, despite data replication, its performance and scalability are generally better than those of other parallel file systems without fault tolerance.

11:50
Parallel Writing of Nested Data in Columnar Formats (Artifact)
PRESENTER: Jonas Hahnfeld

ABSTRACT. High Energy Physics (HEP) experiments, for example at the Large Hadron Collider (LHC) at CERN, store data at exabyte scale in sets of files. They use a binary columnar data format provided by the ROOT framework, which also transparently compresses the data. In this format, cells are not necessarily atomic but may contain nested collections of variable size. The fact that row and block sizes are not known upfront makes it challenging to implement efficient parallel writing. In particular, the data cannot be organized in a regular grid where it is possible to precompute indices and offsets for independent writing. In this paper, we propose a scalable approach to efficient multithreaded writing of nested data in columnar format into a single file. Our approach removes the bottleneck of a single writer while staying fully compatible with the compressed, columnar, variably row-sized data representation. We discuss our design choices and the implementation of scalable parallel writing for ROOT's RNTuple format. An evaluation of our approach shows perfect scalability, limited only by storage bandwidth, for a synthetic benchmark. Finally, we evaluate the benefits for a real-world application of dataset skimming.
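
To make the challenge concrete, the following minimal Python sketch (illustrative only, not ROOT's actual on-disk layout) shows how a column of variable-length lists flattens into a value buffer plus an offset column, which is why write positions cannot be precomputed before the rows are known:

def to_columnar(rows):
    # Flatten a column of variable-length lists into a contiguous
    # value buffer plus an offset column (offsets[i] = end of row i).
    values, offsets, end = [], [], 0
    for row in rows:
        values.extend(row)
        end += len(row)
        offsets.append(end)
    return values, offsets

# Hypothetical events, each holding a variable number of measurements.
values, offsets = to_columnar([[1.1, 2.2], [], [3.3, 4.4, 5.5]])
print(values)   # [1.1, 2.2, 3.3, 4.4, 5.5]
print(offsets)  # [2, 2, 5] -- only known once the rows have been seen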

10:30-12:30 Session 7B: Data analytics, AI, and Computational Science (II)
Location: -1.A.06
10:30
Improving Generalization and Personalization in Long-Tailed Federated Learning via Classifier Retraining
PRESENTER: Yuhang Li

ABSTRACT. Extensive research has been dedicated to studying the substantial challenge posed by non-IID data, which hinders the performance of federated learning (FL), a popular distributed learning paradigm. However, a notable challenge encountered by current FL algorithms in real-world applications is the presence of long-tailed data distributions. This issue often results in inadequate model accuracy when dealing with rare but crucial classes in classification tasks. To cope with this, recent studies have proposed various classifier retraining (CR) approaches. Though effective, these works lack a deep understanding of how such methods affect the classifier's performance. In this work, we first present a systematic study informed by mutual information indicators in FL. Based on this study, we propose a novel and effective CR method for FL scenarios, coined CRFDC, to address non-IID and long-tailed data challenges. Extensive experiments on standard FL benchmarks show that CRFDC can improve the model accuracy by up to 8.16% in generalization and 10.02% in personalization, as compared to the state-of-the-art approaches.

10:50
FLUK: Protecting Federated Learning against Malicious Clients for Internet of Vehicles
PRESENTER: Mengde Zhu

ABSTRACT. Federated Learning (FL) is a distributed machine learning paradigm, which has recently been applied in Internet of Vehicles (IoVs), forming FL-IoV networks. In such real-world scenarios, poisoning attacks emerge as a non-negligible issue, in which malicious clients send corrupted updates to the central server to sabotage overall model performance. Numerous existing defenses against poisoning attacks fail under the IoV setting, where data is strictly restricted on board, and are unable to detect hybrid attacks. To address these concerns, we present FLUK (protecting Federated Learning Utilizing Kullback-Leibler divergence), a framework that defends against poisoning attacks in the FL-IoV setting by detecting malicious clients. Our key insight is that existing attacks produce malicious local updates that deviate from those of benign clients, resulting in a different distribution among these updates. The difference can be reflected by the Kullback-Leibler divergence between client updates in a single round and also between rounds. Specifically, we use a 2D KL divergence detection method combined with a cumulative reputation module to detect malicious clients. Experiments on FL-IoV tasks show that our method can achieve a detection accuracy of up to 98% and 96% under different single attacks and hybrid attacks, respectively. Our implementation of FLUK on autonomous delivery vehicles shows its effectiveness in real-world scenarios.
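
A rough Python sketch of the kind of divergence test such a detector can build on (a simplified, hypothetical single-round check against the median update distribution, not the paper's 2D KL method or reputation module; names and the threshold are assumptions):

import numpy as np

def kl(p, q, eps=1e-12):
    # KL divergence between two discrete distributions.
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def flag_suspicious(updates, bins=20, threshold=0.5):
    # Histogram each client's update vector, compare it against the
    # element-wise median histogram, and flag large divergences.
    lo = min(u.min() for u in updates)
    hi = max(u.max() for u in updates)
    hists = []
    for u in updates:
        h, _ = np.histogram(u, bins=bins, range=(lo, hi))
        hists.append(h / h.sum())
    ref = np.median(np.stack(hists), axis=0)
    ref = ref / ref.sum()
    return [i for i, h in enumerate(hists) if kl(h, ref) > threshold]

# Hypothetical round: the last client sends a shifted (poisoned) update.
rng = np.random.default_rng(0)
updates = [rng.normal(0, 1, 1000) for _ in range(4)] + [rng.normal(5, 1, 1000)]
print(flag_suspicious(updates))  # expected to flag index 4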

11:10
GDL-GNN: Applying GPU Dataloading of Large Datasets for Graph Neural Network Inference
PRESENTER: Haoran Dang

ABSTRACT. Graph neural networks (GNNs) have emerged as a popular choice for analyzing structured data organized as graphs. Nevertheless, GNN models tend to be shallow, failing to fully exploit the capabilities of modern GPUs. Our motivational tests reveal that GPU dataloading for GNN inference yields remarkable performance enhancements when both the graph topology and features reside in GPU memory. Unfortunately, the use of this approach is hindered by the large size of real-world graph datasets. To address this limitation, we introduce GDL-GNN, a partition-based method that incorporates all essential information for inference within each subgraph. It thus combines the efficiency of GPU dataloading with layerwise inference, while maintaining the accuracy of full-neighbor inference. Additional optimization enables GDL-GNN to avoid unnecessary representation computation on halo nodes and to conceal file loading time. Evaluation shows the effectiveness of GDL-GNN in both single- and multi-GPU scenarios, revealing a reduction in inference time of up to 59.9% without compromising accuracy.

11:30
Quartet: A Holistic Hybrid Parallel Framework for Training Large Language Models
PRESENTER: Weigang Zhang

ABSTRACT. Hybrid parallelism is popular in training large language models. However, existing efforts have focused on optimizing individual strategies in hybrid parallelism, such as pipeline scheduling, device assignment, etc., which limits the overall training efficiency. This paper explores the intricate dependencies among these strategies and proposes Quartet, a holistic hybrid parallel framework for joint optimization. The novelty lies in the formulation of parameterized pipeline scheduling and device assignment, as well as analyzing the impact of model scaling on training throughput for the first time. This provides the basis for orchestrating four strategies (model scaling, model splitting, pipeline scheduling, and device assignment) efficiently within a unified framework to maximize the overall training throughput. Evaluation results show that, for Transformer-based language models (i.e., BERT and GPT-2 models), Quartet improves the training throughput by up to 2.16x over the state-of-the-art synchronous hybrid parallel approaches.

11:50
Inference with Transformer Encoders on ARM and RISC-V Multicore Processors

ABSTRACT. We delve into the performance of transformer encoder inference on low-power multi-core processors from two perspectives: First, we conduct a detailed profile of the inference process for two members of the BERT family on a modern multi-core processor, identifying the main bottlenecks and opportunities for improvement. Second, we propose a number of accumulative optimisations for their primary building blocks. For that, we develop our own implementation of the general matrix multiplication (GEMM), which dynamically tunes several key parameters, yielding relevant performance gains for transformer encoders. Additionally, we introduce a number of strategies to also improve the parallel execution of the transformer block.

Our implementations for ARMv8a and RISC-V multi-core processors with SIMD units, taking as a reference state-of-the-art GEMM implementations (BLIS for ARM and OpenBLAS for RISC-V), reveal accelerations of up to 2.5× for natural language processing tasks.
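
As background, the cache-blocking structure that such tuning operates on can be sketched in Python with NumPy (the block sizes here are arbitrary placeholders; production micro-kernels are written with SVE/RVV intrinsics or assembly):

import numpy as np

def blocked_gemm(A, B, mc=64, kc=64, nc=64):
    # C = A @ B with a three-level blocked loop nest; mc/kc/nc are the
    # cache-blocking parameters that a tuned implementation would adapt
    # to the target's cache hierarchy and SIMD width.
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n), dtype=A.dtype)
    for jc in range(0, n, nc):            # panels of B and C
        for pc in range(0, k, kc):        # panels along the shared dimension
            for ic in range(0, m, mc):    # panels of A and C
                C[ic:ic+mc, jc:jc+nc] += A[ic:ic+mc, pc:pc+kc] @ B[pc:pc+kc, jc:jc+nc]
    return C

A = np.random.rand(200, 150).astype(np.float32)
B = np.random.rand(150, 120).astype(np.float32)
assert np.allclose(blocked_gemm(A, B), A @ B, atol=1e-3)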

10:30-12:30 Session 7C: Multidisciplinary, Domain-Specific and Applied Parallel and Distributed Computing (II)
Location: Auditorium
10:30
MPR: An MPI Framework for Distributed Self-Adaptive Stream Processing
PRESENTER: Júnior Löff

ABSTRACT. Stream processing systems must often cope with workloads varying in content, format, size, and input rate. The high variability and unpredictability make statically fine-tuning them very challenging. Our work addresses this limitation by providing a new framework and runtime system to simplify implementing and assessing new self-adaptive algorithms and optimizations. We implement a prototype on top of MPI called MPR and show its functionality. We focus on horizontal scaling by supporting the addition and removal of processes during execution time. Experiments reveal that MPR can achieve performance similar to that of a handwritten static MPI application. We also assess MPR’s adaptation capabilities, showing that it can readily re-configure itself, with the help of a self-adaptive algorithm, in response to workload variations.

10:50
TaroRTL: Accelerating RTL Simulation using Coroutine-based Heterogeneous Task Graph Scheduling
PRESENTER: Tsung-Wei Huang

ABSTRACT. RTL simulation is critical for validating hardware designs. However, RTL simulation can be time-consuming for large designs. Existing RTL simulators have leveraged task graph parallelism to accelerate simulation on a CPU- and/or GPU-parallel architecture. Despite the improved performance, they all assume atomic execution per task and do not exploit multitasking, which can bring significant performance advantages. As a result, we introduce TaroRTL, a coroutine-based task graph scheduler for efficient RTL simulation. TaroRTL enables non-blocking GPU and I/O tasks within a task graph, ensuring that threads are not blocked waiting for GPU or I/O tasks to finish. It also incorporates a coroutine-aware work-stealing algorithm to avoid unnecessary context switches. Compared to a state-of-the-art GPU-accelerated RTL simulator, TaroRTL can further achieve 40–80% speed-up while using fewer CPU resources to simulate large industrial designs.

11:10
Combining Compression and Prefetching to Improve Checkpointing for Inverse Seismic Problems in GPUs
PRESENTER: Thiago Maltempi

ABSTRACT. Inverse problems are crucial in various scientific and engineering fields requiring intricate mathematical and computational modeling. An example of such a problem is Full Waveform Inversion (FWI), which is used in a number of geophysical applications like oil reservoir discovery. Central to solving FWI is Reverse Time Migration (RTM), a geophysical algorithm for high-resolution subsurface imaging from seismic data that poses considerable computational challenges due to its extensive memory and computation demands. A typical approach to address the memory constraints of RTM includes decomposing the processing tasks across multiple GPUs, checkpointing the intermediate results, and rematerializing the computation from checkpoints when needed. This paper introduces a novel checkpoint prefetching mechanism called GPUZIP. It combines Revolve, a well-known checkpointing algorithm, with GPU-based data compression to improve checkpoint memory utilization. GPUZIP was designed to allow the flexible utilization of different compression algorithms and target applications. Experimental results show that the combination of prefetching and GPU data compression enabled by GPUZIP significantly improves the computation-to-communication ratio for the RTM application. Speed-ups of up to 3.90x and a remarkable 80x reduction in Host-to-Device data transfers have been achieved when running a well-known geophysics benchmark. The proposed approach mitigates the computational challenges of RTM and suggests that similar applicability and performance improvements could also be achieved in other scientific computing fields.

11:30
Cloud-native GPU-enabled architecture for parallel video encoding

ABSTRACT. Multimedia streaming has become an essential aspect of contemporary life, and the ever-growing demand for high-quality streaming has fostered the development of new video codecs and improvements in content delivery. Cloud computing, particularly cloud architectures, has played a pivotal role in this evolution, offering dynamic resource allocation, parallel execution, and automatic scaling—critical features for HTTP Adaptive Streaming (HAS) applications. This paper presents two specialized containers designed for video encoding, using two implementations of H264: x264, which encodes on the CPU, and H264 NVENC, which also uses the GPU. These containers are deployed on a Kubernetes cluster with four GPUs. The experiments focus on the performance and resource consumption of the encoder containers under different Kubernetes cluster and replica configurations. The best setup shows a 12.7% reduction in encoding time for x264 and a 15.98% reduction for H264 NVENC compared to the other configurations considered. Besides, the encoding time of H264 NVENC is reduced by a factor of 3.29 compared to x264. To test the behavior in realistic scenarios, four videos were encoded at five different resolutions. The mean encoding time per segment is reduced by a factor of 3.75 when using H264 NVENC compared to x264. These results hold significant implications for live-streaming applications, particularly for low-latency use cases.

11:50
VLASPH: Smoothed Particle Hydrodynamics on VLA SIMD Architectures
PRESENTER: Xiaokang Fan

ABSTRACT. In the domain of Computational Fluid Dynamics (CFD), Smoothed Particle Hydrodynamics (SPH) has gained increasing popularity compared with traditional grid-based methods. We have seen an increasing demand to accelerate the SPH method with new architectures. The Vector Length Agnostic (VLA) SIMD architecture emerges as a promising solution, adopted in ARM and RISC-V for high-performance computing. This paper introduces vlasph, a Smoothed Particle Hydrodynamics implementation for VLA SIMD architectures. vlasph addresses challenges associated with non-contiguous memory data access, SIMD type conversion, and underutilization of SIMD registers. Our comprehensive evaluations across vector lengths ranging from 128 to 2048 bits demonstrate that vlasph achieves significant speedups, with performance gains of up to 4.81x for 2048-bit vectors.

10:30-12:30 Session 7D: Scheduling, Resource Management, Cloud, Edge Computing, and Workflows (I)
Location: -1.A.01
10:30
Solving the Restricted Assignment Problem to Schedule Multi-Get Requests in Key-Value Stores (Artifact)
PRESENTER: Anthony Dugois

ABSTRACT. Modern distributed key-value stores, such as Apache Cassandra, enhance performance through multi-get requests, minimizing network round-trips between the client and the database. However, partitioning these requests for appropriate storage server distribution is non-trivial and may result in imbalances. This study addresses this optimization challenge as the Restricted Assignment problem on Intervals (RAI). We propose an efficient (2-1/m)-approximation algorithm, where m is the number of machines. Then, we generalize the problem to the Restricted Assignment problem on Circular Intervals (RACI), matching key-value store implementations, and we present an optimal O(n log n) algorithm for RACI with fixed machines and unitary jobs. Additionally, we obtain a (4-2/m)-approximation for arbitrary jobs and introduce new heuristics, whose solutions are very close to the optimal in practice. Finally, we show that optimizing multi-get requests individually also leads to global improvements, increasing achieved throughput by 27%-34% in realistic cases compared to a state-of-the-art strategy.
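
For readers unfamiliar with restricted assignment, a simple greedy baseline (not the approximation algorithm or heuristics proposed in the paper) assigns each key to the least-loaded replica allowed to serve it; the Python sketch below uses made-up sizes and replica sets:

def greedy_restricted_assignment(jobs, num_machines):
    # Assign each job (size, allowed_machines) to the currently
    # least-loaded machine among those permitted to run it.
    load = [0.0] * num_machines
    assignment = {}
    for job_id, (size, allowed) in enumerate(jobs):
        target = min(allowed, key=lambda m: load[m])
        load[target] += size
        assignment[job_id] = target
    return assignment, max(load)

# Hypothetical multi-get: each key is replicated on a (circular) interval of servers.
jobs = [(3, [0, 1]), (2, [1, 2]), (4, [0, 1]), (1, [2, 3]), (2, [3, 0])]
assignment, makespan = greedy_restricted_assignment(jobs, 4)
print(assignment, makespan)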

10:50
Resource-Aware Heterogeneous Federated Learning with Specialized Local Models
PRESENTER: Sixing Yu

ABSTRACT. Federated Learning (FL) is extensively used to train AI/ML models in distributed and privacy-preserving settings. Participant edge devices in FL systems typically contain non-independent and identically distributed (Non-IID) private data and unevenly distributed computational resources. Preserving user data privacy while optimizing AI/ML models in a heterogeneous federated network requires us to address data and system/resource heterogeneity. To address these challenges, we propose Resource-aware Federated Learning (RaFL). RaFL allocates resource-aware specialized models to edge devices using Neural Architecture Search (NAS) and allows heterogeneous model architecture deployment by knowledge extraction and fusion. Combining NAS and FL enables on-demand customized model deployment for resource-diverse edge devices. Furthermore, we propose a multi-model architecture fusion scheme allowing the aggregation of the distributed learning results. Results demonstrate RaFL's superior resource efficiency compared to SoTA.

11:10
Makespan Minimization for Scheduling on Heterogeneous Platforms with Precedence Constraints

ABSTRACT. Taking a good decision about the assignment of each job becomes crucial in the era of hybrid and heterogeneous computing systems, such as personal machines equipped with CPUs and GPUs, or HPC platforms composed of multiple generations of processors. In this study, we focus on the fundamental makespan minimization problem of scheduling jobs subject to precedence constraints on a platform composed of q different families of machines. Each family consists of identical parallel machines. The processing time of a job depends on the family where it is allocated. We propose an algorithm that guarantees an approximation ratio of q+1+2*sqrt(q-1), which improves upon the existing upper bound of q(q+1). In particular, this algorithm achieves a ratio of 5 in the special case of a platform composed only of CPUs and GPUs. This specific scenario with q=2 families of machines has been widely studied by the scientific community in recent years. The best lower and upper bounds known so far were 3 and 5.83, respectively.

11:30
Deadline-driven Enhancements and Response Time Analysis of ROS2 Multi-threaded Executors
PRESENTER: Zhengda Wu

ABSTRACT. The second-generation Robot Operating System, ROS2, has gained significant interest due to its enhanced real-time capabilities. Although some studies have been conducted on ROS2 multi-threaded executors to improve real-time capabilities, the existing scheduling schemes of multi-threaded executors may incur high latency for low-priority chains and priority inversion. Therefore, it remains a challenging problem to meet the real-time constraints of chain instances for multi-threaded executors. In this paper, we design a Chain-Instance-Level Earliest Deadline First (CIL-EDF) scheduling scheme for multi-threaded executors and present a comprehensive response time analytical model (RTAM) for the proposed scheduler. In this scheduler, callback instances are prioritized based on the deadlines of their corresponding chain instances to meet the real-time constraints of chain instances. Besides, the proposed RTAM is capable of analyzing chains with both arbitrary and constrained deadlines, as well as accounting for the impact of mutually-exclusive callback groups. To validate the scheduler and analytical model, we conduct a series of case studies. The results demonstrate that CIL-EDF outperforms the default and Chain-Level scheduling schemes of multi-threaded executors in terms of worst-case response time and schedulability, and that the RTAM can safely upper-bound the response times of all chain instances that are schedulable in the system.
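
The deadline-ordered dispatching at the heart of an EDF scheme can be sketched with a priority queue; the Python fragment below is a generic illustration with invented chain/callback names, not the CIL-EDF executor itself:

import heapq

class EdfQueue:
    # Ready queue ordered by absolute deadline: callback instances
    # inherit the deadline of the chain instance they belong to.
    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker so equal deadlines stay FIFO

    def push(self, deadline, callback):
        heapq.heappush(self._heap, (deadline, self._seq, callback))
        self._seq += 1

    def pop(self):
        deadline, _, callback = heapq.heappop(self._heap)
        return deadline, callback

q = EdfQueue()
q.push(deadline=40, callback="chainA.instance1.cb2")
q.push(deadline=25, callback="chainB.instance3.cb1")
q.push(deadline=40, callback="chainA.instance1.cb3")
while True:
    try:
        print(q.pop())  # chainB first (earliest deadline), then chainA in FIFO order
    except IndexError:
        break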

11:50
Light-weight prediction for improving energy consumption in HPC platforms (Artifact)

ABSTRACT. With the increasing demand for computing resources and the struggle to provide the necessary energy, power-aware resource management is becoming a major issue for the High-Performance Computing (HPC) community. Adding reliable energy management to a supercomputer's resource and job management system (RJMS) is not an easy task. The energy consumption of jobs is rarely known in advance, and the workload of every machine is unique and different from the others.

We argue that the first step toward properly managing energy is to deeply understand the energy consumption of the workload, which involves predicting the workload's power consumption and exploiting it by using smart power-aware scheduling algorithms. Crucial questions are (i) how sophisticated a prediction method needs to be to provide accurate workload power predictions, and (ii) to what extent an accurate workload power prediction translates into efficient energy management.

In this work, we propose a method to predict and exploit HPC workloads' power consumption, with the objective of reducing the supercomputer's power consumption while maintaining the management (scheduling) performance of the RJMS. Our method exploits workload submission logs with power monitoring data, and relies on a mix of light-weight regression methods and classical first-fit and best-fit heuristics.

We base this study on logs of Marconi 100, a 980-server supercomputer. We show using simulation that a light-weight history-based prediction method can provide power predictions accurate enough to improve the energy management of a large-scale supercomputer compared to energy-unaware scheduling algorithms. These improvements have no significant negative impact on performance.
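
As a hypothetical example of such a light-weight, history-based predictor (the paper's actual features and regression methods may differ), a job's mean power can be estimated from the last few jobs submitted by the same user:

from collections import defaultdict, deque

class HistoryPowerPredictor:
    # Predict a job's mean power draw from the user's recent history;
    # fall back to a global default when no history exists.
    def __init__(self, window=5, default_power=200.0):
        self.history = defaultdict(lambda: deque(maxlen=window))
        self.default_power = default_power

    def predict(self, user):
        h = self.history[user]
        return sum(h) / len(h) if h else self.default_power

    def observe(self, user, measured_power):
        self.history[user].append(measured_power)

p = HistoryPowerPredictor()
print(p.predict("alice"))        # 200.0 (no history yet)
p.observe("alice", 310.0)
p.observe("alice", 290.0)
print(p.predict("alice"))        # 300.0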

12:30-13:30 Lunch Break
13:30-15:30 Session 8A: Scheduling, Resource Management, Cloud, Edge Computing, and Workflows (II)
Location: -1.A.01
13:30
Efficient Coupling Streaming AI and Ensemble Simulations on HPC Clusters
PRESENTER: Dan Huang

ABSTRACT. As scientific research grows more complex, a wide range of HPC-AI workflows have been proposed, in which machine learning models learn from interactions with ensemble simulations on HPC clusters. However, existing distributed frameworks and solutions fall short in facilitating this intensive integration, struggling with challenges like non-trivial application migration, heterogeneous task management, and inefficient I/O operations. To address these challenges, we present RTAI, a distributed runtime system aimed at optimizing emerging HPC-AI workflows. Specifically, it consists of several key components: RTAI Packer is a flexible abstraction that parallelizes the workflow and constructs a dynamic dependency graph of it. RTAI Orchestrator improves workflow efficiency and cluster-wide resource utilization by adaptively organizing resources and managing heterogeneous tasks. RTAI FileHub is a tailored ad-hoc file system employed for in-memory caching of dynamically generated files and for enhancing I/O efficiency. Our experiments show that RTAI significantly improves the makespan, cluster utilization, scalability, and usability of HPC-AI applications when compared to other candidate solutions like Ray and RADICAL-Pilot.

13:50
sAirflow: Adopting Serverless in a Legacy Workflow Scheduler
PRESENTER: Paweł Żuk

ABSTRACT. Serverless clouds promise efficient scaling, reduced toil and monetary costs. Yet, serverless-ing a complex, legacy application might require major refactoring and thus is risky. To reduce that risk, we propose to limit code modifications by relying on change data capture (CDC) and message queues for internal communication. To gain efficiency, we rely on Function-as-a-Service (FaaS). We take Airflow, an industry-standard workflow system. Our system, sAirflow, is the first adaptation of the control plane and workers to the serverless cloud, and it maintains the same interface and most of the code. Experimentally, we show that sAirflow delivers the key serverless benefits: scaling and cost reduction. We compare sAirflow to MWAA, a managed (SaaS) Airflow. On Alibaba benchmarks on warm systems, sAirflow performs similarly while halving the monetary cost. On highly parallel workflows on cold systems, sAirflow scales out in seconds to 125 workers, reducing makespan by 2x-7x.

14:10
Optimizing Service Replication and Placement for IoT Applications in Fog Computing Systems
PRESENTER: Farah Ait Salaht

ABSTRACT. Fog Computing extends Cloud Computing to the network edge, enhancing distributed computing to meet the growing needs of Internet of Things (IoT) applications requiring real-time or near-real-time analysis. This research focuses on efficiently managing the vast amounts of data generated by IoT devices, employing an advanced replication and placement strategy for application components across distributed Fog Computing nodes. This approach enables scalable and parallel data processing to adapt to demand fluctuations, prevent over-provisioning, and maintain low response times, making it particularly effective for the dynamic nature of data stream processing in IoT applications.

In this paper, we propose an Optimal IoT Service Replication and Placement (OSRP) model, formulated as a constraint satisfaction problem, that considers the diverse requirements of IoT applications and the available infrastructure resources. Our model is designed to be adaptive and easily extensible, addressing the challenge of workload variability through effective optimization. Numerical evaluations confirm the superior performance and scalability of our model over existing methods, while maintaining quality of service constraints. This highlights the potential of our approach to improve efficiency and resource management in Fog Computing environments.

14:30
Scheduling distributed I/O resources in HPC systems
PRESENTER: Alexis Bandet

ABSTRACT. This paper presents a comprehensive investigation on optimizing I/O performance in the access to distributed I/O resources in high-performance computing (HPC) environments. I/O resources, such as the I/O forwarding nodes and object storage targets (OST), are shared between a subset of applications. Each application has access to a subset of them, and multiple applications can access the same resources. We propose heuristics to schedule these distributed I/O resources in two steps: for a set of applications, determining the number of I/O resources each will use (allocation) and which resources they will use (placement). We discuss a wide range of required information about applications' characteristics that can be used by the scheduling algorithms. Despite the fact that a higher level of application knowledge is associated with enhanced performance, our comprehensive analysis indicates that strategic decision-making with limited information can still yield significant enhancements in most scenarios. This research provides insights into the trade-offs between the depth of application characterization and the practicality of scheduling I/O resources.

14:50
Node Bundle Scheduling: An Ultra-Low Latency Traffic Scheduling Algorithm for TAS-based Time-Sensitive Networks
PRESENTER: Qian Yang

ABSTRACT. Time Aware Shaper (TAS) and the corresponding traffic scheduling algorithms jointly ensure low-latency and low-jitter data transmission in Time-Sensitive Networking (TSN). However, existing TAS-based traffic scheduling algorithms suffer from either a high scheduling failure rate or heavy computation overhead. Such limitations make the algorithms unable to satisfy TSN applications with harsh scheduling requirements, such as ultra-low latency and frequent scenario reconfiguration, as found in smart factories for industrial control. Therefore, more efficient traffic scheduling algorithms are required for these specific TSN applications.

To achieve this goal, a Node Bundle Scheduling (NBS) algorithm is proposed in this paper. NBS is based on a TAS traffic scheduling abstraction that greatly simplifies the scheduling problem at the cost of sacrificing a certain amount of bandwidth. To improve the scheduling success rate and optimize scheduling efficiency, it considers resource allocation in both temporal and spatial dimensions. NBS is compared with state-of-the-art TAS-based scheduling algorithms under different industrial topologies. The results show that the scheduling success rate of NBS is 34.25% higher than that of Tabu Search under scenarios with relatively low latency requirements. Compared with a typical SMT solver, NBS can reduce the time overhead by over 99.9% in some complex scheduling scenarios, at the cost of an extra 11.86% bandwidth consumption.

13:30-15:30 Session 8B: Architectures and Accelerators (III)
Location: -1.A.05
13:30
Compact Parallel Hash Tables on the GPU (Artifact)
PRESENTER: Steef Hegeman

ABSTRACT. On the GPU, hash table operation speed is determined in large part by cache line efficiency, and state-of-the-art hashing schemes thus divide tables into cache line-sized buckets. This raises the question whether performance can be further improved by increasing the number of entries that fit in such buckets. Known compact hashing techniques have not yet been adapted to the massively parallel setting, nor have they been evaluated on the GPU. We consider a compact version of bucketed cuckoo hashing, and a version of compact iceberg hashing suitable for the GPU. We discuss the tables from a theoretical perspective, and provide an open source implementation of both schemes in CUDA for comparative benchmarking. In terms of performance, the state-of-the-art cuckoo hashing benefits from compactness on lookups and insertions (most experiments show at least 10–20% increase in throughput), and the iceberg table benefits significantly, to the point of being comparable to compact cuckoo hashing—while supporting performant dynamic operation.
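
For background, the insertion logic of bucketed cuckoo hashing can be sketched in a few lines of single-threaded Python; the GPU tables in the paper additionally pack compact entries into cache-line-sized buckets and handle concurrency, and the parameters below are arbitrary:

import random

class BucketedCuckoo:
    # Two hash functions, fixed-size buckets, evict-and-reinsert on overflow.
    def __init__(self, num_buckets=16, bucket_size=4, max_kicks=64):
        self.buckets = [[] for _ in range(num_buckets)]
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks

    def _slots(self, key):
        h1 = hash(("a", key)) % len(self.buckets)
        h2 = hash(("b", key)) % len(self.buckets)
        return h1, h2

    def insert(self, key):
        for _ in range(self.max_kicks):
            for b in self._slots(key):
                if len(self.buckets[b]) < self.bucket_size:
                    self.buckets[b].append(key)
                    return True
            # Both candidate buckets are full: evict a random victim and retry.
            bucket = self.buckets[random.choice(self._slots(key))]
            victim = bucket.pop(random.randrange(len(bucket)))
            bucket.append(key)
            key = victim
        return False  # table would need resizing

    def lookup(self, key):
        return any(key in self.buckets[b] for b in self._slots(key))

t = BucketedCuckoo()
for k in range(40):
    t.insert(k)
print(all(t.lookup(k) for k in range(40)))  # expected True at this load factor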

13:50
Hybrid Congestion Control for BXI-based Interconnection Networks

ABSTRACT. Congestion threatens the performance of high-performance interconnection networks in supercomputers and data centers, and it has attracted research attention in the last decades, resulting in many proposals for congestion control. The most successful ones involve either the use of traffic-injection throttling (e.g., DCQCN) to reduce the rate at which congesting flows are injected into the network, or the use of virtual channels (VCs) to separate congesting from non-congesting flows, thus reducing the main adverse effect of congestion: head-of-line (HoL) blocking. In the first case, the rate reduction may be based on obsolete information about the network status, so the mechanism may overreact, underreact, or react too late. The second approach does not achieve complete HoL-blocking elimination. This paper proposes a combination of both approaches into a hybrid congestion-control technique that uses VCs to separate congesting and non-congesting flows while, at the same time, throttling the injection of congesting flows at NICs when necessary. Our proposal can be applied to Ethernet-based switches, such as those being designed for the new version of the BXI network technology. More generally, it can be applied to any switch architecture using buffers at both input and output ports. From the results of the simulation experiments run to evaluate our proposal, we conclude that the hybrid technique significantly outperforms the congestion-control solutions usually implemented in current Ethernet-based networks.

14:10
Exploring processor micro-architectures optimised for BLAS3 micro-kernels
PRESENTER: Stepan Nassyr

ABSTRACT. Dense matrix-matrix operations are relevant for a broad range of numerical applications, including those based on machine learning methods. Past research has led to a good understanding of how these operations can be mapped in a generic manner onto typical processor architectures with multiple cache levels such that near-optimal performance can be reached. However, while commonly used micro-architectures are typically suitable for such operations, their architectural parameters need to be suitably tuned. The performance of highly optimised implementations of these operations relies on micro-kernels that are often hand-written. Given the increased variety of instruction set architectures and SIMD instruction extensions, this becomes challenging. In this paper, we present and implement a methodology for an exhaustive exploration of a processor core micro-architecture design space based on gem5 simulations. Furthermore, we present a tool for generating efficiently vectorised code leveraging Arm's SVE and RISC-V's RVV instructions. It enables the automation of micro-kernel generation and, therefore, the generation of a large range of such kernels. The results provide insights both to micro-architecture designers and to micro-kernel developers.

14:30
Watt: A Write-optimized RRAM-based Accelerator for Attention
PRESENTER: Xuan Zhang

ABSTRACT. Attention-based models, such as transformers, have achieved remarkable success across various tasks. However, their deployment is hindered by challenges such as high memory requirements, long inference latency, and significant power consumption. One potential solution for accelerating attention is the use of resistive random access memory (RRAM), which exploits processing-in-memory (PIM) capability. However, current RRAM-based accelerators suffer from expensive write costs. Accordingly, we propose Watt, a write-optimized RRAM-based accelerator for attention, which reduces the amount of intermediate data written to the crossbars and effectively alleviates workload imbalance among crossbars with limited capacity. Specifically, exploiting the importance and similarity of tokens in the sequences, we design an importance detector and a similarity detector to maximally compress the intermediate data K^T and V written to the crossbars. Furthermore, since many vectors in K^T and V are pruned, the number of vectors written to the crossbars differs between inferences, which leads to workload imbalance among crossbars. We therefore propose a workload-aware dynamic scheduler containing a top-k engine and a remapping engine: it first sorts the total write count of each crossbar and the write count of each inference with the top-k engine, and then schedules inference tasks onto the crossbars through the remapping engine. Experimental results show that Watt achieves on average 6.5x, 4.0x, and 2.1x speedup compared to the state-of-the-art accelerators Sanger, TransPIM, and ReTransformer, respectively. Meanwhile, it achieves on average 18.2x, 3.2x, and 2.8x energy savings with respect to the three accelerators.

13:30-15:30 Session 8C: Theory and Algorithms (II)
Location: -1.A.06
13:30
A Fast Wait-Free Solution to Read-Reclaim Races in Reference Counting (Artifact)

ABSTRACT. Reference counting is a safe alternative to manual memory management for multithreaded programs, and is supported by the standard libraries of several major programming languages (e.g., Arc in Rust, shared_ptr and atomic<shared_ptr> in C++).

In concurrent reference counting, read-reclaim races, where a read of a mutable variable races with a write that deallocates the old value, require special handling: use-after-free errors occur if the object is deallocated by the writing thread before the reading thread can increment the reference count. Existing solutions are not wait-free, have a space overhead and/or take time linear in the number of threads for updates.

We introduce an implementation for concurrent reference counting with no delayed reclamation, which is wait-free, easy to implement and fast in practice. Writes operate in constant time and reads usually in constant time, and in the worst-case, linear with respect to the number of threads that actually work with the variable. Our algorithm is based on the split reference count technique, which is used in production but is only lock-free. We re-explain this technique as a special case of weighted reference counting, to arrive at a simpler explanation of this technique that is often claimed to be difficult. We show our algorithm works well in practice by benchmarking it against other atomic shared pointer implementations.

13:50
QClique: Optimizing Performance and Accuracy in Maximum Weighted Clique
PRESENTER: Qasim Abbas

ABSTRACT. The Maximum Weighted Clique (MWC) problem remains challenging due to its unfavourable time complexity. In this paper, we analyze the execution of exact search-based MWC algorithms and show that high-accuracy weighted cliques can be discovered in the early stages of the execution if the combinatorial space is searched systematically. Based on this observation, we introduce QClique as an approximate MWC algorithm that processes the search space as long as better cliques are expected. QClique uses a tunable parameter to trade off accuracy against execution time and delivers a 4.7-82.3 times speedup in comparison to previous state-of-the-art MWC algorithms while providing 91.4% accuracy, and achieves a parallel speedup of up to 56 times on 128 threads. Additionally, we introduce a technique to accelerate exact MWC algorithms using QClique by priming the exact algorithm with a high-accuracy initial clique. This approach accelerates a state-of-the-art exact MWC algorithm by 3.3 times on average.

14:10
ALZI: An Improved Parallel Algorithm for Finding Connected Components in Large Graphs
PRESENTER: Maleq Khan

ABSTRACT. Finding connected components is a fundamental problem in graph and network analysis. It also serves as a subroutine in other graph problems. There are efficient sequential algorithms for finding connected components in a graph. However, a sequential algorithm can take a long time for a large graph. Parallel algorithms can significantly speed up computation using multiple processors. This paper presents a fast shared-memory parallel algorithm named ALZI (Afforest with LinkJump and Zero Implant) to find connected components in a graph. ALZI is an improvement of a recent state-of-the-art parallel algorithm called Afforest. We propose a few non-trivial optimizations that result in better performance in terms of runtime and scalability. We performed rigorous experimentation using a wide variety of real-world and artificial graphs to evaluate the performance of ALZI. The experimental results show that ALZI is 1.4-2.3 times faster than Afforest on these graphs and provides better scalability than Afforest. ALZI has the ability to work with very large graphs. On a Kronecker graph with 4.2 billion edges, ALZI can find the connected components in just 1.02 seconds using 128 processors.

14:30
Mixed precision randomized low-rank approximation with GPU tensor cores
PRESENTER: Matthieu Robeyns

ABSTRACT. Randomized projection methods have been shown to be very efficient at computing low-rank approximations (LRA) of large scale matrices. In this work, we investigate the design and development of such methods capable of exploiting modern mixed precision accelerators, and more specifically GPUs equipped with tensor core units. We combine three new ideas to exploit mixed precision arithmetic in randomized LRA. The first is to perform the matrix multiplication with mixed precision fp16/fp32 tensor cores. The second is to use CholeskyQR orthonormalization, which is much faster on GPUs, while mitigating its numerical instability by using fp64 arithmetic. The third is to use a recently proposed iterative refinement method for LRA to improve the accuracy of the LRA by calling it twice. We implement the proposed approach on various recent GPU architectures and analyze its performance and accuracy. We compare with a standard randomized LRA entirely in fp32 arithmetic, which achieves an average accuracy of order 10^-4. Our results show that our approach without refinement is up to 10x faster, with an average accuracy of order 10^-2, which may be acceptable for some applications. Otherwise, we show that using refinement significantly improves the accuracy to an average of order 10^-5, while remaining up to 2.5x faster than the standard fp32 randomized LRA.
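
A NumPy sketch of the overall recipe (an illustrative CPU emulation with made-up sizes, not the tensor-core implementation): the sketching product uses fp16-rounded inputs with fp32 accumulation, the CholeskyQR orthonormalization runs in fp64, and the final projection in full precision:

import numpy as np

def randomized_lra(A, rank, oversample=8):
    # Rank-r approximation A ~ Q @ B using a random Gaussian sketch.
    m, n = A.shape
    omega = np.random.randn(n, rank + oversample)
    # 1) Round inputs to fp16 and multiply with fp32 accumulation,
    #    emulating fp16/fp32 tensor-core GEMM on the CPU.
    Y = A.astype(np.float16).astype(np.float32) @ omega.astype(np.float16).astype(np.float32)
    # 2) CholeskyQR in fp64: Q = Y * R^{-1} with R^T R = Y^T Y.
    Y64 = Y.astype(np.float64)
    R = np.linalg.cholesky(Y64.T @ Y64).T
    Q = np.linalg.solve(R.T, Y64.T).T
    # 3) Project A onto the captured subspace.
    B = Q.T @ A
    return Q, B

np.random.seed(0)
A = np.random.randn(512, 64) @ np.random.randn(64, 256)  # exactly rank 64
Q, B = randomized_lra(A, rank=64)
err = np.linalg.norm(A - Q @ B) / np.linalg.norm(A)
print(f"relative error: {err:.2e}")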

14:50
GPU-Accelerated BFS for Dynamic Networks
PRESENTER: Filippo Ziche

ABSTRACT. The breadth-first-search (BFS) algorithm serves as a fundamental building block for graph traversal with a wide range of applications, spanning from the electronic design automation (EDA) field to social network analysis. Many contemporary real-world networks are dynamic and evolve rapidly over time. In such cases, recomputing the BFS from scratch after each graph modification becomes impractical. While parallel solutions, particularly for GPUs, have been introduced to handle the size complexity of static networks, none have addressed the issue of work-efficiency in dynamic networks. In this paper, we propose a GPU-based BFS implementation capable of processing batches of network updates concurrently. Our solution leverages batch information to minimize the total workload required to update the BFS result while also enhancing data locality for future updates. We also introduce a technique for relabeling nodes, enhancing locality during dynamic BFS traversal. We present experimental results on a diverse set of large networks with varying characteristics and batch sizes.
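
The incremental idea can be illustrated with a serial Python sketch (not the GPU implementation): after a batch of edge insertions, only vertices whose BFS distance can improve are re-relaxed instead of recomputing from scratch:

from collections import deque

def bfs_insert_batch(adj, dist, batch):
    # Apply a batch of inserted edges (u, v) to an existing BFS distance
    # map and propagate improvements only from affected vertices.
    frontier = deque()
    for u, v in batch:
        adj[u].append(v)
        adj[v].append(u)                      # undirected graph
        for a, b in ((u, v), (v, u)):
            if dist[a] + 1 < dist[b]:
                dist[b] = dist[a] + 1
                frontier.append(b)
    while frontier:                            # re-relax only what changed
        x = frontier.popleft()
        for y in adj[x]:
            if dist[x] + 1 < dist[y]:
                dist[y] = dist[x] + 1
                frontier.append(y)
    return dist

# Path 0-1-2-3-4 rooted at 0, then a shortcut edge (0, 4) arrives.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
dist = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4}
print(bfs_insert_batch(adj, dist, [(0, 4)]))  # vertex 4 drops to 1, vertex 3 to 2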

13:30-15:30 Session 8D: Programming, Compilers and Performance (II)
Location: -1.A.04
13:30
Deconstructing HPL-MxP benchmark: a numerical perspective
PRESENTER: Eric Petit

ABSTRACT. HPL-MxP has become a widely accepted benchmark for assessing high-performance computing (HPC) systems' capabilities for Artificial Intelligence (AI) workloads. However, the benchmark's representativeness of real-world HPC and AI workloads is unclear. In this paper, we discuss the HPL-MxP benchmark from a numerical perspective and propose new rules and data generation for a numerically sound comparison of low-precision format implementations. We present experiments showing that the current HPL-MxP benchmark cannot be considered a numerically relevant benchmark to prove the numerical superiority of a new format or algorithm. We propose to better specify the requirements for numerical formats to produce comparable performance numbers, and suggest new input data generation to make the benchmark numerically relevant. We validate our proposal on Int8, Int4, and BF16 implementations to demonstrate the numerical significance of the benchmark using our new generator.

13:50
OMPGPT: A Generative Pre-trained Transformer Model for OpenMP

ABSTRACT. Large language models (LLMs) such as ChatGPT have significantly advanced the field of Natural Language Processing (NLP). This trend led to the development of code-based large language models such as StarCoder, WizardCoder, and CodeLlama, which are trained extensively on vast repositories of code and programming languages. While the generic abilities of these code LLMs are useful for many programmers in tasks like code generation, the area of high-performance computing (HPC) has a narrower set of requirements that make a smaller and more domain-specific model a smarter choice. This paper presents OMPGPT, a novel domain-specific model meticulously designed to harness the inherent strengths of language models for OpenMP pragma generation. Furthermore, we leverage prompt engineering techniques from the NLP domain to create Chain-of-OMP, an innovative strategy designed to enhance OMPGPT's effectiveness. Our extensive evaluations demonstrate that OMPGPT outperforms existing large language models specialized in OpenMP tasks and maintains a notably smaller size, aligning it more closely with the typical hardware constraints of HPC environments. We consider our contribution as a pivotal bridge, connecting the advantage of language models with the specific demands of HPC tasks.

14:10
ImageMap: Enabling Efficient Mapping from Image Processing DSL to CGRA
PRESENTER: Bizhao Shi

ABSTRACT. Image processing is a field with extensive practical applications and rapidly evolving algorithms, demanding hardware platforms that offer both high energy efficiency and flexible programmability. Coarse-grained Reconfigurable Arrays (CGRAs) exhibit significant potential due to their parallel operator-level PEs and spatio-temporal reconfigurability. However, the mapping of image processing applications onto CGRAs presents two primary challenges: 1) the complexities of low-level CGRA programming, constrained by multiple factors, create obstacles for developers; and 2) the disparities between coarse-grained image pipelines and the fine-grained pipelined loops executed on CGRAs result in a vast program transformation space. To address these challenges, we introduce ImageMap, a sophisticated mapping framework derived from Halide, a domain-specific language for image processing, tailored specifically for CGRAs. Our approach involves multi-level partitioning with extended Halide scheduling primitives to systematically decompose complex applications. Additionally, we propose a hierarchical program exploration algorithm designed to cater to CGRAs, taking into account partitioning dimensions and CGRA performance modeling. Furthermore, we develop an automated compilation framework and integrate various optimization techniques to enhance mapping quality. In the evaluations, ImageMap demonstrates significant performance improvements compared to existing studies across various CGRA architectures and applications. Besides, the proposed automated compilation workflow also effectively bridges the gap between software programming and CGRA execution.

14:30
Predicting GPU kernel's performance on upcoming architectures
PRESENTER: Lucas Van Lanker

ABSTRACT. With the advent of heterogeneous systems that combine CPUs and GPUs, designing a supercomputer becomes more and more complex. The hardware characteristics of GPUs significantly impact the performance. Choosing the GPU that will maximize performance for a limited budget is tedious because it requires predicting the performance on a hardware platform that does not yet exist. In this paper, we propose a new methodology for predicting the performance of kernels running on GPUs. This method analyzes the behavior of an application running on an existing platform, and projects its performance on another GPU based on the target hardware characteristics. The performance projection relies on a hierarchical roofline model as well as on a comparison of the kernel's assembly instructions on both GPUs to estimate the operational intensity of the target GPU. We demonstrate the validity of our methodology on modern NVIDIA GPUs with several mini-applications. The experiments show that the performance is predicted with a mean absolute percentage error of 20.3% for LULESH, 10.2% for MiniMDock, and 5.9% for Quicksilver.
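
The basic (single-level) roofline bound underlying such a projection looks as follows; the peak rates and kernel counters below are made-up placeholders, and the paper's method additionally uses a hierarchical roofline and rescales operational intensity from the kernels' assembly on both GPUs:

def roofline_time(flops, bytes_moved, peak_flops, peak_bandwidth):
    # Lower-bound execution time from the classic roofline model:
    # the kernel is limited either by compute or by memory traffic.
    compute_time = flops / peak_flops
    memory_time = bytes_moved / peak_bandwidth
    return max(compute_time, memory_time)

# Hypothetical kernel: 2e12 FLOP, 4e11 bytes of DRAM traffic.
flops, bytes_moved = 2e12, 4e11

# Hypothetical source and target GPUs: (peak FLOP/s, peak bytes/s).
source = (10e12, 0.9e12)
target = (20e12, 2.0e12)

t_source = roofline_time(flops, bytes_moved, *source)
t_target = roofline_time(flops, bytes_moved, *target)
print(f"source: {t_source:.3f} s, projected target: {t_target:.3f} s")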

14:50
A Mechanism to Generate Interception Based Tools for HPC Libraries
PRESENTER: Bengisu Elis

ABSTRACT. Software tools are integral components of the HPC software stack and provide invaluable measurements and insights into application run time and system behaviour to end users, code developers, and system administrators. However, most tools currently do not support performance analysis at the granularity of libraries, which are the most important level of abstraction for code when developing modern applications. To overcome this limitation, we present a novel infrastructure that can auto-generate tool interfaces enabling interception at library level. This opens the door to deploying tools at the right level of abstraction and, with that, to many use cases previously impossible or infeasible. We demonstrate an implementation of our approach alongside several use cases that show how such library-level tooling can support application and system optimization.

15:30-16:00 Coffee Break
16:00-17:30 Session 9A: Data analytics, AI, and Computational Science (III)
Location: -1.A.06
16:00
PEANUTS: A Persistent Memory-Based Network Unilateral Transfer System for Enhanced MPI-IO Data Transfer (Artifact)
PRESENTER: Kohei Hiraga

ABSTRACT. This paper introduces PEANUTS, a system designed to improve I/O performance for Single Shared File (SSF) in high-performance computing scenarios. I/O performance evaluation of SSF on 100 nodes demonstrated write speeds of 2.47 TB/s, remote read speeds of 2.39 TB/s, and local read speeds of 7.75 TB/s. These outcomes are close to the hardware's performance limits and represent significant improvements of approximately 2.2 times, 2.3 times, and 7.5 times, respectively, compared to existing state-of-the-art systems. A major feature of PEANUTS is the integration of persistent memory with RDMA one-sided communication, supporting high-speed and low-latency data transfers without the need for separate storage servers on the compute nodes. This configuration allows all compute node CPU cores to be fully available for application processing. Seamless integration with the MPI runtime enables rapid data sharing through the MPI-IO interface. The advancements by PEANUTS suggest that utilizing large-scale SSF could solidify as the standard I/O framework in HPC, demonstrating a viable solution to overcome traditional performance limitations.

16:20
Asymmetric Coded Distributed Computation for Resilient Prediction Serving Systems
PRESENTER: Lin Wang

ABSTRACT. With the surge of AI services, prediction serving systems (PSSes) have been widely deployed. PSSes are often run on many workers and thus are prone to stragglers (slowdowns or failures), so it is critical to design straggler-resilient PSSes for low prediction latency. A traditional way is replication, which assigns the same prediction job to multiple workers but incurs significant resource overheads due to its redundant jobs. Recently, coded distributed computation (CDC) has become a more resource-efficient alternative to replication, as it encodes the prediction job into parity units for prediction reconstruction via decoding. However, we find that state-of-the-art CDC methods either trade accuracy for low latency with both a simple encoder and decoder, or trade latency for high accuracy with both a complicated encoder and decoder, leading to an imbalance between accuracy and latency due to this symmetry between the encoder and decoder. Our insight is that the encoder is always used in CDC-based prediction, while the decoder is only used when stragglers occur. In this paper, we first propose a new asymmetric CDC framework based on this insight, called AsymCDC, composed of a simple encoder but a complicated decoder, such that the encoder's simplicity yields a low encoding time that largely reduces latency, while the decoder's complexity helps accuracy. Further, we design the decoder in two steps: i) an exact decoding method that leverages an invertible neural network's (INN) invertibility so that decoding has no accuracy loss, and ii) a decoder compacting method that reshapes INN outputs to effectively exploit knowledge distillation, compacting the decoder for low decoding time. We prototype AsymCDC atop Clipper, and experiments show that the prediction accuracy of AsymCDC is approximately the same as that of state-of-the-art methods with both a complicated encoder and decoder, while the latency of AsymCDC exceeds that of state-of-the-art methods with both a simple encoder and decoder by no more than 2.6%.

16:40
Athena: Add More Intelligence to RMT-based Network Data Plane with Low-bit Quantization
PRESENTER: Yunkun Liao

ABSTRACT. Performing per-packet neural network (NN) inference on the network data plane is a promising approach for accurate and fast decision-making in computer networks. However, data-plane architectures such as the Reconfigurable Match Tables (RMT) pipeline have limited support for NN computation. Previous efforts have used Binary Neural Networks (BNNs) as a compromise, but the accuracy loss of BNNs is high. Inspired by the accuracy gains of low-bit (2-bit and 4-bit) models, this paper proposes Athena, which can deploy sparse low-bit quantized models on RMT. Compared with the BNN-based state of the art, Athena achieves a new Pareto frontier with respect to model accuracy and inference latency.
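As a rough illustration of what "low-bit" means here, the sketch below applies a symmetric uniform 2-bit quantizer to a random weight matrix. The level set and scaling rule are assumptions; the paper's actual quantization scheme, sparsity handling, and RMT mapping are not reproduced.

    import numpy as np

    def quantize_2bit(w):
        # Symmetric uniform quantizer to four levels {-2, -1, 0, 1} * scale
        # (one common 2-bit scheme; the paper's exact quantizer may differ).
        scale = np.abs(w).max() / 2.0
        q = np.clip(np.round(w / scale), -2, 1).astype(np.int8)
        return q, scale

    rng = np.random.default_rng(1)
    w = rng.normal(size=(4, 4)).astype(np.float32)
    q, s = quantize_2bit(w)
    w_hat = q.astype(np.float32) * s      # dequantized weights used at inference
    print(np.abs(w - w_hat).max())        # worst-case reconstruction error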

17:00
Lightweight Byzantine-Robust and Privacy-Preserving Federated Learning
PRESENTER: Zhi Lu

ABSTRACT. Federated learning (FL) is a distributed machine learning approach that reduces data transfer by aggregating gradients from multiple users. However, this process raises concerns about user privacy, which has led to the emergence of privacy-preserving FL. Unfortunately, this development poses new Byzantine-robustness challenges, as poisoning attacks become difficult to detect. Existing Byzantine-robust algorithms operate primarily in plaintext and, crucially, current Byzantine-robust privacy-preserving FL methods fail to concurrently defend against adaptive attacks. In response, we propose a lightweight, Byzantine-robust, and privacy-preserving federated learning framework (LRFL) that employs shuffle functions and encryption masks to ensure privacy. In addition, we comprehensively evaluate the similarity of both the direction and the magnitude of each gradient vector to ensure Byzantine robustness. To the best of our knowledge, LRFL is the first Byzantine-robust, privacy-preserving FL framework capable of identifying malicious users based on gradient angles and magnitudes. Moreover, the theoretical complexity of LRFL is $\mathcal{O}(dN + dN\log N)$, comparable to Byzantine-robust FL with $N$ users and gradient dimension $d$. Experimental results demonstrate that LRFL achieves accuracy similar to state-of-the-art methods under multiple attack scenarios.
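A minimal sketch of the "direction plus magnitude" idea follows: each client's gradient is scored by its cosine similarity to a robust reference (here the coordinate-wise median) and by how far its norm deviates from the median norm. The scoring rule and reference choice are assumptions for illustration only; LRFL additionally operates under shuffling and encryption masks, which are not modeled.

    import numpy as np

    def robustness_scores(grads):
        # Combine a direction score (cosine vs. the coordinate-wise median gradient)
        # with a magnitude score (closeness of the norm to the median norm).
        G = np.stack(grads)
        ref = np.median(G, axis=0)
        cos = G @ ref / (np.linalg.norm(G, axis=1) * np.linalg.norm(ref) + 1e-12)
        norms = np.linalg.norm(G, axis=1)
        mag = np.exp(-np.abs(norms - np.median(norms)) / (np.median(norms) + 1e-12))
        return cos * mag

    rng = np.random.default_rng(0)
    honest = [rng.normal(0, 0.1, 100) + 1.0 for _ in range(8)]
    byzantine = [-5.0 * np.ones(100) for _ in range(2)]   # sign-flipped, scaled gradients
    print(robustness_scores(honest + byzantine).round(3)) # the last two scores are clearly lower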

17:20
FedGG: Leveraging Generative Adversarial Networks and Gradient Smoothing for Privacy Protection in Federated Learning
PRESENTER: Shuchun Xu

ABSTRACT. Gradient leakage attacks allow attackers to infer private data, raising concerns about data leakage. A series of methods has been proposed to address this problem, but previous approaches have two weaknesses. First, adding noise (e.g., differential privacy) to client-shared gradients reduces private data leakage but degrades model performance and still leaves room for data recovery attacks (e.g., gradient leakage attacks). Second, encrypting shared gradients (e.g., with homomorphic encryption) enhances security but demands prohibitively high computational costs, making it impractical for resource-constrained edge devices. This work proposes FedGG, a novel federated learning method that leverages generative adversarial networks and gradient smoothing: it generates pseudo-data with a Wasserstein GAN (WGAN) while retaining classification characteristics, gradient smoothing suppresses high-frequency gradient components, and mixup-based data augmentation improves the diversity of the training data. Experiments show that, compared with common defense methods, the MES-I with noise and with gradient clipping is 0.5278 and 0.1036, respectively, while the MES-I of FedGG is 0.6422.
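Two of the ingredients named in the abstract, gradient smoothing and mixup augmentation, are simple enough to sketch directly. The versions below, a moving-average low-pass filter and standard mixup, are plausible stand-ins rather than necessarily the exact operators used by FedGG, and the WGAN pseudo-data generator is omitted.

    import numpy as np

    def mixup(x1, y1, x2, y2, alpha=0.2):
        # Standard mixup: convex combination of two samples and their one-hot labels.
        lam = np.random.default_rng().beta(alpha, alpha)
        return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

    def smooth_gradient(grad, k=5):
        # Moving-average low-pass filter over the flattened gradient: one plausible
        # form of "gradient smoothing" that suppresses high-frequency components.
        kernel = np.ones(k) / k
        return np.convolve(grad, kernel, mode="same")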

16:00-17:30 Session 9B: Multidisciplinary, Domain-Specific and Applied Parallel and Distributed Computing (III)
Location: -1.A.01
16:00
Vectorizing Sparse Blocks of Graph Matrices for SpMV
PRESENTER: Yuang Chen

ABSTRACT. Sparse Matrix-Vector Multiplication (SpMV) plays a pivotal role in a wide range of scientific computations and graph processing algorithms. However, SpMV on graph matrices often suffers from inefficient cache utilization and imbalanced workloads. This paper presents a novel solution, named "VeCa", that accelerates SpMV for sparse graph matrices by integrating selective vectorization with hierarchical blocking. First, the matrix is divided into small blocks that fit in the cache, with multi-level partitioning applied according to the estimated workload per block. Then, the rows within each block are selectively vectorized with distinct instruction sets. Experimental results show that VeCa considerably outperforms state-of-the-art SpMV methods and graph processing systems, achieving a speedup of 1.29x over the second-fastest approach. Moreover, a comprehensive evaluation analyzing performance factors, branch prediction, cache efficiency, and parameter tuning is conducted to provide a thorough understanding of VeCa's efficacy.
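The hierarchical-blocking idea can be pictured as processing CSR rows in cache-sized groups, with the per-row inner product as the vectorization-friendly kernel. The NumPy sketch below shows only that blocking structure; the selective choice among SIMD instruction sets that VeCa performs per row is not modeled, and the block size is an arbitrary assumption.

    import numpy as np

    def spmv_blocked(indptr, indices, data, x, block_rows=1024):
        # Cache-blocked CSR SpMV: rows are handled in blocks sized to fit in cache;
        # the dense inner product is the part a vectorized kernel would accelerate.
        n = len(indptr) - 1
        y = np.zeros(n)
        for start in range(0, n, block_rows):
            for r in range(start, min(start + block_rows, n)):
                lo, hi = indptr[r], indptr[r + 1]
                y[r] = np.dot(data[lo:hi], x[indices[lo:hi]])
        return y

    # 3x3 example: [[4,0,1],[0,2,0],[3,0,5]]
    indptr  = np.array([0, 2, 3, 5])
    indices = np.array([0, 2, 1, 0, 2])
    data    = np.array([4., 1., 2., 3., 5.])
    print(spmv_blocked(indptr, indices, data, np.array([1., 1., 1.])))  # [5. 2. 8.]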

16:20
On the use of hybrid computing for accelerating EEG preprocessing
PRESENTER: L. Felipe Romero

ABSTRACT. Electroencephalography (EEG) records have multiple applications in medicine, such as epileptic seizure detection. Massive datasets of EEG signals are also used for training machine learning models. However, these data are usually stored in large files in a complex format called EDF, and processing such files can be extremely computationally demanding. This study describes a novel approach that combines a high-speed EDF file reader library developed in C++ with GPU-accelerated Fast Fourier Transforms (FFT). It improves the computational efficiency of working with this kind of file and outperforms existing solutions for this purpose. In addition, the integrated FFT processing capabilities make it possible to simplify EDF data and select sampling frequencies efficiently. Together, these advancements improve data handling and processing, advancing this research field.
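The FFT stage amounts to turning each EEG channel into a one-sided magnitude spectrum so dominant frequency components can be selected. The NumPy sketch below shows that step on a synthetic 10 Hz signal; a GPU version would apply the same transform with a CUDA FFT library, and the 256 Hz sampling rate is an assumption.

    import numpy as np

    fs = 256                                    # assumed EEG sampling rate (Hz)
    t = np.arange(0, 10, 1 / fs)
    sig = np.sin(2 * np.pi * 10 * t) + 0.5 * np.random.default_rng(0).normal(size=t.size)

    spec = np.abs(np.fft.rfft(sig)) / t.size    # one-sided magnitude spectrum
    freqs = np.fft.rfftfreq(t.size, d=1 / fs)
    print(freqs[spec.argmax()])                 # dominant frequency, ~10 Hz here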

16:40
AdapCK: Optimizing I/O for Checkpointing on Large-scale High Performance Computing Systems
PRESENTER: Jie Jia

ABSTRACT. With the scaling-up of high-performance computing (HPC) systems, resilience has become an important challenge. As a widely used resilience technique for HPC systems, checkpointing saves checkpoints of the system state during the execution of parallel programs and, in case of failure, restarts the program from the most recent checkpoint. However, large-scale parallel programs often spawn thousands of processes, resulting in large volumes of simultaneous data writes at each checkpoint, which stresses the storage and parallel file systems of HPC platforms. To tackle this problem, this paper proposes AdapCK, an I/O-optimization scheme for checkpointing on large-scale HPC systems. AdapCK consists of two main parts: a load-balancing mechanism that balances checkpoint workloads across low-level storage volumes, and a throughput-aware checkpoint-writing mechanism that reduces I/O contention and increases I/O-bandwidth utilization. Experimental results show that AdapCK reduces checkpoint time by more than 30%, and by up to 54.5%.
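One simple way to read the load-balancing part is as a least-loaded assignment of checkpoint chunks to storage volumes. The sketch below uses a greedy largest-first heuristic with a min-heap of volume loads; it illustrates the balancing goal only and is not claimed to be AdapCK's actual placement algorithm.

    import heapq

    def place_checkpoints(sizes, n_volumes):
        # Greedy least-loaded placement: each chunk (largest first) goes to the
        # storage volume with the smallest accumulated load so far.
        heap = [(0, v) for v in range(n_volumes)]
        heapq.heapify(heap)
        placement = [None] * len(sizes)
        for idx in sorted(range(len(sizes)), key=lambda i: -sizes[i]):
            load, vol = heapq.heappop(heap)
            placement[idx] = vol
            heapq.heappush(heap, (load + sizes[idx], vol))
        return placement

    print(place_checkpoints([40, 10, 30, 30, 20], n_volumes=2))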

17:00
Pipe-AGCM: A Fine-grain Pipelining Scheme for Optimizing the Parallel Atmospheric General Circulation Model
PRESENTER: Dazheng Liu

ABSTRACT. The Atmospheric General Circulation Model (AGCM) is widely used for predicting climate change and extreme weather. AGCM has been scaled to thousands of cores on HPC platforms to provide timely forecasts. However, communication overhead significantly impacts the performance and scalability of AGCM when a two-dimensional domain decomposition is used for parallelization. To tackle this issue, we introduce Pipe-AGCM, a fine-grain pipelining scheme that reduces communication overhead. We divide the latitudes into groups so that the communication for one group overlaps with the computation on another. By adjusting the number of groups, we can tune the pipelining scheme to optimize AGCM performance. Pipe-AGCM demonstrated efficiency and scalability in experiments on HPC platforms, reducing communication time and wall-clock time by 78.02% and 29.68%, respectively, in the best case. This optimized scheme shows significant potential for accelerating atmospheric models on large-scale supercomputers.
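The pipelining scheme boils down to splitting the local latitudes into groups and starting the halo exchange for one group while computing on the previous one. The mpi4py sketch below shows that overlap pattern with nonblocking sends and receives; the grid sizes, neighbor pattern, and the "physics kernel" stand-in are assumptions, not the AGCM code itself.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    nlat, nlon, ngroups = 64, 128, 4
    rows_per_group = nlat // ngroups
    field = np.random.default_rng(rank).random((nlat, nlon))
    halo = np.empty((rows_per_group, nlon))
    right, left = (rank + 1) % size, (rank - 1) % size
    partial_sums = []

    for g in range(ngroups):
        rows = slice(g * rows_per_group, (g + 1) * rows_per_group)
        # Start the halo exchange for group g ...
        reqs = [comm.Isend(field[rows], dest=right, tag=g),
                comm.Irecv(halo, source=left, tag=g)]
        # ... and overlap it with computation on the previous group of latitudes.
        if g > 0:
            prev = slice((g - 1) * rows_per_group, g * rows_per_group)
            partial_sums.append(field[prev].sum())   # stand-in for the physics kernel
        MPI.Request.Waitall(reqs)

    partial_sums.append(field[-rows_per_group:].sum())  # finish the last group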

16:00-17:30 Session 9C: Scheduling, Resource Management, Cloud, Edge Computing, and Workflows (III)
Location: -1.A.05
16:00
Hurry: Dynamic Collaborative Framework For Low-orbit Mega-Constellation Data Downloading
PRESENTER: Handong Luo

ABSTRACT. Low-orbit mega-constellation networks (LMCNs), which use thousands of satellites to provide a variety of network services and collect a wide range of space information, are a rapidly growing field. Each satellite collects terabytes of data daily, including delay-sensitive data used for crucial tasks such as military surveillance, natural disaster monitoring, and weather forecasting. According to NASA, these data need to be downloaded to the ground for processing within 3-5 hours. To reduce the time required for satellite data downloads, the state-of-the-art solution, CoDld, which is designed only for small constellations, uses an iterative method for cooperative downloads via inter-satellite links (ISLs). However, in an LMCN, the time CoDld requires to download the same amount of data increases exponentially compared to a small constellation. We identify and analyze the reasons for this degradation and propose a new satellite data download framework named Hurry. By modeling satellite topology changes and data transmission as Time-Expanded Graphs, we implement our algorithm within the Hurry framework to avoid the degradation effect. In the fixed-data-volume evaluation, Hurry completes 100% of the download task while CoDld reaches only 44%. In the continuous data generation evaluation, the Hurry flow algorithm improves throughput by 11% to 66% over CoDld across different scenarios.
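The Time-Expanded Graph idea is to replicate each node per time slot, add storage edges between consecutive copies of the same satellite, add contact edges (ISL or downlink) only in the slots where they exist, and then push flow from data sources to the ground. The networkx sketch below builds a tiny such graph and runs a max-flow; the topology, capacities, and slot count are invented for illustration, and the sketch does not reproduce Hurry's actual flow algorithm.

    import networkx as nx

    T = 3
    G = nx.DiGraph()
    G.add_edge("src", "sat1@0", capacity=10)            # data already on satellite 1
    for t in range(T - 1):                               # storage edges across time
        for s in ("sat1", "sat2"):
            G.add_edge(f"{s}@{t}", f"{s}@{t+1}", capacity=100)
    G.add_edge("sat1@0", "sat2@0", capacity=4)           # ISL contact in slot 0
    G.add_edge("sat2@1", "ground@1", capacity=6)         # downlink contacts
    G.add_edge("sat1@2", "ground@2", capacity=5)
    for t in range(T):
        G.add_edge(f"ground@{t}", "sink", capacity=1000)

    flow_value, _ = nx.maximum_flow(G, "src", "sink")
    print(flow_value)   # total data that can reach the ground within T slots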

16:20
Context-aware Runtime Type Prediction for Heterogeneous Microservices
PRESENTER: Yibing Lin

ABSTRACT. Serverless functions are becoming increasingly popular as a new runtime type for application execution, but they are not suitable for arbitrary microservices. Different components of a microservice application are often best deployed with different runtime types according to their own attributes and workload characteristics. However, the complex topology of microservice applications makes it difficult to determine the optimal runtime type for each microservice, and existing container-based microservice systems support only a single runtime type. We therefore propose a unified orchestration solution for heterogeneous runtimes to address these problems. First, we propose an execution-need characterization model for microservice applications and introduce a method for analyzing microservice resource sensitivity types. Second, we propose a graph neural network-based approach for context-aware prediction of heterogeneous microservice runtime types, which synthesizes the characteristics of each component and the correlations between components to determine the optimal runtime type for each microservice. Third, we design and implement a unified orchestration system for heterogeneous microservice applications that supports user-independent, automated orchestration of serverful and serverless microservices. Finally, we validate the system's advantages in service performance and cost efficiency through experiments on real clusters.
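As a toy picture of the GNN-based prediction, the sketch below performs one round of mean-aggregation message passing over a three-service call graph and applies a linear classifier over {container, serverless}. The features, weights, and graph are entirely made up; the paper's actual model and training procedure are not reproduced.

    import numpy as np

    A = np.array([[0, 1, 0],      # adjacency of a 3-service application
                  [1, 0, 1],
                  [0, 1, 0]], dtype=float)
    X = np.array([[0.9, 0.1],     # per-service features, e.g. [cpu_sensitivity, burstiness]
                  [0.2, 0.8],
                  [0.3, 0.7]])
    A_hat = A + np.eye(3)                         # include self-loops
    D_inv = np.diag(1.0 / A_hat.sum(axis=1))
    H = D_inv @ A_hat @ X                         # aggregate own + neighbour features
    W = np.array([[1.0, -1.0], [-1.0, 1.0]])      # toy classifier weights
    runtime = np.where((H @ W).argmax(axis=1) == 0, "container", "serverless")
    print(runtime)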

16:40
PriCE: Privacy-Preserving and Cost-Effective Scheduling for Parallelizing the Large Medical Image Processing Workflow over Hybrid Clouds
PRESENTER: Yuandou Wang

ABSTRACT. Running deep neural networks on large medical images is a resource-hungry and time-consuming task with centralized computing. Outsourcing such medical image processing tasks to hybrid clouds has benefits, such as a significant reduction in execution time and monetary cost, but privacy concerns still make it challenging to process sensitive medical images in the cloud, hindering deployment in many real-world applications. To overcome this, we first formulate the overall optimization objectives of the privacy-preserving distributed system model: minimizing the amount of information about the private data learned by adversaries throughout the process, and reducing the maximum execution time and cost under the user's budget constraint. We then propose a novel privacy-preserving and cost-effective solution called PriCE to solve this multi-objective optimization problem. We performed extensive simulation experiments on artifact detection tasks for medical images, using an ensemble of five deep convolutional neural network inferences as the workflow task. Experimental results show that PriCE successfully splits a wide range of gigapixel medical images with graph-coloring-based strategies, yielding the desired output utility while lowering the privacy risk, maximum completion time, and monetary cost within the given budget.
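The graph-coloring-based splitting can be pictured as follows: image tiles are nodes, edges connect tiles that should not be co-located on the same cloud, and each color class becomes one cloud's share. The networkx sketch below shows only that splitting step on a toy tile grid; the adjacency criterion and grid size are assumptions, and PriCE's cost, privacy, and completion-time scheduling is not modeled.

    import networkx as nx

    tiles = nx.grid_2d_graph(4, 4)                    # 4x4 tile grid, edges = adjacency
    coloring = nx.greedy_color(tiles, strategy="largest_first")
    parts = {}
    for tile, color in coloring.items():
        parts.setdefault(color, []).append(tile)
    for cloud, assigned in sorted(parts.items()):     # one color class per cloud
        print(f"cloud {cloud}: {assigned}")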

17:00
EKRM: Efficient Key-Value Retrieval Method to Reduce Data Lookup Overhead for Redis
PRESENTER: Yiming Yao

ABSTRACT. As an open-source key-value store, Redis is widely used in internet services. A key-value lookup in Redis usually involves several chained memory accesses, and the address translation overhead can significantly increase lookup latency. This paper introduces a new software-based approach that reduces chained memory accesses and the total address translation overhead of lookup requests by placing key-value entries in a specially managed memory space organized as huge pages with a fast hash table, and by enabling a fast lookup path with simple hash functions, while keeping the integrity of Redis data structures. The new approach yields up to a 1.38x average speedup for the key-value retrieval process and significantly reduces TLB and last-level cache misses. It outperforms SLB, a software address-caching approach, and matches the performance of STLT, a software-hardware co-designed address-centric design.
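The locality argument is that a lookup into one flat, preallocated table touches a small contiguous region instead of chasing chained pointers, which also keeps TLB pressure low when that region sits on huge pages. The sketch below shows an open-addressing table stored in a single preallocated array to illustrate that layout; it does not model Redis internals, huge-page allocation, or the paper's specific hash functions, and it assumes integer keys and a table that never fills up.

    import numpy as np

    class FlatHash:
        # Open-addressing hash table kept in one preallocated flat array: a lookup
        # probes consecutive slots of the same region rather than following pointers.
        def __init__(self, capacity=1024):
            self.keys = np.full(capacity, -1, dtype=np.int64)   # -1 marks an empty slot
            self.vals = np.zeros(capacity, dtype=np.int64)

        def _slot(self, key):
            i = hash(key) % self.keys.size
            while self.keys[i] not in (-1, key):                # linear probing
                i = (i + 1) % self.keys.size
            return i

        def put(self, key, val):
            i = self._slot(key)
            self.keys[i], self.vals[i] = key, val

        def get(self, key):
            i = self._slot(key)
            return self.vals[i] if self.keys[i] == key else None

    h = FlatHash()
    h.put(42, 7)
    print(h.get(42))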

18:45-20:00 Walking Tour

Did you know that Madrid was a relatively small town, practically unknown outside of Spain before the year 1561?  The city’s fortunes changed that year when it burst upon the scene of European politics by becoming the permanent capital of Spain.  The dynasty at the head of this change was known as the Habsburgs, a family that ruled the country and much of the known world from the 16th to the 18th century, and who were referred to in Spain as the House of Austria.

The historic center of Madrid was built up predominantly during the reign of that same dynasty, and this fascinating walking tour of Madrid de los Austrias takes you through that area, giving you the best introduction to the Spanish capital.