EURO-PAR 2021: 27TH INTERNATIONAL EUROPEAN CONFERENCE ON PARALLEL AND DISTRIBUTED COMPUTING
PROGRAM FOR THURSDAY, SEPTEMBER 2ND

09:30-11:00 Session 16: Regular Papers 4: Scheduling & Load Balancing I
09:30
Collaborative GPU Preemption via Spatial Multitasking for Efficient GPU Sharing
PRESENTER: Zhuoran Ji

ABSTRACT. GPUs have been widely used in data centers and are often over-provisioned to satisfy the stringent latency targets of latency-sensitive (LS) jobs. The resulting GPU under-utilization provides a strong incentive to share GPUs between LS jobs and batch jobs. Preempting GPU jobs is costly due to their large contexts. Many novel GPU preemption techniques have been proposed, exhibiting different trade-offs between preemption latency and overhead. Prior works also propose collaborative methods, which intelligently select a preemption technique at preemption time. However, GPU kernels usually adopt code transformations to improve performance, which also affect the preemption costs. As kernel transformation is performed before launching, the preemption technique choices are also determined then; it is impractical to select a preemption technique arbitrarily at preemption time if code transformation is adopted. This paper presents CPSpatial, which combines GPU preemption techniques via GPU spatial multitasking. It proposes a preemption hierarchy and SM-prefetching to achieve both low latency and high throughput. Evaluations show that CPSpatial has zero preemption latency like traditional instant preemption techniques while at the same time achieving up to 1.43 times higher throughput. When dealing with sudden LS job workload increases, CPSpatial reduces the preemption latency by 87.3% compared with the state-of-the-art GPU context switching method.
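
As a rough illustration of the spatial-multitasking idea (the partition sizes, technique names, and costs below are hypothetical, not CPSpatial's actual design), one can think of SMs as grouped into partitions that are each prepared, at launch time, with a different preemption technique, so an arriving LS job preempts the lowest-latency partitions first:

```python
# Hypothetical sketch of a spatial preemption hierarchy (illustrative only):
# each SM partition is prepared with a different preemption technique that
# trades preemption latency against wasted batch-job work.
from dataclasses import dataclass

@dataclass
class Partition:
    sms: int                # number of SMs in this partition
    technique: str          # preemption technique chosen at launch time
    preempt_latency: float  # seconds until the partition is free
    lost_work: float        # fraction of batch-job work discarded

# Latency-ordered hierarchy: instant kill frees SMs immediately but wastes
# work; draining wastes nothing but frees SMs slowly.
hierarchy = [
    Partition(sms=16, technique="instant-kill", preempt_latency=0.0, lost_work=1.0),
    Partition(sms=32, technique="context-switch", preempt_latency=0.002, lost_work=0.0),
    Partition(sms=80, technique="drain", preempt_latency=0.050, lost_work=0.0),
]

def preempt_for(ls_job_sms_needed: int) -> list[Partition]:
    """Free SMs for an LS job, preempting the lowest-latency partitions first."""
    freed, chosen = 0, []
    for part in hierarchy:  # already ordered by preemption latency
        if freed >= ls_job_sms_needed:
            break
        chosen.append(part)
        freed += part.sms
    return chosen

if __name__ == "__main__":
    for p in preempt_for(40):
        print(f"preempt {p.sms} SMs via {p.technique} "
              f"(latency {p.preempt_latency * 1e3:.1f} ms)")
```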

09:50
A Fixed-Parameter Algorithm for Scheduling Unit Dependent Tasks with Unit Communication Delays
PRESENTER: Ning Tang

ABSTRACT. This paper considers the minimization of the makespan for a set of dependent tasks with unit duration and unit communication delays. Given an upper bound on the makespan, release dates and deadlines of the tasks can be computed and define time windows. We prove that our scheduling problem is fixed-parameter tractable; the parameter is the maximum number of tasks that are schedulable at the same time, considering time windows. A fixed-parameter algorithm based on a dynamic programming approach is developed and proved to solve this optimization problem. This is, as far as we know, the first fixed-parameter algorithm for a scheduling problem with communication delays.
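
A minimal sketch of how such time windows can be derived from a makespan bound (this is only the windowing step, not the paper's fixed-parameter dynamic program; the DAG and the conservative +2 bound per precedence are illustrative assumptions):

```python
# Sketch: derive release dates and deadlines for unit tasks with unit
# communication delays, given an assumed makespan upper bound.
from functools import lru_cache

# Hypothetical precedence DAG: task -> set of successors.
succ = {"a": {"c"}, "b": {"c"}, "c": {"d"}, "d": set()}
pred = {t: {u for u in succ if t in succ[u]} for t in succ}
MAKESPAN_BOUND = 6  # assumed upper bound on the makespan

@lru_cache(maxsize=None)
def release(t):
    # A task can start once every predecessor has finished. We use the
    # conservative +2 per hop (1 unit duration + 1 unit communication
    # delay); same-processor chains would only need +1.
    return max((release(p) + 2 for p in pred[t]), default=0)

@lru_cache(maxsize=None)
def deadline(t):
    # Latest finish time so that the longest outgoing chain still fits.
    return min((deadline(s) - 2 for s in succ[t]), default=MAKESPAN_BOUND)

for t in sorted(succ):
    print(t, "window:", (release(t), deadline(t)))
```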

10:10
Plan-based Job Scheduling for Supercomputers with Shared Burst Buffers
PRESENTER: Jan Kopański

ABSTRACT. The ever-increasing gap between compute and I/O performance in HPC platforms, together with the development of novel NVMe storage devices (NVRAM), led to the emergence of the burst buffer concept—an intermediate persistent storage layer logically positioned between random-access main memory and a parallel file system. Despite the development of real-world architectures as well as research concepts, resource and job management systems, such as Slurm, provide only marginal support for scheduling jobs with burst buffer requirements, in particular ignoring burst buffers when backfilling. We investigate the impact of burst buffer reservations on the overall efficiency of online job scheduling for common algorithms: First-Come-First-Served (FCFS) and Shortest-Job-First (SJF) EASY-backfilling. We evaluate the algorithms in a detailed simulation with I/O side effects. Our results indicate that the lack of burst buffer reservations in backfilling may significantly deteriorate scheduling. We also show that these algorithms can be easily extended to support burst buffers. Finally, we propose a burst-buffer–aware plan-based scheduling algorithm with simulated annealing optimisation, which improves the mean waiting time by over 20% and mean bounded slowdown by 27% compared to the burst-buffer–aware SJF-EASY-backfilling.
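
A minimal sketch of the core point, under a toy job model (field names are illustrative, not Slurm's, and the EASY condition is simplified): a backfilling check must test burst-buffer capacity in addition to free nodes, otherwise jobs like "post" below would be backfilled onto a machine that cannot serve their I/O:

```python
# Sketch: EASY-backfilling extended with a burst-buffer availability check.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    nodes: int
    bb_gb: int     # requested burst-buffer capacity (GB)
    walltime: int

def can_backfill(job, free_nodes, free_bb_gb, shadow_time, now):
    """The candidate must fit in *both* resources and must not delay the
    reservation held by the first queued job (the 'shadow time')."""
    fits = job.nodes <= free_nodes and job.bb_gb <= free_bb_gb
    return fits and now + job.walltime <= shadow_time

queue = [Job("sim", 64, 512, 120), Job("post", 8, 2048, 30), Job("tiny", 4, 0, 10)]
free_nodes, free_bb = 16, 1024
for j in queue[1:]:  # the head of the queue holds the reservation
    if can_backfill(j, free_nodes, free_bb, shadow_time=100, now=0):
        print("backfill", j.name)   # only "tiny" passes; "post" lacks BB space
```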

10:30
A log-linear (2+5/6)-approximation algorithm for parallel machine scheduling with a single orthogonal resource
PRESENTER: Adrian Naruszko

ABSTRACT. As the gap between compute and I/O performance tends to grow, modern High-Performance Computing (HPC) architectures include a new resource type: an intermediate persistent fast memory layer, called burst buffers. This is just one of many kinds of renewable resources which are orthogonal to the processors themselves, such as network bandwidth or software licenses. Ignoring orthogonal resources while making scheduling decisions just for processors may lead to unplanned delays of jobs whose resource requirements cannot be immediately satisfied. We focus on the classic problem of makespan minimization for parallel-machine scheduling of independent sequential jobs with additional requirements on the amount of a single renewable orthogonal resource. We present an easily implementable log-linear algorithm that we prove to be a (2+5/6)-approximation. In simulation experiments, we compare our algorithm to standard greedy list-scheduling heuristics and show that, compared to LPT, resource-based algorithms generate significantly shorter schedules.
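
For context, a sketch of the greedy list-scheduling baseline with a single orthogonal resource (illustrative only; the paper's (2+5/6)-approximation algorithm is different): each job, taken in LPT order, starts as soon as a machine and enough of the resource are simultaneously available:

```python
# Sketch: greedy LPT list scheduling with one renewable orthogonal resource.
import heapq

def list_schedule(jobs, m, resource_capacity):
    """jobs: list of (processing_time, resource_demand).
    Assumes each job's demand fits within the capacity."""
    jobs = sorted(jobs, reverse=True)   # LPT order
    machines = [0.0] * m                # next free time per machine
    running = []                        # min-heap of (finish_time, resource_used)
    free_res, makespan = resource_capacity, 0.0
    for p, r in jobs:
        t = min(machines)
        # Advance time until enough of the resource has been released.
        while r > free_res:
            finish, released = heapq.heappop(running)
            free_res += released
            t = max(t, finish)
        i = machines.index(min(machines))
        start = max(machines[i], t)
        machines[i] = start + p
        free_res -= r
        heapq.heappush(running, (start + p, r))
        makespan = max(makespan, start + p)
    return makespan

print(list_schedule([(4, 3), (3, 2), (2, 2), (2, 1)], m=2, resource_capacity=4))
```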

11:00-12:30 Session 17: Regular Papers 5: Scheduling & Load Balancing II
11:00
An MPI-Parallel Algorithm for Mapping Complex Networks onto Hierarchical Architectures
PRESENTER: Maria Predari

ABSTRACT. Processing massive application graphs on distributed memory systems requires mapping the graphs onto the system's processing elements (PEs). This task becomes all the more important when PEs have non-uniform communication costs or the input is highly irregular. Typically, mapping is addressed using partitioning, either in a two-step approach or in an integrated one. Parallel partitioning tools do exist; yet, corresponding mapping algorithms or their public implementations all have major sequential parts or other severe scaling limitations.

In this paper, we propose a parallel algorithm that maps graphs onto the PEs of a hierarchical system. Our solution integrates partitioning and mapping; it models the system hierarchy in a concise way as an implicit labeled tree. The vertices of the application graph are labeled as well, and these vertex labels induce the mapping. The mapping optimization follows the basic idea of parallel label propagation, but we tailor the gain computations of label changes to quickly account for the induced communication costs. Our MPI-based code is the first public implementation of a parallel graph mapping algorithm; to this end, we extend the partitioning library ParHIP. To evaluate our algorithm's implementation, we perform comparative experiments with complex networks in the million- and billion-scale range. In general, our mapping tool shows good scalability on up to a few thousand PEs. Compared to other MPI-based competitors, our algorithm achieves the best speed-to-quality trade-off, and our quality results are even better than those of non-parallel mapping tools.
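
A much-simplified sequential sketch of mapping by label propagation (balance constraints and the paper's parallelization are omitted; the two-level PE hierarchy is an assumption): each vertex repeatedly adopts the PE label that minimizes its hierarchy-weighted communication cost:

```python
# Sketch: label-propagation refinement with hierarchy-aware gain computation.
import random

def pe_distance(a, b, groups=(4,)):
    """Distance between PE labels in an implicit two-level hierarchy:
    0 if same PE, 1 within a node of groups[0] PEs, 2 across nodes."""
    if a == b:
        return 0
    return 1 if a // groups[0] == b // groups[0] else 2

def refine(adj, labels, rounds=5):
    # Balance constraints omitted for brevity; without them label
    # propagation may collapse all vertices onto one PE.
    random.seed(0)
    verts = list(adj)
    for _ in range(rounds):
        random.shuffle(verts)
        for v in verts:
            candidates = {labels[u] for u in adj[v]} | {labels[v]}
            cost = {c: sum(pe_distance(c, labels[u]) for u in adj[v])
                    for c in candidates}
            labels[v] = min(cost, key=cost.get)  # best-gain label change
    return labels

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(refine(adj, labels={0: 0, 1: 1, 2: 5, 3: 5}))
```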

11:20
Taming Tail Latency in Key-Value Stores: a Scheduling Perspective
PRESENTER: Anthony Dugois

ABSTRACT. Distributed key-value stores employ replication for high availability. Yet, they do not always efficiently take advantage of the availability of multiple replicas for each value, and read operations often exhibit high tail latencies. Various replica selection strategies have been proposed to address this problem, together with local request scheduling policies. It is difficult, however, to determine the absolute performance gain each of these strategies can achieve. We present a formal framework allowing the systematic study of request scheduling strategies in key-value stores. We contribute a definition of the optimization problem related to reducing tail latency in a replicated key-value store as a minimization problem with respect to the maximum weighted flow criterion. By using scheduling theory, we show the difficulty of this problem, and therefore the need to develop performance guarantees. We also study the behavior of heuristic methods using simulations, which highlight which properties are useful for limiting tail latency: for example, the EFT strategy, which uses the earliest available time of servers, exhibits a tail latency more than twice as low as that of state-of-the-art strategies and often matches the lower bound. Our study also illustrates the importance of considering other metrics, such as the stretch, to properly evaluate a replica selection strategy and a local execution policy.
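
A toy sketch of the EFT idea mentioned above (the server bookkeeping is illustrative): each read is routed to the replica of its key that can start serving it earliest:

```python
# Sketch: Earliest-Finish-Time (EFT) replica selection for reads.
def eft_select(replicas, available_at, now, service_time):
    """Pick the replica that can start (and hence finish) the request first."""
    best = min(replicas, key=lambda s: max(available_at[s], now))
    start = max(available_at[best], now)
    available_at[best] = start + service_time   # reserve the server
    return best, start + service_time           # chosen replica, finish time

available_at = {"s1": 4.0, "s2": 1.0, "s3": 2.5}  # earliest free time per server
for i, key_replicas in enumerate([["s1", "s2"], ["s2", "s3"], ["s1", "s3"]]):
    server, finish = eft_select(key_replicas, available_at,
                                now=float(i), service_time=1.0)
    print(f"request {i} -> {server}, finishes at {finish}")
```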

11:40
Pipelined Model Parallelism: Complexity Results and Memory Considerations
PRESENTER: Alena Shilova

ABSTRACT. The training phase of Deep Neural Networks has become an important source of computing resource usage, and the resulting volume of computation makes it crucial to perform it efficiently on parallel architectures. Data parallelism is the most widely used method, but it requires replicating the network weights on all devices and performing collective communications of the network weights. In this context, model parallelism is an attractive alternative, in which the different layers of the network are distributed over the computing devices. Indeed, it is expected to better distribute the weights (to cope with memory problems), and it eliminates the need for large collective communications, since only forward activations are communicated. However, to be efficient, it must be combined with a pipelined approach, which in turn induces new memory costs. In this paper, our goal is to formalize pipelined model parallelism as a scheduling problem, to establish its complexity, and to analyze the consequences of the assumptions that are typically made in practical solutions such as PipeDream.
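
A toy illustration of the trade-off being formalized (not the paper's algorithm; the greedy split below is generally suboptimal, which is precisely why the problem's complexity matters): contiguous pipeline stages should balance compute, while each stage must hold one activation set per in-flight micro-batch:

```python
# Sketch: splitting a chain of layers into contiguous pipeline stages.
def stage_cost(times, k):
    """Greedily split per-layer times into k contiguous stages; return the
    bottleneck stage time (the pipeline's steady-state period). Greedy is
    generally suboptimal; the optimal split is the scheduling problem."""
    target = sum(times) / k
    stages, acc = [[]], 0.0
    for t in times:
        if acc + t > target and len(stages) < k and stages[-1]:
            stages.append([])
            acc = 0.0
        stages[-1].append(t)
        acc += t
    return max(sum(s) for s in stages), stages

layer_times = [2, 3, 1, 4, 2, 2]   # per-layer forward times (illustrative)
period, stages = stage_cost(layer_times, k=3)
in_flight = 3                      # micro-batches simultaneously in the pipe
print("stages:", stages, "period:", period)
print("activation sets held per stage:", in_flight)  # the extra memory cost
```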

12:00
Enhancing Load-Balancing of MPI Applications with Workshare
PRESENTER: Julien Jaeger

ABSTRACT. Some high-performance parallel applications (e.g., simulation codes) are, by nature, prone to computational imbalance. With various elements, such as particles or multiple materials, evolving in a fixed space (with different boundary conditions), an MPI process can easily end up with more operations to perform than its neighbors. This computational imbalance causes performance loss. Load-balancing methods are used to limit such negative impacts. However, most load-balancing schemes rely on shared-memory models, and those handling MPI load-balancing use machinery too heavy for efficient intra-node load-balancing.

In this paper, we present the MPI Workshare concept. With MPI Workshare, we propose a programming interface based on directives, and the associated runtime, to leverage light MPI intra-node load-balancing. In this work, we focus on loop worksharing. The similarity of our directives with OpenMP ones makes our interface easy to understand and to use. We provide an implementation of both the runtime and compiler directive support. Experimental results on well-known mini-applications (MiniFE, LULESH) show that MPI Workshare succeeds in maintaining the same level of performance as well-balanced workloads even with high imbalance parameter values.

12:30-14:00 Lunch Break
14:00-15:30 Session 18: Regular Papers 6: Power & Performance Modeling
14:00
Trace-driven Workload Generation and Execution

ABSTRACT. Although major cloud providers have captured and published workload executions in the form of traces, it is not clear how to use them for workload generation on a wide range of existing platforms. A methodological challenge that remains is to generate and execute realistic datacenter workloads on any infrastructure, using information from available traces. In this paper, we propose Tracie, a methodology addressing this challenge, and introduce a pair of tools supporting its implementation. We present the methodology and all steps of workload generation: analysis of datacenter traces, extraction of parameters, application selection, and scaling of a workload to match the capabilities of the underlying infrastructure. Our evaluation validates that Tracie can generate executable workloads that closely resemble their trace-based counterparts. For validation, we correlate the system metrics recorded in a trace with those measured during the actual execution. We find that the average system metrics of synthetic workloads differ by at most 5% from those in the trace and that they are highly correlated, above 80% in all cases.
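
A minimal sketch of the scaling step, under assumed names and a simple proportional model (Tracie's actual interface and scaling rules may differ): core requests from the trace are rescaled to the target platform's capacity:

```python
# Sketch: rescale trace job sizes to a smaller target platform.
def scale_jobs(trace_jobs, trace_cores, target_cores):
    """trace_jobs: list of (arrival_s, requested_cores, runtime_s)."""
    factor = target_cores / trace_cores
    return [(arrival, max(1, round(cores * factor)), runtime)
            for arrival, cores, runtime in trace_jobs]

trace = [(0, 1000, 300), (60, 4000, 1200), (90, 250, 600)]
print(scale_jobs(trace, trace_cores=10_000, target_cores=512))
```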

14:20
Update on the Asymptotic Optimality of LPT
PRESENTER: Redouane Elghazi

ABSTRACT. When independent tasks are to be scheduled onto identical processors, the typical goal is to minimize the makespan. A simple and efficient heuristic consists in scheduling the task with the longest processing time first (the LPT heuristic) and planning its execution as soon as possible. While the performance of LPT has already been largely studied, in particular its asymptotic performance, we revisit known results and propose a novel analysis for the case of tasks generated through uniform integer compositions. We also perform extensive simulations to empirically assess the asymptotic performance of LPT. Results demonstrate that the absolute error rapidly tends to zero for several distributions of task costs, including ones studied by theoretical models and realistic distributions coming from benchmarks.
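
For reference, the LPT heuristic itself fits in a few lines (a standard textbook rendering, not the paper's analysis code):

```python
# LPT: sort tasks by non-increasing processing time, then always place the
# next task on the machine that becomes free first.
import heapq

def lpt_makespan(tasks, m):
    loads = [0.0] * m                  # min-heap of machine loads
    heapq.heapify(loads)
    for t in sorted(tasks, reverse=True):
        least = heapq.heappop(loads)   # earliest-available machine
        heapq.heappush(loads, least + t)
    return max(loads)

print(lpt_makespan([7, 6, 5, 5, 4, 3], m=3))   # -> 10.0 (optimal here)
```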

14:40
E2EWatch: An End-to-end Anomaly Diagnosis Framework for Production HPC Systems
PRESENTER: Burak Aksar

ABSTRACT. In today's High-Performance Computing (HPC) systems, application performance variations are among the most vital challenges as they adversely affect system efficiency, application performance, and cost. System administrators need to identify the anomalies that are responsible for performance variation and take mitigating actions. One can perform manual root-cause analysis on telemetry data collected by HPC monitoring infrastructures to analyze performance variations. However, manual analysis methods are time-intensive and limited in impact due to the increasing complexity of HPC systems and telemetry volumes of terabytes per day. State-of-the-art approaches use machine learning-based methods to diagnose performance anomalies automatically. This paper deploys an end-to-end machine learning framework that diagnoses performance anomalies on compute nodes of a 1488-node production HPC system. We demonstrate job- and node-level anomaly diagnosis results with the Grafana frontend interface at runtime. Furthermore, we discuss the challenges and design decisions behind the deployment.
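
A generic sketch of this kind of supervised diagnosis pipeline (the features, labels, and model choice are assumptions, not E2EWatch's actual design): telemetry windows per node are summarized into features and classified into anomaly types:

```python
# Sketch: supervised anomaly diagnosis on summarized node telemetry.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Toy telemetry: rows = node windows, cols = summarized metrics
# (e.g. mean CPU util, memory bandwidth, IPC, I/O wait).
X = rng.normal(size=(200, 4))
y = rng.integers(0, 3, size=200)   # 0 = healthy, 1/2 = anomaly types (toy labels)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X[:150], y[:150])
print("held-out accuracy:", model.score(X[150:], y[150:]))
```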

15:00
Sustaining Performance While Reducing Energy Consumption: A Control Theory Approach
PRESENTER: Raphaël Bleuse

ABSTRACT. Production high-performance computing systems continue to grow in complexity and size. As applications struggle to make use of increasingly heterogeneous compute nodes, maintaining high efficiency (performance per watt) for the whole platform becomes a challenge. Alongside the growing complexity of scientific workloads, this extreme heterogeneity is also an opportunity: as applications dynamically undergo variations in workload, due to phases or data/compute movement between devices, one can dynamically adjust power across compute elements to save energy without impacting performance. With an aim toward an autonomous and dynamic power management strategy for current and future HPC architectures, this paper explores the use of control theory for the design of a dynamic power regulation method. Structured as a feedback loop, our approach, which is novel in computing resource management, consists of periodically monitoring application progress and choosing at runtime a suitable power cap for processors. Thanks to a preliminary offline identification process, we derive a model of the dynamics of the system and design a proportional-integral (PI) controller. We evaluate our approach on top of an existing resource management framework, the Argo Node Resource Manager, deployed on several clusters of Grid'5000, using a standard memory-bound HPC benchmark.
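
A textbook PI controller of the kind described, with illustrative gains and a crude stand-in for the node's power-to-progress dynamics (the paper's identification and actuation are more involved):

```python
# Sketch: PI feedback loop that picks a power cap from measured progress.
def make_pi(kp, ki, cap_min, cap_max):
    integral = 0.0
    def step(setpoint, measured, dt):
        nonlocal integral
        error = setpoint - measured
        integral += error * dt
        cap = kp * error + ki * integral
        return min(cap_max, max(cap_min, cap))   # clamp to valid power caps
    return step

controller = make_pi(kp=50.0, ki=20.0, cap_min=40.0, cap_max=200.0)
progress = 0.0
for _ in range(10):
    cap = controller(setpoint=1.0, measured=progress, dt=1.0)
    progress = 0.01 * cap      # crude stand-in for the real node dynamics
    print(f"cap {cap:6.1f} W -> progress {progress:.3f}")
```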

15:30-16:30 Session 19: Keynote II
15:30
HPC for Bioinformatics: The Genetic Sequence Comparison Quest for Performance

ABSTRACT. Genetic Sequence Comparison is an important operation in Bioinformatics, executed routinely worldwide. Two relevant algorithms that compare genetic sequences are the Smith-Waterman (SW) algorithm and Sankoff's algorithm. The Smith-Waterman algorithm is widely used for pairwise comparisons and obtains the optimal result in quadratic time, O(n²), where n is the length of the sequences. The Sankoff algorithm is used to structurally align two sequences and computes the optimal result in O(n⁴) time. In order to accelerate these algorithms, many parallel strategies have been proposed in the literature. However, the alignment of whole chromosomes with hundreds of millions of characters with the SW algorithm is still a very challenging task, which requires extraordinary computing power. Likewise, obtaining the structural alignment of two sequences with the Sankoff algorithm requires parallel approaches. In this talk, we first present our MASA-CUDAlign tool, which was used to pairwise align real DNA sequences with up to 249 million characters on a cluster with 512 GPUs, achieving the best performance in the literature in 2021. We will present and discuss the innovative features of the most recent version of MASA-CUDAlign: parallelogram execution, incremental speculation, block pruning and score-share balancing strategies. We will also show performance and energy results in homogeneous and heterogeneous GPU clusters. Then, we will discuss the design of our CUDA-Sankoff tool and its innovative strategy to exploit multi-level wavefront parallelism. At the end, we will show a COVID-19 case study, where we use the tools discussed in this talk to compare SARS-CoV-2 genetic sequences, considering the reference sequence and its variants.
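
For reference, the Smith-Waterman recurrence in its plain quadratic form (standard textbook scoring values; MASA-CUDAlign's optimizations lie far beyond this sketch):

```python
# Smith-Waterman local alignment score, O(n*m) time and space: each cell
# extends an alignment diagonally, vertically, or horizontally, and the
# score never drops below 0 (which is what makes the alignment local).
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best   # score of the best local alignment

print(smith_waterman("GGTTGACTA", "TGTTACGG"))
```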

16:30-18:00 Session 20: Regular Papers 7: Theory & Algorithms
16:30
Algorithm design for Tensor Units

ABSTRACT. To respond to the intense computational load of deep neural networks, a plethora of domain-specific architectures have been introduced, such as Google Tensor Processing Units and NVIDIA Tensor Cores. A common feature of these architectures is a hardware circuit for efficiently computing a dense matrix multiplication of a given small size. In order to broaden the class of algorithms that exploit these systems, we propose a computational model, named the TCU model, that captures the ability to natively multiply small matrices. We then use the TCU model for designing fast algorithms for several problems, including matrix operations (dense and sparse multiplication, Gaussian Elimination), graph algorithms (transitive closure, all pairs shortest distances), Discrete Fourier Transform, stencil computations, integer multiplication, and polynomial evaluation. We finally highlight a relation between the TCU model and the external memory model.
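
A small sketch of the primitive the TCU model abstracts (NumPy stands in for the hardware): a large matrix product decomposed into dense s-by-s tile products, each of which a tensor unit would perform natively:

```python
# Sketch: tiled matrix multiplication built from small dense tile products.
import numpy as np

def tiled_matmul(A, B, s):
    n = A.shape[0]                 # assumes square matrices, n a multiple of s
    C = np.zeros((n, n))
    for i in range(0, n, s):
        for j in range(0, n, s):
            for k in range(0, n, s):
                # Each s x s product below is one "native" tensor-unit op.
                C[i:i+s, j:j+s] += A[i:i+s, k:k+s] @ B[k:k+s, j:j+s]
    return C

A, B = np.random.rand(8, 8), np.random.rand(8, 8)
assert np.allclose(tiled_matmul(A, B, s=4), A @ B)
```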

16:50
TSLQueue: An Efficient Lock-free Design for Priority Queues
PRESENTER: Adones Rukundo

ABSTRACT. Priority queues are fundamental abstract data structures, often used to manage limited resources in parallel systems. Typical parallel priority queue implementations are based on heaps or skip lists. In recent literature, skip lists have been shown to be the most efficient design choice for implementing priority queues. Though numerous intricate implementations of skip-list-based queues have been proposed, their performance is constrained by the high number of global atomic updates per operation and the high memory utilization, both logarithmically proportional to the number of items in the skip list. In this paper, we propose an alternative approach for designing priority queues that significantly improves memory efficiency, throughput, and latency by reducing the number of global atomic updates and the memory consumption compared to skip-list-based queues. To achieve this, our new design combines two structures: a search tree and a linked list, forming what we call a Tree Search List Queue (TSLQueue). The leaves of the tree are linked together to form a linked list of leaves with a head as an access point. Analytically, an insert or delete operation on a skip-list-based queue performs O(log n) global atomic updates in the worst case, where n is the size of the queue, whereas TSLQueue insert and delete operations require only 2 or 3 global atomic updates, respectively. When it comes to memory consumption, TSLQueue exhibits O(n) memory consumption, compared to O(n log n) in the worst case for a skip-list-based queue, making TSLQueue more memory-efficient than a skip-list-based queue of the same size. We experimentally show that TSLQueue significantly outperforms the best previously proposed skip-list-based queues with respect to throughput and latency in general, exhibiting more than 50% improvement in the case of asymmetric workloads.
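
A sequential toy of the structural idea only (the actual TSLQueue is lock-free and far more intricate): here `bisect` plays the role of the tree search that locates a leaf, and the sorted list plays the role of the chained leaves whose head is the minimum:

```python
# Sketch: keys live at the "leaves", kept in order with a head access point;
# delete-min follows the head, insert descends a search structure.
import bisect

class ToyTSLQueue:
    def __init__(self):
        self.leaves = []                  # ascending keys; head = minimum
    def insert(self, key):
        bisect.insort(self.leaves, key)   # a tree descent would find this spot
    def delete_min(self):
        return self.leaves.pop(0)         # follow the head of the leaf list

q = ToyTSLQueue()
for k in [5, 1, 9, 3]:
    q.insert(k)
print([q.delete_min() for _ in range(4)])   # -> [1, 3, 5, 9]
```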

17:10
A Scalable Approximation Algorithm for Weighted Longest Common Subsequence
PRESENTER: Thomas Lavastida

ABSTRACT. This work introduces novel parallel methods for weighted longest common subsequence (WLCS) and its generalization, all-substrings WLCS. Previous work developed efficient algorithms for these problems via Monge matrix multiplication, which is a limiting factor for further improvement. Diverging from these approaches, we relax the algorithm's optimality guarantee in a controlled way, using a different, natural dynamic program which can be sketched and solved in a divide-and-conquer manner that is efficient to parallelize.

Additionally, to compute the base case of our algorithm, we develop a novel and efficient method for all-substrings WLCS inspired by previous work on unweighted all-substrings LCS, exploiting the typically small range of weights.

Our method fits in most parallel models of computation, including the PRAM and the BSP model. To the best of our knowledge, this is the fastest $(1-\epsilon)$-approximation algorithm for all-substrings WLCS and WLCS in BSP. Further, it is the asymptotically fastest parallel algorithm for weighted LCS as the number of processors increases.
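
For reference, the sequential weighted-LCS dynamic program that such parallel algorithms approximate (the weight function below is an illustrative assumption):

```python
# Standard WLCS dynamic program: w scores a pairing of two characters,
# and each cell takes the best of skipping either character or pairing both.
def wlcs(x, y, w):
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = max(D[i-1][j], D[i][j-1],
                          D[i-1][j-1] + w(x[i-1], y[j-1]))
    return D[n][m]

# Toy weights: exact matches score 2, case-insensitive matches 1.
weight = lambda a, b: 2 if a == b else (1 if a.lower() == b.lower() else 0)
print(wlcs("AGcT", "aGCt", weight))
```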

17:30
G-Morph: Induced Subgraph Isomorphism Search of Labeled Graphs on a GPU
PRESENTER: Bryan Rowe

ABSTRACT. Induced subgraph isomorphism search finds the occurrences of embedded subgraphs within a single large data graph that are strictly isomorphic to a given query graph. Labeled graphs contain object types and are a primary input source to this core search problem, which applies to systems like graph databases for answering queries. In recent years, researchers have employed GPU-parallel solutions to this problem to help accelerate runtimes, utilizing the filtering-and-joining framework, which first filters vertices that cannot be part of the solution and then joins partial solutions with candidate edges until full isomorphisms are determined. However, the performance of current GPU-based solutions is hindered by limited filtering effectiveness and the presence of extraneous computations. This paper presents G-Morph, a fast GPU-based induced subgraph isomorphism search system for labeled graphs. Our filtering-and-joining system parallelizes both phases, upfront eliminating irrelevant vertices via a novel space-efficient vertex signature hashing strategy and efficiently joining partial solutions through the use of a novel sliding window algorithm (slide-join) that provides overflow protection for larger inputs. Together these techniques greatly reduce extraneous computations by reducing Cartesian products and improving edge verification, while supporting large scan operations (split-scan). G-Morph outperforms the state-of-the-art GPU-based GSI and CPU-based VF3 systems on labeled real-world graphs, achieving speedups of up to 15.78x and 43.56x respectively.
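
A sequential toy of the filtering-and-joining framework itself (greatly simplified; G-Morph's contributions are the GPU-parallel signature hashing and the slide-join/split-scan): filtering keeps label-compatible candidates, and joining grows partial mappings while enforcing induced-edge consistency:

```python
# Sketch: filter-and-join search for induced subgraph isomorphisms.
def filter_candidates(q_labels, d_labels):
    # Phase 1: a data vertex is a candidate for a query vertex only if the
    # labels agree (real systems also hash neighborhood signatures).
    return {q: {v for v, l in d_labels.items() if l == ql}
            for q, ql in q_labels.items()}

def join_all(order, cands, q_adj, d_adj):
    # Phase 2: extend partial mappings one query vertex at a time.
    partials = [dict()]
    for q in order:
        partials = [
            {**pm, q: v}
            for pm in partials
            for v in cands[q]
            if v not in pm.values()
            # Induced: every mapped pair must agree on edge vs. non-edge.
            and all((v in d_adj[pm[p]]) == (q in q_adj[p]) for p in pm)
        ]
    return partials

q_labels = {0: "A", 1: "B", 2: "A"}
q_adj = {0: {1}, 1: {0, 2}, 2: {1}}               # query path A-B-A
d_labels = {0: "A", 1: "B", 2: "A", 3: "B"}
d_adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}    # data path A-B-A-B
print(join_all([0, 1, 2], filter_candidates(q_labels, d_labels), q_adj, d_adj))
```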