11:00 | ALONA: Automatic Loop Nest Approximation with Reconstruction and Space Pruning
PRESENTER: Daniel Maier
ABSTRACT. Approximate computing comprises a large variety of techniques that trade the accuracy of an application's output for other metrics such as computing time or energy cost. Many existing approximation techniques focus on loops, for example loop perforation, which skips iterations for faster, approximated computation. This paper introduces ALONA, a novel approach for automatic loop nest approximation based on polyhedral compilation. ALONA's compilation framework applies a sequence of loop approximation transformations, generalizes state-of-the-art perforation techniques, and introduces new multi-dimensional approximation schemes. The framework includes a reconstruction technique that significantly improves the accuracy of the approximations, and a transformation space pruning method based on Barvinok's counting that enables efficient automatic tuning. Evaluated on a collection of nineteen PolyBench applications and three Paraprox applications, ALONA discovers new approximations that surpass state-of-the-art techniques in both approximation accuracy and performance.
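The core transformation pair, perforation plus reconstruction, fits in a few lines. The sketch below is illustrative only (ALONA applies such transformations to whole loop nests via polyhedral compilation); the loop body and the linear-interpolation scheme are assumptions, not the paper's implementation.

```cpp
#include <vector>
#include <cstdio>

// Minimal sketch of 1-D loop perforation with reconstruction.
int main() {
    const int n = 16;
    std::vector<double> out(n, 0.0);

    // Perforated loop: compute only every other iteration (rate 2).
    for (int i = 0; i < n; i += 2)
        out[i] = i * i;          // stand-in for the real loop body

    // Reconstruction: fill skipped iterations by linear interpolation,
    // recovering much of the accuracy lost to perforation.
    for (int i = 1; i < n - 1; i += 2)
        out[i] = 0.5 * (out[i - 1] + out[i + 1]);
    out[n - 1] = out[n - 2];     // boundary: copy nearest neighbour

    for (int i = 0; i < n; ++i)
        std::printf("out[%d] = %g\n", i, out[i]);
    return 0;
}
```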
11:20 | Automatic low-overhead load-imbalance detection in MPI applications
PRESENTER: Peter Arzt
ABSTRACT. Load imbalances are a major source of efficiency loss in highly parallel applications, so identifying them is highly relevant in performance analysis and tuning. We present a low-overhead approach that automatically identifies load-imbalanced regions and filters out irrelevant ones, based on new selection heuristics in PIRA, our tool for automatic instrumentation refinement for the Score-P measurement system. For the LULESH mini-app and the Ice-sheet and Sea-level System Model simulation package, the approach correctly identifies existing load imbalances while keeping runtime overhead below 10% for all but one input. Moreover, the generated traces are suitable for Scalasca's automatic trace analysis.
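For readers unfamiliar with the underlying measurement, the sketch below shows the classic per-region imbalance metric (max/avg - 1) computed over MPI ranks. It is a generic illustration, not PIRA's instrumentation; the artificial workload is an assumption.

```cpp
#include <mpi.h>
#include <cstdio>

// Minimal sketch: time a code region on every rank and compute
// the imbalance metric max/avg - 1 (0 means perfectly balanced).
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double t0 = MPI_Wtime();
    // Instrumented region: ranks deliberately do unequal work here.
    for (volatile long i = 0; i < 1000000L * (rank + 1); ++i) {}
    double local = MPI_Wtime() - t0;

    double tmax, tsum;
    MPI_Reduce(&local, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&local, &tsum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        std::printf("imbalance (max/avg - 1): %.2f\n",
                    tmax / (tsum / size) - 1.0);
    MPI_Finalize();
    return 0;
}
```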
11:40 | Smart Distributed DataSets for Stream Processing
PRESENTER: Miguel Coimbra
ABSTRACT. An ever-increasing number of devices is getting connected to the internet, and the volume of data that needs to be processed grows with it; the Internet-of-Things (IoT) is a good example. Stream processing was created precisely to deal with high volumes of data, and it has proven itself time and time again as a successful approach. However, the scalability and performance of this type of system still need improvement. This work presents SDD4Streaming, a solution aimed at these specific shortcomings of stream processing engines. Although current engines already implement scalability mechanisms, experience has shown they are not enough and that further improvements are needed. SDD4Streaming extends an existing engine to improve resource usage, so that applications use the resources they need to process data in a timely manner, thus increasing performance and helping other applications running in parallel on the same system.
14:00 | Colony: Parallel Functions as a Service on the Cloud-Edge Continuum
PRESENTER: Francesc-Josep Lordan Gomis
ABSTRACT. Sales of smart devices are growing rapidly; however, the computing power available on such devices is not sufficient to provide services of adequate quality. This paper proposes a solution that organizes the devices within the Cloud-Edge Continuum so that each one, as an autonomous individual (Agent), processes events/data on its embedded compute resources while offering its computing capacity to the rest of the infrastructure in a Function-as-a-Service manner. By transparently converting the logic of the computation into a task-based workflow, agents host the execution of the method while offloading part of the workload onto other agents to improve overall service performance. Thus, developers can efficiently code applications for any of the three envisaged computing scenarios (sense-process-actuate, streaming, and batch processing) throughout the whole Cloud-Edge Continuum without struggling with different frameworks specifically designed for each of them.
14:20 | Horizontal Scaling in Cloud using Contextual Bandits
PRESENTER: David Delande
ABSTRACT. One characteristic of the Cloud is elasticity: the ability to adapt the resources allocated to applications as needed at runtime. This capacity relies on scaling and scheduling. This article studies online horizontal scaling. The aim is to dynamically determine application deployment parameters and adjust them to meet a quality-of-service level without any manual parameter tuning. We focus on CaaS (container-based) environments and propose an algorithm based on contextual bandits (HSLinUCB). We evaluate our proposal on a simulated platform and on a real Kubernetes platform, comparing it against several baselines: a threshold-based auto-scaler, Q-Learning, and Deep Q-Learning. The results show that HSLinUCB performs very well compared to the other baselines, even when used without any training period.
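LinUCB-family algorithms keep, per action (here: scale down / keep / scale up), a linear model of the reward given a context vector and pick the action with the highest upper confidence bound. The sketch below is a generic disjoint LinUCB using Sherman-Morrison updates; it is not the paper's HSLinUCB, and the context features and toy reward are assumptions.

```cpp
#include <array>
#include <cmath>
#include <cstdio>
#include <random>

constexpr int D = 3;                 // context dimension, e.g. {1, load, idle}
constexpr int ARMS = 3;              // scale down / keep / scale up
using Vec = std::array<double, D>;
using Mat = std::array<std::array<double, D>, D>;

struct Arm {
    Mat Ainv{};                      // inverse of A = I + sum(x x^T)
    Vec b{};                         // sum of reward * x
    Arm() { for (int i = 0; i < D; ++i) Ainv[i][i] = 1.0; }

    Vec mul(const Vec &v) const {    // Ainv * v
        Vec y{};
        for (int i = 0; i < D; ++i)
            for (int j = 0; j < D; ++j) y[i] += Ainv[i][j] * v[j];
        return y;
    }
    double ucb(const Vec &x, double alpha) const {
        Vec Ax = mul(x), theta = mul(b);     // theta = Ainv * b
        double mean = 0, var = 0;
        for (int i = 0; i < D; ++i) { mean += theta[i] * x[i]; var += x[i] * Ax[i]; }
        return mean + alpha * std::sqrt(var);
    }
    void update(const Vec &x, double reward) {
        Vec Ax = mul(x);                     // Sherman-Morrison rank-1 update
        double denom = 1.0;
        for (int i = 0; i < D; ++i) denom += x[i] * Ax[i];
        for (int i = 0; i < D; ++i)
            for (int j = 0; j < D; ++j) Ainv[i][j] -= Ax[i] * Ax[j] / denom;
        for (int i = 0; i < D; ++i) b[i] += reward * x[i];
    }
};

int main() {
    std::mt19937 rng(42);
    std::array<Arm, ARMS> arms;
    for (int t = 0; t < 1000; ++t) {
        double load = std::uniform_real_distribution<>(0, 1)(rng);
        Vec x{1.0, load, 1.0 - load};        // toy context
        int best = 0; double bestU = -1e300;
        for (int a = 0; a < ARMS; ++a) {
            double u = arms[a].ucb(x, 0.5);
            if (u > bestU) { bestU = u; best = a; }
        }
        // Toy reward: scaling up (arm 2) pays off only under high load.
        double reward = ((load > 0.5) == (best == 2)) ? 1.0 : 0.0;
        arms[best].update(x, reward);
    }
    std::printf("trained on 1000 scaling decisions\n");
    return 0;
}
```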
14:40 | Geo-Distribute Cloud Applications at the Edge
PRESENTER: Marie Delavergne
ABSTRACT. With the arrival of edge computing, a new challenge arises for cloud applications: how to benefit from geo-distribution (locality) while dealing with the inherent constraints of wide-area network links? The accepted approach is to modify cloud applications by entangling geo-distribution aspects in the business logic using distributed data stores. However, this makes the code intricate and contradicts the software engineering principle of externalizing concerns. We propose a different approach that relies on the modularity of microservices applications: (i) one instance of the application is deployed at each edge location, making the system more robust to network partitions (local requests can still be satisfied), and (ii) collaboration between instances can be programmed outside of the application in a generic manner thanks to a service mesh. We validate the relevance of our proposal on a real use case: geo-distributing OpenStack, a modular application composed of 13 million lines of code and more than 150 services.
15:00 | A Fault Tolerant and Deadline Constrained Sequence Alignment Application on Cloud-based Spot GPU Instances
PRESENTER: Rafaela Brum
ABSTRACT. Pairwise sequence alignment is an important application for identifying regions of similarity that may indicate the relationship between two biological sequences. It is a computationally intensive task that usually requires parallel processing to deliver realistic execution times. This work introduces a new framework that runs a deadline-constrained sequence alignment application, MASA-CUDAlign, on cloud Spot GPU instances. Although much cheaper than on-demand instances, Spot GPUs can be revoked at any time, so the framework is also able to restart MASA-CUDAlign from a checkpoint in a new instance when a revocation occurs. We evaluate the proposed framework on five pairs of DNA sequences and different AWS instances. Our results show that the framework reduces financial costs compared to on-demand GPU instances while meeting deadlines even in scenarios with several instance revocations.
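The recovery pattern the abstract relies on is periodic checkpointing plus resume-on-restart. The sketch below shows that pattern in its simplest form; it is a generic illustration (the file name, interval, and loop body are assumptions), not the paper's framework, which checkpoints MASA-CUDAlign itself and reacts to AWS Spot revocation notices.

```cpp
#include <cstdio>
#include <fstream>
#include <string>

// Minimal sketch of checkpoint/restart for a long-running computation.
int main() {
    const std::string ckpt = "align.ckpt";
    long start = 0;

    // On (re)start, resume from the last checkpoint if one exists.
    if (std::ifstream in{ckpt}) in >> start;

    const long total = 1000000, interval = 100000;
    for (long i = start; i < total; ++i) {
        // ... one slice of the alignment computation ...
        if ((i + 1) % interval == 0) {
            std::ofstream out(ckpt, std::ios::trunc);
            out << (i + 1);       // persist progress; after a revocation,
        }                         // a fresh Spot instance restarts here
    }
    std::printf("resumed at iteration %ld, now complete\n", start);
    return 0;
}
```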
16:30 | Particle-In-Cell Simulation using Asynchronous Tasking
PRESENTER: Nicolas Guidotti
ABSTRACT. Recently, task-based programming models have emerged as a prominent alternative among shared-memory parallel paradigms. Inherently asynchronous, these models provide native support for dynamic load balancing and incorporate data-flow concepts to selectively synchronise the tasks. However, tasking models are yet to be widely adopted by the HPC community, and their effective advantages when applied to non-trivial, real-world HPC applications are still not well understood. In this paper, we study the parallelization of a production electromagnetic particle-in-cell (EM-PIC) code for kinetic plasma simulations, exploring different strategies using asynchronous task-based models. Our fully asynchronous implementation not only significantly outperforms a conventional, synchronous approach but also achieves near-perfect scaling on 48 cores.
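The data-flow synchronisation the abstract mentions replaces global barriers with per-datum dependencies. The sketch below illustrates the idea with OpenMP task dependencies; the region decomposition and task bodies are assumptions, not the paper's EM-PIC implementation.

```cpp
#include <cstdio>
#include <omp.h>

// Minimal sketch of asynchronous tasking with data-flow dependencies:
// per-region "advance" and "deposit" tasks synchronise only through the
// data they touch, instead of a global barrier between phases.
int main() {
    const int R = 8;                        // number of spatial regions
    double field[R] = {0}, current[R] = {0};

#pragma omp parallel
#pragma omp single
    for (int r = 0; r < R; ++r) {
#pragma omp task depend(in: field[r]) depend(out: current[r])
        {
            // advance particles in region r using field[r] ...
            current[r] = field[r] + r;
        }
#pragma omp task depend(inout: current[r])
        {
            // deposit current for region r; ordered only after the
            // matching advance task, independent of other regions
            current[r] *= 0.5;
        }
    }   // tasks complete at the barrier closing single/parallel

    std::printf("current[0] = %g\n", current[0]);
    return 0;
}
```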
16:50 | Efficient GPU Computation using Task Graph Parallelism
PRESENTER: Dian-Lun Lin
ABSTRACT. CUDA recently introduced a new task graph programming model, CUDA Graph, to enable efficient launch and execution of GPU work. Users describe a GPU workload as a task graph rather than as aggregated GPU operations, allowing the CUDA runtime to perform whole-graph optimization and significantly reduce kernel call overheads. However, programming CUDA graphs is extremely challenging: users must either explicitly construct a graph with verbose parameter settings or implicitly capture a graph, which requires complex dependency and concurrency management using streams and events. To overcome this challenge, we introduce a lightweight task graph programming framework that enables efficient GPU computation using CUDA Graph. Users can focus on high-level development of dependent GPU operations while leaving all the intricate management of stream concurrency and event dependency to our optimization algorithm. We have evaluated our framework and demonstrated its promising performance on both micro-benchmarks and a large-scale machine learning workload. The results also show that our optimization algorithm achieves performance comparable to an optimally constructed graph while consuming far fewer GPU resources.
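To make the "capture, instantiate, relaunch" workflow concrete, here is a minimal host-only stream-capture example using the CUDA runtime API (a memset stands in for real kernels, and cudaGraphInstantiate is shown with its CUDA 12 signature; older toolkits take extra error-buffer arguments). This illustrates plain CUDA Graph usage, not the paper's framework.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t n = 1 << 20;
    float *d; cudaMalloc(&d, n * sizeof(float));
    cudaStream_t s; cudaStreamCreate(&s);

    // Capture a (trivial) workload into a graph instead of executing it.
    cudaGraph_t graph;
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    cudaMemsetAsync(d, 0, n * sizeof(float), s);   // stands in for kernels
    cudaStreamEndCapture(s, &graph);

    // Instantiate once, then relaunch many times with low per-launch cost.
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, 0);         // CUDA 12 signature
    for (int i = 0; i < 100; ++i)
        cudaGraphLaunch(exec, s);
    cudaStreamSynchronize(s);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(s);
    cudaFree(d);
    std::printf("graph launched 100 times\n");
    return 0;
}
```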
17:10 | Accelerating Graph Applications Using Phased Transactional Memory
PRESENTER: Catalina Munoz Morales
ABSTRACT. Due to their fine-grained operations and low conflict rates, graph processing algorithms expose a large amount of parallelism that has been extensively exploited by various parallelization frameworks. Transactional Memory (TM) is a programming model that uses an optimistic concurrency control mechanism to improve the performance of irregular applications, making it a strong candidate for extracting parallelism from graph-based programs. Although fast Hardware TM (HTM) instructions are now available in the ISA extensions of some major processor architectures (e.g., Intel and ARM), balancing the usage of Software TM (STM) and HTM to compensate for capacity and conflict aborts is still a challenging task. This paper presents a Phased TM implementation for graph applications, called Graph-Oriented Transactional Memory (GoTM). It uses a three-state (HTM, STM, GLOCK) concurrency control automaton that leverages both HTM and STM implementations to speed up graph applications. Experimental results using seven well-known graph programs and real-life workloads show that GoTM can outperform other Phased TM systems as well as lock-based concurrency mechanisms such as the one in Galois, a state-of-the-art framework for graph computations.
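The HTM-to-GLOCK transition at the heart of phased designs can be sketched with Intel RTM intrinsics: try the update transactionally a few times, then fall back to a global lock (the STM phase GoTM also supports is omitted here). A minimal sketch, assuming an RTM-capable x86 CPU and compilation with -mrtm; this is not GoTM's automaton.

```cpp
#include <immintrin.h>   // Intel RTM intrinsics (_xbegin/_xend/_xabort)
#include <atomic>
#include <cstdio>

std::atomic<bool> glock{false};   // global fallback lock ("GLOCK" phase)

void critical_update(long &counter) {
    for (int attempt = 0; attempt < 3; ++attempt) {
        unsigned st = _xbegin();
        if (st == _XBEGIN_STARTED) {
            if (glock.load(std::memory_order_relaxed))
                _xabort(0xff);        // a lock holder is active: abort
            ++counter;                // transactional update
            _xend();
            return;
        }
        // transaction aborted (conflict/capacity): retry a few times
    }
    // Fallback: serialize through the global lock.
    while (glock.exchange(true, std::memory_order_acquire)) {}
    ++counter;
    glock.store(false, std::memory_order_release);
}

int main() {
    long c = 0;
    critical_update(c);
    std::printf("%ld\n", c);
    return 0;
}
```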
17:30 | Towards High Performance Resilience using Performance Portable Abstractions
PRESENTER: Nicolas Morales
ABSTRACT. In the drive towards Exascale, the extreme heterogeneity of supercomputers at all levels places a major development burden on HPC applications. To this end, performance-portable abstractions such as those advocated by Kokkos, RAJA, and HPX are becoming increasingly popular. At the same time, the unprecedented scalability requirements of such heterogeneous components mean higher failure rates, motivating the need for resilience in systems and applications. Unfortunately, state-of-the-art resilience techniques based on checkpoint/restart lag behind performance portability efforts: users still need to capture consistent states manually, which introduces the need for fine-tuning and customization. In this paper we aim to close this gap by introducing a set of abstractions that make it easier for application developers to reason about resilience. To this end, we extend the existing abstractions proposed by performance portability efforts towards resilience. By marking critical data structures that need to be checkpointed, one can enable an optimized runtime to automate checkpoint/restart using high-performance, scalable asynchronous techniques. We illustrate the feasibility of our proposal with a prototype that combines the Kokkos runtime (HPC performance portability) with the VELOC runtime (large-scale, low-overhead checkpoint/restart). Our experimental results show negligible performance overhead compared with a manually tuned implementation of checkpoint/restart while requiring minimal changes in the application code.
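The programming model described ("mark critical data, let the runtime checkpoint it") might look roughly as follows. The Kokkos calls are real; the ckpt::protect/ckpt::commit functions are hypothetical stand-ins for the paper's resilience abstractions and the VELOC integration, not an actual API.

```cpp
#include <Kokkos_Core.hpp>

// Hypothetical resilience API: stubs standing in for a runtime that
// registers protected views and checkpoints them asynchronously.
namespace ckpt {
    template <class View> void protect(const View &) { /* register allocation */ }
    void commit(int version) { (void)version; /* async checkpoint of protected views */ }
}

int main(int argc, char *argv[]) {
    Kokkos::initialize(argc, argv);
    {
        Kokkos::View<double *> state("state", 1 << 20);
        ckpt::protect(state);                    // mark as critical data

        for (int step = 0; step < 100; ++step) {
            Kokkos::parallel_for("update", state.extent(0),
                KOKKOS_LAMBDA(const int i) { state(i) += 1.0; });
            Kokkos::fence();                     // consistent state reached
            if (step % 10 == 0)
                ckpt::commit(step);              // runtime checkpoints views
        }
    }
    Kokkos::finalize();
    return 0;
}
```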
18:00 | Closing of Euro-Par 2021
18:15 | Presentation of Euro-Par 2022