09:30-11:00 Session 21: EU Projects: European Initiative Projects Towards Exascale Computing
European Processor Initiative and EU Projects Towards Exascale Computing

ABSTRACT. The European Processor Initiative (EPI) is a project currently implemented under the first stage of the Framework Partnership Agreement signed by the Consortium with the European Commission, whose aim is to design and implement a roadmap for a new family of low-power European processors for extreme scale computing, high-performance Big-Data and a range of emerging applications.

The project intends to deliver a high-performance, low-power processor, implementing vector instructions and specific accelerators with high bandwidth memory access. The EPI processor will also meet high security and safety requirements. This will be achieved through intensive use of simulation, development of a complete software stack and tape-out in the most advanced semiconductor process node. SGA1 will provide a competitive chip that can effectively address the requirements of the HPC, AI, automotive and trusted IT infrastructure markets.

EPEEC: Europe toward High Coding Productivity for Exascale

ABSTRACT. EPEEC’s main goal is to develop and deploy a production-ready parallel programming environment that turns upcoming overwhelmingly-heterogeneous exascale supercomputers into manageable platforms for domain application developers. The consortium will significantly advance and integrate existing state-of-the-art components based on European technology (programming models, runtime systems, and tools) with key features enabling 3 overarching objectives: high coding productivity, high performance, and energy awareness.

Project Objectives: 

1. High coding productivity: A set of tools that can exploit the full power of the emerging hardware by turning them into manageable platforms for domain application developers;

2. High Performance: A programming environment with all relevant functionality at TRL8 for current pre-exascale systems and TRL4 for exascale platforms;

3. Energy Awareness: Efficient and energy-aware management of hardware heterogeneity, in terms of both processing elements and memory subsystems, further favouring coding productivity.

SparCity: Optimizing Sparse Computation and Graphs for Novel Parallel Architectures

ABSTRACT. The SparCity project aims at creating a supercomputing framework that will provide efficient algorithms and coherent tools specifically designed for maximising the performance and energy efficiency of sparse computations on emerging HPC systems, while also opening up new usage areas for sparse computations in data analytics and deep learning.

Project Objectives: 

1. Develop a comprehensive application and data characterization mechanism and orchestrate an advanced and synergistic software optimization process, based on the state-of-the-art analytical and machine-learning-based performance and energy models;

2. Develop advanced node-level static and dynamic code optimizations designed for massive and heterogeneous parallel architectures with complex memory hierarchy and exploit mixed-precision opportunities for sparse computation;

3. Devise topology-aware partitioning algorithms and optimizations to minimize the communication overhead and boost the efficiency of system-level parallelism;

4. Create digital SuperTwins of supercomputers to evaluate and simulate what-if hardware scenarios and to gather real-time performance and energy intel from node- and system-level components for application optimization on the current and future hardware;

5. Demonstrate the effectiveness and usability of the SparCity framework by enhancing the computing scale and energy efficiency of four challenging real-life applications, namely, computational cardiology, social network analysis, bioinformatics and computer vision applications;

6. Deliver a robust, well-supported and documented SparCity framework into the hands of computational scientists, data analysts, and deep learning end-users from industry and academia.

11:00-12:30 Session 22: Regular Papers 8: Tools & Environments
ALONA: Automatic Loop Nest Approximation with Reconstruction and Space Pruning
PRESENTER: Daniel Maier

ABSTRACT. Approximate computing comprises a large variety of techniques that trade the accuracy of an application's output for other metrics such as computing time or energy cost. Many existing approximation techniques focus on loops; one example is loop perforation, which skips iterations for faster, approximated computation. This paper introduces ALONA, a novel approach for automatic loop nest approximation based on polyhedral compilation. ALONA’s compilation framework applies a sequence of loop approximation transformations, generalizes state-of-the-art perforation techniques, and introduces new multi-dimensional approximation schemes. The framework includes a reconstruction technique that significantly improves the accuracy of the approximations, and a transformation space pruning method based on Barvinok’s counting that allows for efficient automatic tuning. Evaluated on a collection of nineteen PolyBench applications and three Paraprox applications, ALONA discovers new approximations that are better than state-of-the-art techniques in both approximation accuracy and performance.
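
A minimal Python sketch of plain loop perforation, the baseline technique the abstract refers to. It is not ALONA: the function names, the `stride` parameter, and the rescaling step used as a stand-in for reconstruction are all assumptions made for illustration.

```python
def exact_sum(values):
    """Reference loop: visits every element."""
    total = 0.0
    for i in range(len(values)):
        total += values[i]
    return total


def perforated_sum(values, stride=2):
    """Perforated loop: executes only every `stride`-th iteration."""
    total = 0.0
    executed = 0
    for i in range(0, len(values), stride):
        total += values[i]
        executed += 1
    if executed == 0:
        return 0.0
    # Naive stand-in for reconstruction (an assumption, not ALONA's method):
    # rescale the partial sum to account for the skipped iterations.
    return total * len(values) / executed


# Hypothetical input data, chosen only to make the error easy to inspect.
data = [float(i % 10) for i in range(1000)]
exact = exact_sum(data)
approx = perforated_sum(data, stride=2)
rel_error = abs(exact - approx) / exact
print(f"exact={exact} approx={approx} relative error={rel_error:.3f}")
```

Skipping every other iteration halves the work, and the resulting error depends entirely on how regular the data is. That trade-off between speed and accuracy is the space an automatic tuner has to search.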

Automatic low-overhead load-imbalance detection in MPI applications

ABSTRACT. Load imbalances are a major reason for efficiency loss in highly parallel applications. Hence, their identification is of high relevance in performance analysis and tuning. We present a low-overhead approach to automatically identify load-imbalanced regions and filter out irrelevant ones based on new selection heuristics in our PIRA tool for automatic instrumentation refinement for the Score-P measurement system. For the LULESH mini-app as well as the Ice-sheet and Sea-level System Model simulation package we, thus, correctly identify existing load imbalances while maintaining a runtime overhead of less than 10% for all but one input. Moreover, the traces generated are suitable for Scalasca's automatic trace analysis.
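
To make the idea of identifying imbalanced regions and filtering out irrelevant ones concrete, here is a small Python sketch. It does not reproduce PIRA's selection heuristics or the Score-P API: the imbalance metric (max / mean - 1), the thresholds, and the region names and timings are assumptions chosen for illustration.

```python
def imbalance(times):
    """Relative imbalance of one region: max / mean - 1 (0.0 means balanced)."""
    mean = sum(times) / len(times)
    return max(times) / mean - 1.0 if mean > 0 else 0.0


def flag_regions(region_times, threshold=0.25, min_share=0.05):
    """Return regions that are imbalanced and large enough to matter."""
    # Approximate the critical path by the slowest rank in each region.
    total = sum(max(times) for times in region_times.values())
    if total == 0:
        return []
    flagged = []
    for name, times in region_times.items():
        share = max(times) / total  # fraction of runtime spent in this region
        if share >= min_share and imbalance(times) > threshold:
            flagged.append(name)
    return flagged


# Hypothetical per-rank timings in seconds for four MPI ranks.
profile = {
    "solve": [10.0, 10.0, 10.0, 18.0],
    "io": [0.1, 0.1, 0.1, 0.4],
    "halo_exchange": [5.0, 5.0, 5.0, 5.0],
}
flagged = flag_regions(profile)
print(flagged)
```

In this toy profile the `io` region is strongly imbalanced but too short to be worth instrumenting, so only `solve` is reported.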

Smart Distributed DataSets for Stream Processing
PRESENTER: Miguel Coimbra

ABSTRACT. An ever-increasing number of devices are being connected to the internet, and the volume of data that needs to be processed is growing with them; the Internet of Things (IoT) is a good example of this. Stream processing was created for the sole purpose of dealing with high volumes of data, and it has proven itself time and time again as a successful approach. However, there is still a need to further improve scalability and performance on this type of system.

This work presents SDD4Streaming, a solution aimed at solving these specific issues of stream processing engines. Although current engines already implement scalability solutions, time has shown those are not enough and that further improvements are needed. SDD4Streaming employs an extension of a system to improve resource usage, so that applications use the resources they need to process data in a timely manner, thus increasing performance and helping other applications that are running in parallel in the same system.

12:30-14:00 Lunch Break
14:00-15:30 Session 23: Regular Papers 9: Cloud & Edge Computing
Colony: Parallel Functions as a Service on the Cloud-Edge Continuum

ABSTRACT. Smart devices markets are rapidly increasing their sales figures; however, the computing power available on such devices is not sufficient to provide good-enough-quality services.

This paper proposes a solution to organize the devices within the Cloud-Edge Continuum in such a way that each one, as an autonomous individual -Agent-, processes events/data on its embedded compute resources while offering its computing capacity to the rest of the infrastructure in a Function-as-a-Service manner. By transparently converting the logic of the computation into a task-based workflow, agents host the execution of the method while offloading part of the workload onto other agents to improve the overall service performance. Thus, developers can efficiently code applications of any of the three envisaged computing scenarios -- sense-process-actuate, streaming and batch processing -- throughout the whole Cloud-Edge Continuum without struggling with different frameworks specifically designed for each of them.

Horizontal Scaling in Cloud using Contextual Bandits
PRESENTER: David Delande

ABSTRACT. One characteristic of the Cloud is elasticity: it provides the ability to adapt resources allocated to applications as needed at runtime. This capacity relies on scaling and scheduling. In this article, online horizontal scaling is studied. The aim is to dynamically determine application deployment parameters and to adjust them in order to respect a quality-of-service level without any human parameter tuning. We focus on CaaS (container-based) environments and propose an algorithm based on contextual bandits (HSLinUCB). We evaluate our proposal on a simulated platform and on a real Kubernetes platform. We compare it with several baselines: threshold-based auto-scaler, Q-Learning, and Deep Q-Learning. The results show that HSLinUCB gives very good results compared to other baselines, even when used without any training period.
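
A rough Python sketch of the generic LinUCB contextual-bandit rule applied to a scaling decision. The abstract does not describe HSLinUCB's context features, reward, or action set, so the ones used here (a bias term plus a normalised load, three scaling actions, a 0/1 reward) are hypothetical.

```python
import math


class LinUCBArm:
    """One action of a disjoint LinUCB bandit, stored as A^-1 and b."""

    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha
        self.A_inv = [[1.0 if i == j else 0.0 for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim

    @staticmethod
    def _matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]

    def ucb(self, x):
        """Estimated reward for context x plus an exploration bonus."""
        theta = self._matvec(self.A_inv, self.b)
        Ax = self._matvec(self.A_inv, x)
        mean = sum(t * xi for t, xi in zip(theta, x))
        variance = max(sum(a * xi for a, xi in zip(Ax, x)), 0.0)
        return mean + self.alpha * math.sqrt(variance)

    def update(self, x, reward):
        # Sherman-Morrison rank-one update of (A + x x^T)^-1.
        Ax = self._matvec(self.A_inv, x)
        denom = 1.0 + sum(a * xi for a, xi in zip(Ax, x))
        for i in range(len(x)):
            for j in range(len(x)):
                self.A_inv[i][j] -= Ax[i] * Ax[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]


def choose(arms, x):
    """Pick the action with the highest upper confidence bound."""
    return max(arms, key=lambda name: arms[name].ucb(x))


ACTIONS = ["scale_down", "hold", "scale_up"]  # hypothetical action set
arms = {name: LinUCBArm(dim=2) for name in ACTIONS}
high_load = [1.0, 1.0]  # hypothetical context: [bias, normalised load]
for _ in range(5):
    for name in ACTIONS:
        # Hypothetical reward: only scaling up keeps the service level.
        arms[name].update(high_load, 1.0 if name == "scale_up" else 0.0)
best = choose(arms, high_load)
print(best)
```

Because each arm keeps its own linear model, the bandit can start choosing actions from the first observation, which is one reason such methods can work without a separate training period.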

Geo-Distribute Cloud Applications at the Edge
PRESENTER: Marie Delavergne

ABSTRACT. With the arrival of edge computing, a new challenge arises for cloud applications: how to benefit from geo-distribution (locality) while dealing with the inherent constraints of wide-area network links? The accepted approach consists in modifying cloud applications by entangling geo-distribution aspects in the business logic using distributed data stores. However, this makes the code intricate and contradicts the software engineering principle of externalizing concerns. We propose a different approach that relies on the modularity property of microservices applications: (i) one instance of an application is deployed at each edge location, making the system more robust to network partitions (local requests can still be satisfied), and (ii) collaboration between instances can be programmed outside of the application in a generic manner thanks to a service mesh. We validate the relevance of our proposal on a real use case: geo-distributing OpenStack, a modular application composed of 13 million lines of code and more than 150 services.

A Fault Tolerant and Deadline Constrained Sequence Alignment Application on Cloud-based Spot GPU Instances
PRESENTER: Rafaela Brum

ABSTRACT. Pairwise sequence alignment is an important application to identify regions of similarity that may indicate the relationship between two biological sequences. This is a computationally intensive task that usually requires parallel processing to provide realistic execution times. This work introduces a new framework for a deadline-constrained application of sequence alignment, called MASA-CUDAlign, that exploits cloud computing with Spot GPU instances. Although much cheaper than on-demand instances, Spot GPUs can be revoked at any time, so the framework is also able to restart MASA-CUDAlign from a checkpoint in a new instance when a revocation occurs. We evaluate the proposed framework considering five pairs of DNA sequences and different AWS instances. Our results show that the framework reduces financial costs when compared to on-demand GPU instances while meeting the deadlines even in scenarios with several instance revocations.
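
The restart-from-checkpoint behaviour can be sketched in a few lines of Python. This is not the MASA-CUDAlign framework or any AWS API: the JSON checkpoint file, the checkpoint interval, the stand-in workload, and the simulated revocation are all assumptions for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write atomically so a revocation mid-write cannot corrupt the file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(total_steps, path, interval=10, revoke_at=None):
    """Run (or resume) the job; return the result, or None if 'revoked'."""
    state = {"step": 0, "acc": 0}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)  # resume from the last checkpoint
    while state["step"] < total_steps:
        if revoke_at is not None and state["step"] == revoke_at:
            return None  # simulated revocation: in-memory progress is lost
        state["acc"] += state["step"]  # stand-in for one unit of real work
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["acc"]


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "job.ckpt.json")
    first = run(50, ckpt, revoke_at=25)  # first instance is revoked at step 25
    second = run(50, ckpt)               # new instance resumes from step 20
print(first, second)
```

The work done between the last checkpoint and the revocation (steps 20 to 24 here) is repeated, so the checkpoint interval trades checkpointing overhead against recomputation, which matters when a deadline has to be met.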

15:30-16:30 Session 24: Keynote III
Knowledge Graphs, Graph AI, and the Need for High-performance Graph Computing

ABSTRACT. Knowledge Graphs now power many applications across diverse industries such as FinTech, Pharma and Manufacturing. Data volumes are growing at a staggering rate, and graphs with hundreds of billions of edges are not uncommon. Computations on such data sets include querying, analytics, and pattern mining, and there is growing interest in using machine learning to perform inference on large graphs. In many applications, it is necessary to combine these operations seamlessly to extract actionable intelligence as quickly as possible. Katana Graph is a start-up based in Austin and the Bay Area that is building a scale-out platform for seamless, high-performance computing on such graph data sets. I will describe the key features of the Katana Graph Engine that enable high performance, some important use cases for this technology from Katana's customers, and the main lessons I have learned from doing a startup after a career in academia.

16:30-18:00 Session 25: Regular Papers 10: Programming & Languages
Particle-In-Cell Simulation using Asynchronous Tasking
PRESENTER: Nicolas Guidotti

ABSTRACT. Recently, task-based programming models have emerged as a prominent alternative among shared-memory parallel paradigms. Inherently asynchronous, these models provide native support for dynamic load balancing and incorporate data flow concepts to selectively synchronise the tasks. However, tasking models are yet to be widely adopted by the HPC community and their effective advantages when applied to non-trivial, real-world HPC applications are still not well comprehended. In this paper, we study the parallelization of a production electromagnetic particle-in-cell (EM-PIC) code for kinetic plasma simulations exploring different strategies using asynchronous task-based models. Our fully asynchronous implementation not only significantly outperforms a conventional, synchronous approach but also achieves near perfect scaling for 48 cores.

Efficient GPU Computation using Task Graph Parallelism

ABSTRACT. CUDA recently introduced a new task graph programming model, CUDA Graph, to enable efficient launch and execution of GPU work. Users describe a GPU workload in a task graph rather than aggregated GPU operations, allowing the CUDA runtime to perform whole-graph optimization and significantly reduce the kernel call overheads. However, programming CUDA graphs is extremely challenging. Users need to explicitly construct a graph with verbose parameter settings or implicitly capture a graph that requires complex dependency and concurrency management using streams and events. To overcome this challenge, we introduce a lightweight task graph programming framework to enable efficient GPU computation using CUDA Graph. Users can focus on high-level development of dependent GPU operations while leaving all the intricate management of stream concurrency and event dependency to our optimization algorithm. We have evaluated our framework and demonstrated its promising performance on both micro-benchmarks and a large-scale machine learning workload. The results also show that our optimization algorithm achieves performance comparable to an optimally constructed graph and consumes far fewer GPU resources.

Accelerating Graph Applications Using Phased Transactional Memory

ABSTRACT. Due to their fine-grained operations and low conflict rates, graph processing algorithms expose a large amount of parallelism that has been extensively exploited by various parallelization frameworks. Transactional Memory (TM) is a programming model that uses an optimistic concurrency control mechanism to improve the performance of irregular applications, making it a perfect candidate to extract parallelism from graph-based programs. Although fast Hardware TM (HTM) instructions are now available in the ISA extensions of some major processor architectures (e.g., Intel and ARM), balancing the usage of Software TM (STM) and HTM to compensate for capacity and conflict aborts is still a challenging task. This paper presents a Phased TM implementation for graph applications, called Graph-Oriented Transactional Memory (GoTM). It uses a three-state (HTM, STM, GLOCK) concurrency control automaton that leverages both HTM and STM implementations to speed up graph applications. Experimental results using seven well-known graph programs and real-life workloads show that GoTM can outperform other Phased TM systems and lock-based concurrency mechanisms such as the one present in Galois, a state-of-the-art framework for graph computations.

Towards High Performance Resilience using Performance Portable Abstractions
PRESENTER: Nicolas Morales

ABSTRACT. In the drive towards Exascale, the extreme heterogeneity of supercomputers at all levels places a major development burden on HPC applications. To this end, performance portable abstractions such as those advocated by Kokkos, RAJA and HPX are becoming increasingly popular. At the same time, the unprecedented scalability requirements of such heterogeneous components mean higher failure rates, motivating the need for resilience in systems and applications. Unfortunately, state-of-the-art resilience techniques based on checkpoint/restart are lagging behind performance portability efforts: users still need to capture consistent states manually, which introduces the need for fine-tuning and customization. In this paper we aim to close this gap by introducing a set of abstractions that make it easier for application developers to reason about resilience. To this end, we extend the existing abstractions proposed by performance portability efforts towards resilience. By marking critical data structures that need to be checkpointed, one can enable an optimized runtime to automate checkpoint-restart using high-performance and scalable asynchronous techniques. We illustrate the feasibility of our proposal using a prototype that combines the Kokkos runtime (HPC performance portability) with the VELOC runtime (large-scale low-overhead checkpoint-restart). Our experimental results show negligible performance overhead compared with a manually tuned implementation of checkpoint-restart while requiring minimal changes in the application code.

18:00-18:30 Session 26: Closing Session
Closing of the Euro-Par 2021
Presenting the Euro-Par 2022