KEYNOTE: "Accelerating AI at the Edge: The Power of Efficient Hardware-Software Co-Design "
José Cano
University of Glasgow
09:00 | Efficient quantum computing to solve case studies in science and engineering PRESENTER: Laura María Donaire ABSTRACT. Quantum computing emerges as a pivotal solution to classical computing limitations in the post-Moore era, offering superior capabilities through qubits' unique properties of superposition and entanglement. We are currently in the Noisy Intermediate-Scale Quantum (NISQ) era, where quantum computing faces challenges such as limited resources and fault-tolerant circuit implementation. Scarce development resources hinder quantum algorithm and circuit progress, necessitating optimized compilers and methodologies. Presently, two dominant paradigms, general-purpose quantum computing and adiabatic computing, drive quantum advancements, with notable platforms including IBM's Qiskit and D-Wave. The ongoing research focuses on optimizing quantum circuits, particularly through novel gate implementations like Peres, TR, and GN gates, aimed at reducing T-count and T-depth. Preliminary results demonstrate promising reductions in these metrics, enhancing computational efficiency while adapting to the constraints of real quantum platforms. These endeavors aim to foster a repository of quantum routines, facilitating broader community access to quantum computational resources. |
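As background for the T-count metric mentioned in the abstract, the following is a minimal sketch (not the authors' code) of how a Toffoli-style gate can be decomposed into a Clifford+T basis with Qiskit and its T gates counted; the basis-gate set chosen here is an illustrative assumption.

```python
# Minimal sketch: measuring the T-count of a Toffoli decomposition with Qiskit.
# The basis-gate set below is an illustrative assumption, not the paper's setup.
from qiskit import QuantumCircuit, transpile

qc = QuantumCircuit(3)
qc.ccx(0, 1, 2)  # Toffoli, a building block of Peres/TR-style gates

# Decompose into a Clifford+T basis so T gates become explicit.
decomposed = transpile(qc, basis_gates=["h", "t", "tdg", "cx"])

ops = decomposed.count_ops()
t_count = ops.get("t", 0) + ops.get("tdg", 0)
print(f"T-count of the decomposition: {t_count}")
```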
09:15 | PRESENTER: Arun Thangamani ABSTRACT. The state-of-the-art source-to-source polyhedral schedulers annotate loops that can be vectorized with directives, which are merely recommendations to the compiler. However, standard compilers' auto-vectorizers may fail to vectorize them because of the complexity of the loop structure or of nested statements in the restructured code. The Polygeist compilation framework can generate polyhedral-optimized (tiled and parallel loops) MLIR code, but it neither annotates the loops with vectorization directives nor auto-generates the vectorized code. In this paper, we describe a proposal to extend Polygeist to generate OpenMP SIMD MLIR code for vector loops. We also want to further extend the code generation process to support GPU MLIR code, thereby targeting accelerated architectures. |
09:30 | Mapping code on Coarse Grained Reconfigurable Arrays using a SAT solver PRESENTER: Cristian Tirelli ABSTRACT. Emerging low-powered architectures like Coarse-Grain Reconfigurable Arrays (CGRAs) are becoming more common. Often included as co-processors, they are used to accelerate compute-intensive workloads like loops. The speedup obtained is defined by the hardware design of the accelerator and by the quality of the compilation. State of the art (SoA) compilation techniques leverage modulo scheduling to minimize the Iteration Interval (II), exploit the architecture parallelism and, consequently, reduce the execution time of the accelerated workload. In our work, we focus on improving the compilation process by finding the lowest II for any given topology through a satisfiability (SAT) formulation of the mapping problem. We introduce a novel schedule, called Kernel Mobility Schedule, to encode all the possible mappings for a given Data Flow Graph (DFG) and for a given II. The schedule is used together with the CGRA architectural information to generate all the constraints necessary to find a valid mapping. Experimental results demonstrate that our method not only reduces compilation time on average but also achieves higher quality mappings compared to existing SoA techniques. |
09:45 | Dynamic Load Balancing for Non-Spatial Agent-Based Models PRESENTER: Cristina Peralta Quesada ABSTRACT. Distributed agent-based modeling (ABM) simulations are often deployed in high-performance environments to emulate complex systems. In non-spatial ABM simulations, agent relationships are represented by a graph structure. The distribution of this graph among processing elements (PEs) significantly impacts the performance of the simulation. Moreover, none of the existing distributed ABM frameworks provide dynamic load balancing for non-spatial ABM simulations based on complex networks. This paper proposes a novel hierarchical approach for dynamic load balancing in non-spatial ABM simulations using multilevel graph partitioning. As a first step toward our goal, we implement two distributed algorithms based on the Label Propagation Algorithm (LPA) to be used for the coarsening of the simulation graph. The effectiveness of these approaches is briefly evaluated through performance and quality metrics. |
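For readers unfamiliar with the Label Propagation Algorithm used for coarsening, here is a minimal sequential Python sketch of the idea (illustrative only; the paper proposes distributed variants).

```python
# Generic, sequential Label Propagation sketch (illustrative only; the paper
# proposes distributed variants for coarsening the simulation graph).
from collections import Counter
import random

def label_propagation(adj, max_iters=10, seed=0):
    """adj: dict mapping node -> iterable of neighbour nodes."""
    rng = random.Random(seed)
    labels = {v: v for v in adj}          # start with one label per node
    nodes = list(adj)
    for _ in range(max_iters):
        rng.shuffle(nodes)
        changed = False
        for v in nodes:
            if not adj[v]:
                continue
            counts = Counter(labels[u] for u in adj[v])
            best = max(counts.items(), key=lambda kv: kv[1])[0]
            if best != labels[v]:
                labels[v] = best
                changed = True
        if not changed:
            break
    return labels  # nodes sharing a label are merged into one coarse vertex

# Tiny example: two triangles joined by a single edge.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(label_propagation(adj))
```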
09:00 | Scientific Workflows in the Continuum Era |
More information on the tutorial program is available at https://www.danysoft.com/workshop-intel-danysoft-euro-par-2024/.
10:30 | PRESENTER: Cristian Campos ABSTRACT. One of the most time-consuming kernels of an epileptic seizure detection app is the computation of the Dynamic Time Warping (DTW) Distance Matrix. This kernel is a good candidate for heterogeneous CPU/GPU/FPGA execution. In this paper, we explore the design space of heterogeneous CPU, GPU, and FPGA implementations of this kernel. We start by optimizing the CPU implementation of the DTW Distance Matrix computation leveraging the latest C++26 SIMD library and compare it with the SYCL implementation for CPU that also exploits the SIMD units. Next, we take advantage of the portability of SYCL to run the code on an on-chip GPU (iGPU) as well as on a discrete NVIDIA GPU (dGPU). Finally, we also present the SYCL implementation of the kernel on an Intel Stratix 10 MX FPGA. Our evaluations demonstrate that SYCL seems well suited to exploit the available SIMD capabilities of modern CPU cores, and also shows promising results for the accelerating devices considered in this work. |
11:00 | Challenging Portability Paradigms: FPGA Acceleration Using SYCL and OpenCL PRESENTER: Manuel de Castro ABSTRACT. As the interest in FPGA-based accelerators for HPC applications increases, new challenges also arise, especially concerning different programming and portability issues. This paper aims to provide a snapshot of the current state of the FPGA tooling and its problems. To do so, we evaluate the performance portability of two frameworks for developing FPGA solutions for HPC (SYCL and OpenCL) when using them to port a highly-parallel application to FPGAs, using both ND-range and single-task type of kernels. The developer’s general recommendation when using FPGAs is to develop single-task kernels for them, as they are commonly regarded as more suited for such hardware. However, we discovered that, when using high-level approaches such as OpenCL and SYCL to program a highly-parallel application with no FPGA-tailored optimizations, ND-range kernels significantly outperform single-task codes. Specifically, while SYCL struggles to produce efficient FPGA implementations of applications described as single-task codes, its performance excels with ND-range kernels, a result that was unexpectedly favorable. |
11:30 | Exploring the Limits of Cross-Platform Sparse Tensor Processing PRESENTER: Leonel Sousa ABSTRACT. Sparse tensors are the natural way to store and represent multi-dimensional data, but ensuring their efficient processing is an important open challenge. Some tensor methods, such as MTTKRP and TTM, are major performance bottlenecks for commonly used algorithms in many research areas, especially when considering that most real-world data is highly sparse. State-of-the-art optimisations tend to focus on single-device vendor-specific implementations, which are not applicable to modern heterogeneous systems with several compute devices of different architectures, such as multi-core CPUs, GPUs and FPGAs. To close this gap, in this work, novel portable and highly data-parallel software approaches and specialized FPGA architectures are proposed for the most prominent sparse tensor methods (TTM and MTTKRP), which allow for their efficient cross-platform SYCL-based processing in modern heterogeneous systems at different granularity levels. We also conduct an in-depth analytical and experimental characterization of the processing upper-bounds that these sparse tensor methods can achieve in multi-core CPU, GPU, heterogeneous and FPGA-based processing platforms from different vendors, achieving speedups of up to 6x for TTM and 7x for MTTKRP, when compared to the state-of-the-art approaches, and with the advantage of not being device- nor vendor-specific. |
12:00 | Containerization for Heterogeneous and Hybrid Parallelism PRESENTER: Daniel Suárez ABSTRACT. This paper investigates the viability and efficiency of parameter search for machine learning (ML) pipelines through containerization in diverse computational environments. We leverage parallelization by executing each test in the grid search as an independent container, allowing for scalable and efficient parameter tuning. Our study encompasses a detailed comparison of energy efficiency, power consumption, and execution times between two distinct setups. The first setup comprises a cluster of eight Raspberry Pi Compute Module 4 units integrated into two Turing Pi boards, each housing four nodes, all managed within a single Kubernetes cluster. This configuration showcases the potential of cost-effective, low-power devices in executing complex ML workflows. The second setup involves a traditional high-performance computing environment with multi-core CPUs and GPUs, where some parameter tests are offloaded to the CPU and others to the GPU, enabling a form of hybrid computing. By juxtaposing these two configurations, we aim to highlight the practical implications of containerized hybrid parallelism, particularly in terms of resource utilization and performance metrics. Our findings reveal insights into the trade-offs and benefits of using lightweight, distributed clusters versus more conventional, centralized computing resources for ML parameter optimization. This study underscores the potential of containerization in enhancing the flexibility and efficiency of ML workflows across varying hardware architectures. |
10:30 | Towards the quantification of performance isolation in HPC PRESENTER: Simon Volpert ABSTRACT. Performance isolation is widely applied in compute-heavy and cluster-oriented infrastructures to reduce the impact of the noisy neighbor effect. However, the degree of isolation is widely unclear and relies on operational experience. Cluster operators have strong incentives to get detailed insights into the degree of isolation, as they need to guarantee specific SLAs or need to make sure workloads cannot degrade each other. This work presents a method and framework from previous works to determine the isolation capability of isolation approaches in a deterministic and comparable way. These insights can be used for capacity planning, workload consolidation to reduce total cost of ownership, or for the implementation of business models such as cloud computing. We further show first measurement results and discuss whether performance isolation is applicable to the domain of high-performance computing. |
10:45 | Evolving Deep Learning techniques for disease diagnosis support using medical imaging. PRESENTER: Lara Visuña ABSTRACT. In recent years, Clinical Decision Support Systems (CDSSs) have been developed to support the complex decisions that take place in clinical practice. Despite their rapid evolution, they face many challenges such as speed, interoperability, and precision requirements. Our research explores a collection of tools and technologies to improve the current Deep Learning CDSSs. Our work is focused on medical imaging for pulmonary diseases as a case study. This paper discusses the methodology and techniques implemented during the PhD, such as the database creation, training strategies, and disease prediction with Deep Learning ensembles. All these techniques represent an improvement of the current systems and are highly generalizable to other imaging techniques and diseases. |
11:00 | On the building of computing continuum systems using serverless abstractions PRESENTER: Dante Sánchez-Gallegos ABSTRACT. The computing continuum is an emerging paradigm where data is processed from the edge to the cloud, passing through the fog. Organizations distribute multiple applications in this paradigm to process large workloads and produce helpful information. Nevertheless, building and managing these systems are complex tasks, requiring the orchestration of heterogeneous data through different computational sites. This paper presents a method to enable organizations to build computing continuum systems using serverless abstractions, allowing applications to be deployed through different infrastructures and their interconnection using generic data channels. This is performed through an architecture divided into four layers: processing, endpoints, data, and control. We conducted a case study based on medical data management to evaluate this method. The experimental evaluation shows the efficiency of our method in comparison with different state-of-the-art tools. |
11:15 | On Selection Functions in Adaptive Routing PRESENTER: Alejandro Cano Cos ABSTRACT. This work constructs traffic patterns that challenge the selection function used in adaptive routing algorithms. By focusing on buffer occupancy, the primary metric in these selection functions, the study demonstrates that buffer occupancy alone is an inadequate criterion for adaptive mechanisms that determine the choice between minimal and non-minimal routing, such as UGAL. This work underscores the need for more robust criteria in the design of adaptive routing algorithms. |
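For context, a toy Python sketch of a UGAL-style selection function driven by buffer occupancy follows; the threshold and hop weighting are illustrative assumptions, not the paper's model.

```python
# Toy UGAL-style selection (illustrative simplification, not the paper's model):
# pick the minimal path unless its occupancy-weighted cost exceeds that of a
# randomly chosen non-minimal (Valiant) path.
def select_route(q_min, hops_min, q_nonmin, hops_nonmin, threshold=2):
    """q_*: output-buffer occupancy of the candidate ports; hops_*: path lengths."""
    if q_min * hops_min <= q_nonmin * hops_nonmin + threshold:
        return "minimal"
    return "non-minimal"

# An adversarial pattern can keep the minimal buffer just around the switching
# point, which is the kind of blind spot the paper constructs traffic for.
print(select_route(q_min=10, hops_min=2, q_nonmin=3, hops_nonmin=4))
```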
11:30 | Applying Machine Learning for characterizing social networks Agent-based models PRESENTER: Haoyuan Li ABSTRACT. Social media networks are increasingly crucial in our lives, demanding deeper exploration of their dynamics. With billions of users and constant updates, modeling their complexity is challenging. Agent-based modeling (ABM) is a common approach to understanding social network communities, enabling individual behavior definition and system-level simulation. ABM is a potent tool for testing algorithmic impacts on user behavior. Leveraging ABM requires robust data processing and storage capabilities, where High Performance Computing (HPC) excels, efficiently handling complex computations. Machine Learning (ML) methods analyze vast social media data, shedding light on user behaviors, preferences, and trends. Our proposal integrates ML to characterize user attributes and develop a comprehensive user model for ABM simulations in social networks on HPC systems. |
11:45 | Enhancing machine learning dashboards using one-shot neural architecture search ABSTRACT. Artificial Neural Networks (ANNs) have revolutionized numerous fields by mimicking the human brain's ability to learn and solve complex problems. This work proposes a new machine learning dashboard that integrates Neural Architecture Search (NAS) capabilities. The dashboard efficiently manages the model execution in HPC clusters and employs NAS to optimize its architecture. By automating the design process, it aims to enhance the performance of ANNs across different applications. Integrating NAS methodologies into ML dashboards addresses the need for optimizing the architecture of ANNs. This approach facilitates the design process and empowers researchers to achieve better results with ANNs. |
12:00 | PRESENTER: Subhajit Sahu ABSTRACT. Link prediction can help rectify inaccuracies in various graph algorithms, stemming from unaccounted-for or overlooked links within networks. However, many existing works use a baseline approach, which incurs unnecessary computational costs due to its high time complexity. Further, many studies focus on smaller graphs, which can lead to misleading conclusions. This submission introduces our parallel approach, called LHub, which predicts links using neighborhood-based similarity measures on large graphs. LHub is a heuristic approach that disregards large hubs, based on the idea that high-degree nodes contribute little similarity among their neighbors. On a server equipped with dual 16-core Intel Xeon Gold 6226R processors, LHub is on average 1019x faster than not disregarding hubs, especially on web graphs and social networks, while maintaining similar prediction accuracy. Notably, LHub achieves a link prediction rate of 38.1M edges/s and improves performance at a rate of 1.6x for every doubling of threads. |
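The core hub-skipping heuristic can be sketched in a few lines of sequential Python; the degree cutoff and the Jaccard measure below are illustrative choices, not the paper's exact configuration, and LHub itself is a parallel implementation.

```python
# Illustrative sketch of hub-skipping neighborhood similarity (sequential;
# the parallel LHub implementation may differ in details).
def jaccard_skip_hubs(adj, u, v, max_degree=1000):
    """Jaccard similarity of u and v, ignoring neighbours that are large hubs."""
    nu = {w for w in adj[u] if len(adj[w]) <= max_degree}
    nv = {w for w in adj[v] if len(adj[w]) <= max_degree}
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0

def predict_links(adj, top_k=10, max_degree=1000):
    scores = []
    nodes = list(adj)
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if v not in adj[u]:                      # only score missing edges
                scores.append((jaccard_skip_hubs(adj, u, v, max_degree), u, v))
    return sorted(scores, reverse=True)[:top_k]
```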
12:15 | PRESENTER: Bernardo Bernardino Gameiro ABSTRACT. Compton cameras are radiation detectors that provide spatial information on the origin of the γ-ray sources based on the Compton scattering effect. Many applications require these detectors to be used at high counting rates. As such, the preprocessing of the detections as well as the imaging algorithms are required to be time-efficient in order for the data to be processed in real time. In this work, optimizations to the preprocessing of events in Compton cameras based on monolithic crystals, with special focus on event identification, were implemented using a parallelizable algorithm. Regarding imaging, an established 3D back projection algorithm was parallelized and implemented using SYCL. The parallel implementation of the algorithm was evaluated both without and with several optimizations, such as pre-computing values, discarding low-impact contributions based on angle, and selecting an efficient shape of the image universe. The implementations were tested on Intel CPUs and GPUs, and on NVIDIA GPUs. An outlook into the study of algorithms to reconstruct the position of interaction within Compton cameras based on monolithic crystals into segmented regions, and other next steps, is included. |
10:30 | Efficient Deployment and Fine-Tuning of Transformer-Based Models on the Device-Edge PRESENTER: Hongfeng Li ABSTRACT. Transformer-based pre-trained models have achieved breakthrough results in deep learning, and fine-tuned models can achieve good performance on a wide range of tasks. However, it is challenging to directly deploy these models and fine-tune them with local data, on resource-constrained devices. To address this limitation, this paper introduces a framework for the distributed deployment of transformer-based models on the device-edge and outlines the process for efficient fine-tuning. By fine-tuning partial parameters of the model instead of adjusting all parameters, the computation and memory overhead in the fine-tuning task is optimized. The communication pressure is reduced by freezing and masking a portion of the neurons in the layer. Evaluation results indicate that this simple and efficient fine-tuning method has little impact on the model accuracy. |
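A minimal PyTorch sketch of the freezing idea described in the abstract follows; the generic model and the choice of which part to freeze are assumptions for illustration, and the paper additionally masks neurons to reduce communication.

```python
# Minimal PyTorch sketch of partial-parameter fine-tuning: freeze the encoder
# and update only the task head. Model and layer choices here are illustrative.
import torch
import torch.nn as nn

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)
head = nn.Linear(64, 2)

for p in encoder.parameters():       # frozen part: no gradients, no updates
    p.requires_grad = False

optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)  # only the head is tuned
criterion = nn.CrossEntropyLoss()

x = torch.randn(8, 16, 64)           # (batch, sequence, feature), dummy data
y = torch.randint(0, 2, (8,))
logits = head(encoder(x).mean(dim=1))
loss = criterion(logits, y)
loss.backward()
optimizer.step()
```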
10:54 | Seamless FPGA Integration with Stream Processing Engines PRESENTER: Gabriele Mencagli ABSTRACT. Stream processing is a computing paradigm enabling the analysis of data streams arriving at high speed from data producers. Its goal is to extract knowledge and complex events by processing streams with high throughput and low latency. To accomplish this goal, Stream Processing Engines (SPEs) try to exploit the parallel processing capabilities provided by modern hardware (usually multi-core CPUs and distributed systems). The exploitation of hardware accelerators, and in particular of FPGAs, is promising because they can maximize parallelism and reduce energy consumption. However, programming FPGAs is a very cumbersome and challenging task requiring a lot of expertise. In this paper, we discuss the seamless integration of FSPX, a prototype system for generating FPGA-based implementations of streaming pipelines, with an existing SPE (WindFlow). Our goal is to integrate these two tools by providing high-level programming interfaces to end users and guaranteeing high performance with efficient hardware utilization. |
11:18 | Scheduling Microservices Applications in the Cloud-to-Edge Continuum PRESENTER: Orazio Tomarchio ABSTRACT. The growing use of the Internet of Things (IoT) has led to a widespread adoption of Cloud and Edge computing technologies to better manage and analyze the huge amount of data generated by IoT devices. The combination of Cloud and Edge computing paradigms attempts to avoid their pitfalls while offering the best of both worlds: Cloud scalability and computing closer to the Edge where data is typically generated. However, placing microservices in such highly heterogeneous environments while meeting Quality of Service (QoS) constraints is a challenging task due to the geo-distribution of nodes and the varying heterogeneous computational resources. In this work, we propose an approach for scheduling microservices applications across the Cloud-to-Edge continuum to minimize QoS violations on the response times of these applications. Our approach has been evaluated on the iFogSim simulator and compared with some of its built-in scheduling strategies. |
11:42 | Towards declarative traffic engineering for guaranteed latency-based forwarding PRESENTER: Jacopo Massa ABSTRACT. Cloud-edge applications often require data flows to meet stringent latency requirements, ensuring punctual IP packet delivery within specific time windows. To this end, the guaranteed Latency Based Forwarding (gLBF) approach features per-hop bounded delays to meet the target latency requirements of data flows based on end-to-end application needs. This article introduces a Prolog-based specification of gLBF for path selection and delay configuration. Our prototype determines paths and delays to meet data flow latency targets, offering a concise and extendable solution to the considered problem. We illustrate the scalability of our approach on real and random network topologies, with up to 1000 nodes and flows to be placed. |
12:06 | Hardware/Software Co-design for FPGA-based Scalable Secure Edge Computing Platforms ABSTRACT. As one of the most effective approaches to exploiting the software's flexibility and the hardware's efficiency, hardware/software co-design is widely used for accelerating computing systems from high performance to edge. This paper presents an efficient hardware/software co-design approach for developing secure edge computing systems, with various data security approaches integrated into reconfigurable fabrics to support the processor systems. With the proposed system, multiple security approaches can be selected and configured for the edge computing platforms for different purposes or adapted to different attacks. Multiple cryptography and security approaches have been tested with FPGA-based edge computing platforms, ranging from low-end to mid-end, to validate and evaluate the framework. Our experimental results show that our edge computing system is more efficient than all other platforms while offering higher throughput than other processors. |
More information on the tutorial program is available at https://www.danysoft.com/workshop-intel-danysoft-euro-par-2024/.
10:30 | LOMOS: Runtime Security Monitoring Fit for the Cloud Continuum PRESENTER: Hrvoje Ratkajec ABSTRACT. Given the challenges faced by various industries in the global digital transformation process, it is essential to detect anomalies by consuming the collected system logs and returning an anomaly score, which significantly enhances the visualization of vulnerabilities and improves the overall security posture of systems. This paper presents LOg MOnitoring System (LOMOS), a robust AI technology and methodology for anomaly detection on logs, tailored to adapt to new data sensitivity concerns. LOMOS facilitates the creation of informative metrics/variables with significant screening capabilities, addressing the critical need for real-time monitoring of stack conditions to fuel its self-healing mechanisms. The proposed system is designed to detect security-related events and incidents within the deployed application environment and is deployable automatically, providing users with timely notifications about security episodes. In this paper, we demonstrate the advantages of this approach in the continuous detection of vulnerabilities, threats and malware in production infrastructures and during software development phases, which appear in the infrastructure when new services or features are added, or simply when new vulnerabilities are discovered in existing (outdated) services. By seamlessly integrating this novel transformer-based anomaly detection methodology with the cloud continuum, it facilitates a smooth and secure digital transformation process, ensuring comprehensive adherence to evolving security requirements while supporting the dynamic nature of modern infrastructures. |
10:50 | Taming the Swarm: a role-based approach for autonomous agents PRESENTER: Francesc-Josep Lordan Gomis ABSTRACT. Hyper-distributed applications leverage the Compute Continuum to offer highly-available, trustworthy and secure services with low latency. Given the number of managed devices and their heterogeneity, finding an optimal distribution of these applications' functionalities is a challenging problem, tackled either manually or with a resource-hungry, centralised solution. This work proposes a ground-breaking programming environment where applications are composed of several cooperative roles to be played by the devices. A novel software stack transforms each device into an Autonomous ageNT (ANT) deciding which roles it executes, according to its capabilities and context, and enabling the construction of a cooperative, self-organized platform. |
11:10 | PRESENTER: Chanaka Hettige ABSTRACT. Offloading collective communication operations to hardware platforms is increasingly becoming common in the research space. This paper presents offloading the All Reduce Sum collective communication operation to programmable logic by utilizing the PyTorch distributed learning library. While being widely accessible, this enables existing PyTorch code bases to be adapted with little to no code changes. The programmable logic handles computations and intermediate communications, reducing the load and the number of communications handled by software. Furthermore, the hardware design is self-contained such that even without the PyTorch-based user interfacing, if the relevant data arrives at the hardware, it is capable of carrying out the full reduction and outputting the results. This further enables the hardware design to be used as an intermediary accelerator for edge data. We are utilizing NetFPGA as the hardware platform given its wide availability, enabling users to easily benefit from the accelerations. The current design focuses on All Reduce Sum operations as this is a necessary operation in most distributed learning applications and accounts for a large fraction of the network latency. Our overall design goes head-to-head with GPU-based NCCL at 296.078 us and, as a standalone accelerator, computations and communications take only 56.012 us. |
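For reference, the user-facing PyTorch call that such an offload serves is the standard all-reduce shown below; this sketch uses the stock gloo backend, since the NetFPGA transport described in the abstract is not a public PyTorch API.

```python
# Standard PyTorch all-reduce (SUM) call that an offloaded backend would serve.
# Uses the stock "gloo" backend for illustration; the FPGA path in the paper
# replaces the transport underneath while this user-facing code stays unchanged.
import os
import torch
import torch.distributed as dist

def run(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    t = torch.ones(4) * (rank + 1)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)   # every rank ends with the sum
    print(f"rank {rank}: {t.tolist()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    torch.multiprocessing.spawn(run, args=(2,), nprocs=2)
```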
11:30 | Evaluation of Multi-Armed Bandit algorithms for efficient resource allocation in Edge platforms PRESENTER: Jiangbo Wang ABSTRACT. As computational systems become more heterogeneous and the number of computing nodes increases, designing high-performance and energy-efficient scheduling policies for Edge/IoT platforms has become increasingly important. This work evaluates multi-armed bandit (MAB) strategies for efficient resource allocation in Edge platforms, like smart cities or smart buildings. Factors like parallel performance and energy usage optimization were taken into account. The resource allocation methods proposed in this paper extend beyond simulations, involving real tasks run on actual IoT devices. We find that MAB scheduling, adapted to the specifics of applications running on heterogeneous IoT devices, can effectively balance execution time and power consumption, thus achieving optimal task allocation and resource utilization. |
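To make the MAB scheduling idea concrete, here is a toy epsilon-greedy bandit for picking a device per task; the device names and the reward combining execution time and energy are assumed examples, and the paper evaluates several MAB strategies beyond this one.

```python
# Toy epsilon-greedy multi-armed bandit for device selection (illustrative;
# the reward combining time and energy is an assumed example).
import random

class EpsilonGreedyScheduler:
    def __init__(self, devices, epsilon=0.1):
        self.devices = devices
        self.epsilon = epsilon
        self.counts = {d: 0 for d in devices}
        self.values = {d: 0.0 for d in devices}   # running mean reward per device

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(self.devices)     # explore
        return max(self.devices, key=lambda d: self.values[d])  # exploit

    def update(self, device, exec_time, energy, w=0.5):
        reward = -(w * exec_time + (1 - w) * energy)   # lower cost = higher reward
        self.counts[device] += 1
        n = self.counts[device]
        self.values[device] += (reward - self.values[device]) / n

sched = EpsilonGreedyScheduler(["rpi-4", "jetson", "edge-server"])
device = sched.choose()
sched.update(device, exec_time=1.2, energy=3.4)
```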
11:50 | Placing Computational Tasks within Edge-Cloud Continuum: a DRL Delay Minimization Scheme PRESENTER: Anastasios Giannopoulos ABSTRACT. In the rapidly evolving landscape of IoT-Edge-Cloud continuum (IECC), effective management of computational tasks offloaded from mobile devices to edge nodes is crucial. This paper presents a Distributed Reinforcement Learning Delay Minimization (DRL-DeMi) scheme for IECC task offloading. DRL-DeMi is a distributed framework engineered to tackle the challenges arising from the unpredictable load dynamics at edge nodes. It empowers each edge node to independently make offloading decisions, optimizing for non-divisible, latency-sensitive tasks without reliance on prior knowledge of other nodes' task models and decisions. By framing the problem as a multi-agent computation offloading scenario, DRL-DeMi aims to minimize expected long-term latency and task drop ratio. Adhering to IECC requirements for seamless task flow within the Edge layer and between Edge-Cloud layers, DRL-DeMi considers three computation decision avenues: local computation, horizontal offloading to another edge node, or vertical offloading to the Cloud. Integration of advanced techniques such as long short-term memory (LSTM), double deep Q-network (DQN), and dueling DQN enhances long-term cost estimation, thereby refining decision-making efficacy. Simulation results validate DRL-DeMi's superiority over baseline offloading algorithms, showcasing reductions in both task drop ratio and average delay. |
13:30 | A Chapel-based Multi-GPU Branch-and-Bound Algorithm PRESENTER: Guillaume Helbecque ABSTRACT. The increasing heterogeneity and diversity of modern supercomputers bring, along with the heterogeneity challenge, code and performance portability issues. In this context, the Chapel programming language comes as a solution for both the heterogeneity and portability challenges, as it provides vendor-agnostic GPU support. In this paper, we deal with the design and implementation of a multi-GPU Branch-and-Bound algorithm in Chapel. The contribution consists of a generic multi-pool data structure coupled with a dynamic load balancing mechanism based on work stealing. While the CPU cores are used to perform a parallel tree exploration, GPU devices are used to accelerate the bounding phase, which is particularly compute intensive. Extensive experiments on the Permutation Flowshop Scheduling Problem reveal that the proposed approach can achieve a strong scaling efficiency of up to 63% and 75% on average using a GPU-powered processing node including 8 NVIDIA A100 devices and AMD MI50 GPUs, respectively. This demonstrates the efficiency of our approach to solving large combinatorial optimization problem instances, while ensuring code portability. |
14:00 | Accelerating scientific computing kernels by fusing the polyhedral and tensor compilers PRESENTER: Changbo Chen ABSTRACT. Polyhedral compilers and tensor compilers have achieved great success on accelerating respectively scientific computing kernels and deep learning networks. Although much work has been done to integrate techniques of the polyhedral model into tensor compilers for accelerating deep learning, leveraging the powerful auto-tuning ability of modern tensor compilers to accelerate more general scientific computing kernels is challenging and still at its dawn. In this work, we introduce a method to accelerate a family of basic scientific computing kernels by fusing the polyhedral compiler Pluto and the tensor compiler TVM to generate efficient implementations targeting heterogeneous CPU/GPU platforms. The fusion is done by first applying Pluto to generate a new polyhedral model of the loop to enable rectangular tiling and expose parallelism, and then converting the new polyhedral model, optimized further by padding or shifting, to a valid and efficient tensor compute in TVM. Experiments on 18 typical scientific computing kernels show that our method achieves a 4.94x speedup on average over the typical polyhedral compiler PPCG on GPU. |
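To illustrate what "converting a loop into a tensor compute and letting TVM schedule it" looks like, here is a generic sketch with rectangular tiling, assuming TVM's tensor-expression (te) API; it is unrelated to the authors' Pluto-driven transformation and padding/shifting optimizations.

```python
# Generic TVM tensor-expression sketch with rectangular tiling (illustrative;
# assumes the te scheduling API, not the authors' Pluto-to-TVM pipeline).
import tvm
from tvm import te

A = te.placeholder((1024, 1024), name="A")
B = te.placeholder((1024, 1024), name="B")
C = te.compute((1024, 1024), lambda i, j: A[i, j] + B[i, j], name="C")

s = te.create_schedule(C.op)
io, jo, ii, ji = s[C].tile(C.op.axis[0], C.op.axis[1], x_factor=32, y_factor=32)
s[C].parallel(io)                                  # parallelize the outer tile loop

print(tvm.lower(s, [A, B, C], simple_mode=True))   # inspect the tiled loop nest
mod = tvm.build(s, [A, B, C], target="llvm")       # compile for CPU
```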
14:30 | Element scheduling for GPU-accelerated finite-volumes computations. PRESENTER: Franco Seveso ABSTRACT. The numerical solution of partial differential equations presents inherent challenges in achieving efficient parallelism due to dependencies between grid nodes. This article introduces a novel synchronization mechanism that aggressively tests thread dependencies, enabling immediate computation scheduling upon resolution. We conduct a comparative analysis with classical level-set synchronization within the Strongly Implicit Procedure (SIP) framework and apply it to a real-world fluid dynamics problem. Experimental results demonstrate a significant speedup, showing up to 23% better performance than previous level-set implementations. |
13:30 | Topology Aware Aggregation for Federated Graph Learning ABSTRACT. Graph machine learning has recently gained significant traction in academia and industry. Both areas see training of machine learning models, such as graph neural networks (GNNs), on massive graph datasets. In most real-world settings, however, such as banking transaction networks and healthcare networks, this data is localized with multiple data owners and cannot be aggregated due to privacy concerns. Federated Graph Learning (FGL) has emerged as a viable approach to tackle this issue. FGL involves training shared graph machine learning models locally at the data owners and aggregating them to achieve a global GNN model, while simultaneously addressing privacy concerns and regulations. While FGL has seen focused research interest in recent years, with several new frameworks proposed and existing federated learning (FL) frameworks adding support for FGL, few works have attempted to adapt the standard FL components, such as client selection and aggregation for the specific application of FGL. We present a preliminary study on adapting the standard aggregation algorithms used in FL for FGL by leveraging graph topology. This study is part of a broader objective to explore solutions for a resource-efficient, robust, and scalable FGL framework. |
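As a concrete illustration of topology-aware aggregation, the sketch below weights a federated average by each client's local edge count; this weighting rule is an assumption for illustration, not the study's final algorithm.

```python
# Sketch of topology-weighted federated averaging: clients with more edges in
# their local subgraph get more weight. The weighting rule is an illustrative
# assumption, not the paper's proposed aggregation.
import torch

def topology_weighted_average(client_states, client_edge_counts):
    """client_states: list of model state_dicts; client_edge_counts: list of ints."""
    total = float(sum(client_edge_counts))
    weights = [e / total for e in client_edge_counts]
    avg = {}
    for key in client_states[0]:
        avg[key] = sum(w * sd[key].float() for w, sd in zip(weights, client_states))
    return avg

# Example with two dummy "models" of one tensor each.
sd1 = {"layer.weight": torch.ones(2, 2)}
sd2 = {"layer.weight": torch.zeros(2, 2)}
print(topology_weighted_average([sd1, sd2], client_edge_counts=[300, 100]))
```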
13:45 | Training a Transformer Model for OpenMP Pragma Generation PRESENTER: Arijit Bhattacharjee ABSTRACT. Large language models (LLMs) such as ChatGPT have significantly advanced the field of Natural Language Processing (NLP). While the generic abilities of these code LLMs are useful for many programmers in tasks like code generation, the area of high-performance computing (HPC) has a narrower set of requirements that make a smaller and more domain-specific model a smarter choice. This paper presents OMPGPT, a novel domain-specific model meticulously designed to harness the inherent strengths of language models for OpenMP pragma generation. Furthermore, we leverage prompt engineering techniques from the NLP domain to create Chain-of-OMP, an innovative strategy designed to enhance OMPGPT's effectiveness. Our extensive evaluations demonstrate that OMPGPT outperforms existing large language models specialized in OpenMP tasks and maintains a notably smaller size, aligning it more closely with the typical hardware constraints of HPC environments. |
14:00 | Augmenting Cloud Resource Management with the Necessary Amount of Machine Intelligence ABSTRACT. Cloud computing and data center environments often suffer from low resource efficiency due to overprovisioning and suboptimal management decisions. Improving resource efficiency requires accurate forecasting of metrics like resource consumption and user utilization patterns. However, forecasting resource usage is challenging due to diverse and dynamic patterns across different levels and types of resources. This doctoral thesis aims to contribute a comprehensive approach to designing an end-to-end cloud resource management system aimed at enhancing resource, energy, and cost efficiency. The research explores a range of machine learning (ML) and non-ML-based methods for predicting future resource usage and evaluates them based on application-level metrics. A key contribution of this thesis is the development of a novel cloud resource forecasting model that combines ML and non-ML strategies to achieve a balance between resource efficiency and operational overheads. This model will be integrated into an end-to-end cloud resource management system, leveraging techniques such as overcommitment and autoscaling to optimize resource utilization and application performance. Overall, this thesis aims to advance the state-of-the-art in cloud resource management and contribute to more efficient and sustainable cloud computing infrastructures, leveraging only the necessary amount of machine intelligence. |
14:15 | Optimization of LLM Inference Systems PRESENTER: Konstantinos Papaioannou ABSTRACT. The rapid progress in machine learning has led to the development of very large models with extraordinary capabilities, such as creative writing and report summarization. However, applications powered by these models require significant amounts of compute resources and usually run on expensive GPUs and accelerators that consume excessive energy. To meet the demand of these applications and reduce the cost of operating them, new custom inference systems have been developed. These systems exploit the unique characteristics of Large Language Models (LLMs) to increase performance and efficiently use compute resources. Our early work highlights the difference in performance of these systems across use cases with different characteristics. In addition, we identify significant room for improvement in the performance of specific use cases via improved memory management. This thesis plans to build novel LLM inference systems considering different objectives, such as application performance, model accuracy and energy efficiency, while considering the unique characteristics of different use cases and available hardware. |
14:30 | PRESENTER: Pol Capdevila ABSTRACT. Hospitals are presently in need of predictive tools to anticipate emergent situations, particularly those pertaining to mental health crises such as suicide attempts. Despite prior research indicating potential influencing factors, the development of effective solutions remains a formidable challenge. However, recent technological advancements, particularly in artificial intelligence (AI), offer promising avenues for addressing this challenge. Building on these advancements, the present study develops and trains a predictive model utilizing a Long Short-Term Memory (LSTM) neural network. The model is trained using data on suicide attempt admissions and environmental variables, as an influence factor, from a hospital in Catalonia. Results demonstrate the potential of AI to provide valuable insights to hospitals, aiding in the management and optimisation of healthcare resources to effectively address mental health emergencies. |
14:45 | Space Edge Computing for satellite systems: definition and key enabling technologies ABSTRACT. This paper addresses some of the key enabling technologies for emerging space edge computing systems. A specific definition of space edge computing for satellite systems orbiting the earth is given, showing how the computing power in future satellite systems can be improved, in particular through innovative on-board data handling architectures and distributed processing between satellites. |
13:30 | Computational astroplasma simulations: state of the art and future directions |
14:10 | Vlasiator: a global magnetospheric ion-kinetic plasma simulation in the age of GPU supercomputing PRESENTER: Markus Battarbee ABSTRACT. Vlasiator is a space plasma simulation code which models near-Earth collisionless ion-kinetic plasma dynamics in three spatial and three velocity dimensions. It is highly parallelized, modeling the Vlasov equation directly through the distribution function, discretized on a Cartesian grid, instead of the more common particle-in-cell approach. Vlasiator solves the Vlasov equation in the ion-kinetic hybrid formalism, modelling ions directly through the distribution function, with electrons included as a charge-neutralizing fluid. Modeling near-Earth space, plasma properties span several orders of magnitude in temperature, density, and magnetic field. We introduce some recent Vlasiator highlights, as well as talk about future plans. In order to fit the required six-dimensional grids in memory, Vlasiator utilizes a sparse block-based velocity mesh, where chunks of velocity space are added or deleted based on the advection requirements of the Vlasov solver. In addition, the spatial mesh is adaptively refined through cell-based octree refinement. With these innovations, global 6D simulations have been achieved with the fastest CPU-based supercomputers in the world. With the advent of exascale, Vlasiator is being extended to support heterogeneous CPU/GPU architectures. We discuss memory management, algorithmic changes, and kernel construction as well as our unified codebase approach, resulting in portability to both NVIDIA and AMD hardware (CUDA and HIP languages, respectively). We present current performance metrics, discuss some newly developed parallel algorithms, and lay out a plan for optimization to facilitate future exascale simulations using multi-node GPU supercomputing. |
14:30 | Numerical techniques for new advanced high-energy-density laser plasmas. ABSTRACT. In modern times, plasma effects have been modeled using massively parallel simulation techniques to further understanding of space and astrophysical phenomena as well as the results of laboratory experiments. Recent advances open the possibility for laser experiments to produce plasmas in extreme regimes, comparable to some astrophysical environments, where effects such as radiation reaction and even antimatter-pair production become important. In order to properly study these regimes which continue to be computationally challenging, new numerical algorithms for the exotic physics involved must be developed. In this talk, I will review state-of-the-art numerical techniques used to model these unique high-energy environments. |
14:50 | Understanding the Impact of openPMD on BIT1, a Particle-in-Cell Monte Carlo Code, through Instrumentation, Monitoring, and In-Situ Analysis PRESENTER: Jeremy Williams ABSTRACT. Particle-in-Cell Monte Carlo simulations on large-scale systems play a fundamental role in understanding the complexities of plasma dynamics in fusion devices. Efficient handling and analysis of vast datasets are essential for advancing these simulations. Previously, we addressed this challenge by integrating openPMD with BIT1, a Particle-in-Cell Monte Carlo code, streamlining data streaming and storage. This integration not only enhanced data management but also improved write throughput and storage efficiency. In this work, we delve deeper into the impact of BIT1 openPMD BP4 instrumentation, monitoring, and in-situ analysis. Utilizing cutting-edge profiling and monitoring tools such as gprof, CrayPat, Cray Apprentice2, IPM, and Darshan, we dissect BIT1's performance post-integration, shedding light on computation, communication, and I/O operations. Fine-grained instrumentation offers insights into BIT1's runtime behavior, while immediate monitoring aids in understanding system dynamics and resource utilization patterns, facilitating proactive performance optimization. Advanced visualization techniques further enrich our understanding, enabling the optimization of BIT1 simulation workflows aimed at controlling plasma-material interfaces with improved data analysis and visualization at every checkpoint without causing any interruption to the simulation. |
More information on the tutorial program is available at https://www.danysoft.com/workshop-intel-danysoft-euro-par-2024/.
13:30 | Navigating the Dynamic Heterogeneous Computing Sphere: The Role of EdgeHarbor as a Multi-Edge Orchestrator ABSTRACT. The term ‘Dynamic Heterogeneous Computing Sphere’ is used to describe a computing paradigm that is heterogeneous, volatile and highly dynamic. This results from the integration of a large set of resources, which are distributed across a continuum that includes both infrastructure (real and virtual) and data. These resources are distributed from the extreme IoT to the edge (far and near) and to the cloud, with the objective of deploying and executing a set of innovative vertical applications that are heterogeneous in technology and in requirements. The envisioned scenario necessitates the capacity for both resources and services to be elastic, that is, to be capable of being continuously shaped and moulded to support the specific needs of those highly demanding applications, both in the allocation and runtime windows. Furthermore, software modules (those that compose vertical applications) should be intelligently partitioned into virtual elements (i.e., containers) to optimise their placement and consequent execution. This can be achieved by considering aspects such as performance and sustainability. The demand for ad hoc resource and service shaping is increasing in line with the current trends towards softwareisation of systems management. This has been further fuelled by the disaggregation concept and the development of new extremely demanding ultra-real-time services (X-AR, holo, metaverse, etc.). This paper introduces the architecture of the EdgeHarbor orchestrator as an open-source, multi-edge management system for a Dynamic Heterogeneous Computing Sphere. This work has been almost entirely supported by the European Union’s HORIZON research and innovation programme under grant agreement ICOS (www.icos-project.eu), grant number 101070177. |
13:50 | Multi-component Application Mapping across the Continuum PRESENTER: Jordi Garcia ABSTRACT. This paper presents a novel orchestrating mechanism for multi-component applications across the continuum. The orchestration mechanism provides an initial components-to-nodes mapping at launch time and, in case of anomaly detection at runtime, reschedules the application layout to meet the performance objectives within the application-specified constraints. The mapping engine, named Match Maker, is a core component inside the ICOS research project, an intelligent metaOS conceived to leverage the capabilities of the continuum. It is fed with abundant and accurate information collected at runtime from the telemetry component and enriched with additional forecasting system information generated by the intelligence layer. The novelty of the orchestrating mechanism is that the layout provided distributes the application components across the continuum, fulfilling the application requirements and constraints in the selected nodes, and considering the interdependence effects between selected nodes. In addition, the application can define policies for layout decisions based on performance and security, and policies for anomalies detection and remediation actions. |
14:10 | An experimental study of the response time in an edge-cloud continuum with ClusterLink PRESENTER: Fin Gentzen ABSTRACT. In this paper, we conduct an experimental study to provide a general sense of the application response time implications that inter-cluster communication experiences at the edge, using the example of a specific IoT-edge-cloud continuum solution from the EU Project ICOS called ClusterLink. We create an environment to emulate different networking topologies that include multiple cloud or edge site scenarios, and conduct a set of tests to compare the application response times via ClusterLink to direct communications in relation to node distances and request/response payload size. Our results show that, in an edge context, ClusterLink does not introduce a significant processing overhead to the communication for small payloads as compared to cloud. For larger payloads and on comparably older consumer hardware, ClusterLink version 0.2 introduces communication overhead relative to the delay experienced on the link. |
More information on the workshop program is available at https://vhpc.github.io/program/.
15:30 | StarONNX: a Dynamic Scheduler for Low Latency and High Throughput Inference on Heterogeneous Resources PRESENTER: Jean-François David ABSTRACT. Efficient inference of Deep Neural Network (DNN) models on heterogeneous processors is challenging, not only because of the heterogeneity between CPUs and hardware accelerators, but also because the problem is fundamentally bi-objective in many contexts, since both latency (time to perform an inference) and throughput (number of inferences per unit time) need to be optimized. We present StarONNX, a solution based on integrating ONNX Runtime in StarPU, which aims to optimize the distribution of inference tasks and resource management on heterogeneous architectures. This strategy relies on (i) the efficient execution of deep learning models by ONNX Runtime to maximize individual resource efficiency, and (ii) the orchestration of heterogeneous resources by StarPU to provide sophisticated scheduling and overlapping strategies for computation and communication. An essential point of the framework is the ability to split a DNN into two parts, one running on the GPU and the other on the CPU, thus increasing throughput by using all possible resources with a minimal degradation of worst case latency. We show that integrating the ONNX Runtime into StarPU does not introduce significant overhead. We also evaluated our approach against the Triton Inference Server and showed a significant improvement in resource utilization and reduced latency. |
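To give a flavor of the DNN-splitting idea, the sketch below cuts an ONNX model at an intermediate tensor and chains a GPU session with a CPU session using ONNX Runtime; the file paths, tensor names, and split point are hypothetical, and StarONNX's StarPU-driven scheduling is not shown.

```python
# Illustrative sketch of splitting one ONNX model into a GPU part and a CPU
# part and chaining two ONNX Runtime sessions. Paths, tensor names, and the
# split point are hypothetical placeholders; StarPU orchestration is omitted.
import numpy as np
import onnx
import onnxruntime as ort

# Cut the graph at an intermediate tensor (names are placeholders).
onnx.utils.extract_model("model.onnx", "part_gpu.onnx", ["input"], ["mid"])
onnx.utils.extract_model("model.onnx", "part_cpu.onnx", ["mid"], ["output"])

sess_gpu = ort.InferenceSession("part_gpu.onnx",
                                providers=["CUDAExecutionProvider"])
sess_cpu = ort.InferenceSession("part_cpu.onnx",
                                providers=["CPUExecutionProvider"])

x = np.random.rand(1, 3, 224, 224).astype(np.float32)
(mid,) = sess_gpu.run(["mid"], {"input": x})      # first half on the GPU
(out,) = sess_cpu.run(["output"], {"mid": mid})   # second half on the CPU
print(out.shape)
```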
16:00 | Strategies for memory management improvement in deep learning algorithms for contrast enhancement of high-resolution radiological images PRESENTER: Daniel Alejandro Rodriguez Lopez ABSTRACT. This paper presents a detailed analysis of different techniques to reduce memory consumption during the training stages in a Deep Learning algorithm for contrast enhancement in radiography. This makes it possible to work with large images, such as radiological images, avoiding the problems derived from working with patches or the loss of spatial resolution caused by subsampling. For this purpose, different approaches have been studied experimentally, both in single-GPU executions and in multi-GPU systems based on data and model parallelism. Experimental evaluation shows that it is possible to achieve up to 20% reduction for a single node and up to 70% for the distributed model without loss of accuracy. |
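For readers unfamiliar with this class of techniques, the sketch below shows two common memory-reduction mechanisms in PyTorch training, activation checkpointing and mixed precision; the model and image size are placeholders, and the paper's exact strategies (including its multi-GPU data and model parallelism) are not reproduced here.

```python
# Two common memory-reduction techniques in PyTorch training, shown generically
# (illustrative only): activation checkpointing and automatic mixed precision.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

device = "cuda" if torch.cuda.is_available() else "cpu"
block1 = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU()).to(device)
block2 = nn.Sequential(nn.Conv2d(32, 1, 3, padding=1), nn.ReLU()).to(device)
params = list(block1.parameters()) + list(block2.parameters())
opt = torch.optim.Adam(params)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(1, 1, 512, 512, device=device, requires_grad=True)  # large image
target = torch.randn(1, 1, 512, 512, device=device)

with torch.autocast(device_type=device, enabled=(device == "cuda")):
    h = checkpoint(block1, x, use_reentrant=False)   # recompute activations in
    y = checkpoint(block2, h, use_reentrant=False)   # backward instead of storing
    loss = nn.functional.mse_loss(y, target)

scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()
```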
16:30 | A Practical Survey on Static Task Scheduling Optimization Approaches for Heterogeneous Architectures PRESENTER: Jonas Hollmann ABSTRACT. The complexity of software increases constantly, even in embedded real-time and safety-critical systems. This ever-increasing computational demand causes more complex application-specific accelerators to get integrated into modern computational systems, causing a shift towards heterogeneous architectures. In most safety-critical real-time applications, the workload to be executed is known beforehand, allowing the developer to determine a static scheduling of the workload upfront. Generally, programming heterogeneous processors quickly becomes an almost impossible challenge for software developers. As such, numerous approaches for the automatic generation of static schedules have been proposed over the years. These heuristic approaches make use of simplified models of the target platform by ignoring concepts like memory locality or processor core clustering, as well as a simplified graph representation of the software in exchange for lower computational complexity. However, we have yet to see a practical survey of existing approaches, especially in the context of embedded real-time and safety-critical systems. Thus, we map existing heuristic approaches for the automatic generation of an approximate optimal static task scheduling to real hardware using a generic approach. In doing so the effects of the simplification during the modeling process, as well as the efficiency of the underlying heuristic are assessed by benchmarking the results of such scheduling optimization algorithms applied to algorithms of various complexity such as the Fast Fourier Transform and sparse matrix multiplication on several heterogeneous target architectures. Finally, we discuss the capabilities of existing approaches and their applicability for modern real-time critical embedded systems. |
15:30 | Simulations of out–of–equilibrium phenomena using the Yambo project |
16:10 | Runtime Instantiation of Kernels for Fast Fourier Transforms Using SYCL Specialization Constants PRESENTER: Víctor Pérez Carrasco ABSTRACT. In the domain of physics simulations, the pursuit of precision and scale exacerbates the need for additional compute power, for which heterogeneous systems have become an increasingly popular solution. SYCL (5), an industry-defined, vendor-agnostic heterogeneous programming model implemented using standard C++, is a good candidate as a solution to write portable code that can fulfill the needs of highly demanding workloads. An important part of the SYCL ecosystem are libraries such as oneMKL (2), a mathematical kernel library, or portFFT (1), a SYCL library for Fast Fourier Transforms (FFTs), which not only provide performant building blocks, but also aid in developer productivity. In this talk, we will give an overview of portFFT, leveraging SYCL to accelerate FFTs of different input sizes, which can be used in different physics applications, e.g. Gromacs (4). portFFT, as well as other similar libraries, balances binary size and performance for different FFT configurations. Ideally, we would like to have one highly optimized kernel for each possible input size generated at compile time, but this would greatly increase binary size, complicating deployment. As a mitigation, SYCL introduces specialization constants, values which are guaranteed to stay constant throughout the execution of a kernel and whose value can be set before kernel launch. SYCL implementations may use Just In Time (JIT) compilation to specialize kernels using the provided values before launch. This way, programmers can get the benefits of highly specialized kernels by instantiating them at runtime when the input sizes are already known. We show the results of applying this approach and a new Specialization Constant--Length Allocations (SCLA) extension (5) built on top of specialization constants. When running our test set, our modified version of the portFFT library using specialization constants and SCLA reported a speedup over the original version, using kernel execution time as the metric, demonstrating the potential of specialization constants for SYCL applications and libraries. |
16:30 | Towards Large-scale Top-Down Microarchitecture Analysis Using the Score-P Framework PRESENTER: Hannes Tröpgen ABSTRACT. The Top-Down Microarchitecture Analysis Method introduced by Yasin [31] established a simple yet effective generic approach to performance analysis. It can be used to analyze which microarchitectural resource poses a bottleneck during the execution of arbitrary codes. The recently introduced Golden Cove microarchitecture used in Intel’s Sapphire Rapids processors provides an extended, automated hardware support for this analysis. In this paper, we describe the implementation of this microarchitectural feature, integrate support into the performance measurement infrastructure Score-P, and explore the capabilities of the resulting tool with two example applications: The Computational Fluid Dynamics (CFD) solver for turbo-machinery applications Hydra [23], and the open-source weather and climate model ICON [32]. |
16:50 | A Survey on Cellular Automata Parallel Execution Techniques |
More information on the workshop program is available at https://vhpc.github.io/program/.