Program for Monday, August 26th

ABSTRACT. With the growth of real-time applications and IoT devices, computation is moving from cloud-based services to the low latency edge, creating a computing continuum. This continuum includes diverse cloud, edge, and endpoint devices, posing challenges for software design due to varied hardware options. To tackle this, a unified resource manager is needed to automate and facilitate the use of the computing continuum with different types of resources for flexible software deployments while maintaining consistent performance. Therefore, we propose a seamless resource manager framework for automated infrastructure deployment that leverages resources from different providers across heterogeneous and dynamic Edge-Cloud resources, ensuring certain Service Level Objectives (SLOs). Our proposed resource manager continuously monitors SLOs and reallocates resources promptly in case of violations to prevent disruptions and ensure steady performance. The experimental results across serverless and serverful platforms demonstrate that our resource manager effectively automates application deployment across various layers and platforms while detecting SLO violations with minimal overhead.

09:00-10:00 Session 1D: EuroQHPC - Session A

Location: -1.A.01

09:00

Francisco Javier Cardama Santiago, Jorge Vázquez Pérez, Tomás Fernández Pena, Juan C. Pichel Campos and Andrés Gómez Tato

Quantum Compilation Process: a Survey

PRESENTER: Francisco Javier Cardama Santiago

ABSTRACT. Quantum compilation, critical for bridging high-level quantum programming and physical hardware, faces unique challenges distinct from classical compilation. As quantum computing advances, scalable and efficient quantum compilation methods become necessary. This paper surveys the landscape of quantum compilation, detailing the processes of qubit mapping and circuit optimization, and emphasizing the need for integration with classical computing to harness quantum advantages. Techniques such as Variational Quantum Eigensolver (VQE) exemplify hybrid approaches, highlighting the potential synergy between quantum and classical systems. It is concluded that, while quantum compilation retains many classic methodologies, it introduces novel complexities and opportunities for optimization and verification, essential for the evolving field of quantum computing.

09:20

Laura M. Donaire, Gloria Ortega, Francisco Orts and Ester M. Garzón

Optimizing a quantum BCD Adder in terms of T-gates and CNOT gates

PRESENTER: Laura M. Donaire

ABSTRACT. Quantum computing emerges as a pivotal solution to classical computing limitations in the post-Moore era, offering superior capabilities through qubits' unique properties of superposition and entanglement. Actually, we are in the Noisy Intermediate-Scale Quantum where quantum devices present challenges in circuit design. This work proposes optimized implementations of Binary Coded Decimal adders using Clifford+T gates. With a focus on mitigating the computational costs associated with T-gates and the error rates of CNOT gates, two distinct designs are presented. The first design prioritizes minimizing T-gate usage, leading to significant reductions in T-count and T-depth, while slightly decreasing the number of CNOT gates. The second design targets minimizing CNOT gate usage, resulting in a 15\% reduction in CNOT gates, alongside notable reductions in T-count and T-depth. Leveraging the capabilities of the Clifford+T gate set, both designs showcase a balance between efficiency and error mitigation.

09:00-10:00 Session 1E: HiPES - Session A

Location: -1.A.07

09:00

Felix Wolf

Tackling the imbalance between computation and I/O

09:00-10:00 Session 1F: AMTE - Session A

Location: -1.A.05

09:00

Joel Falcou

Keynote talk: Tools of the Trade Embracing Our Inner Craftsman

ABSTRACT. Designing software tools and libraries represents a critical yet often overlooked aspect in the field of software engineering. Many prioritize rapid algorithm or application development, overlooking the productivity gains from thoughtful tools and library design. This presentation will argue for a more balanced approach, emphasizing the craftsmanship involved in creating robust, effective software tools. By focusing on the development of specialized, domain-specific tools and libraries, developers can achieve higher productivity and software quality. The talk will cover best practices, case studies, and methodologies for tool and library design, aiming to inspire attendees to embrace their inner craftsman in software development.

09:40

John Holmen, Marta Garcia, Allen Sanderson, Abhishek Bagusetty and Martin Berzins

Lessons Learned and Scalability Achieved when Porting Uintah to DOE Exascale Systems

PRESENTER: John Holmen

ABSTRACT. A key challenge faced when preparing codes for Department of Energy (DOE) exascale systems was designing scalable applications for systems featuring hardware and software not yet available at leadership-class scale. With such systems now available, it is important to evaluate scalability of the resulting software solutions on these target systems. One such code designed with the exascale DOE Aurora and DOE Frontier systems in mind is the Uintah Computational Framework, an open-source asynchronous many-task runtime system. To prepare for exascale, Uintah adopted a portable MPI+X hybrid parallelism approach using the Kokkos performance portability library (i.e., MPI+Kokkos). This paper complements recent work with additional details and an evaluation of the resulting approach on Aurora and Frontier. Results are shown for a challenging benchmark demonstrating interoperability of 3 portable codes essential to Uintah-related combustion research. These results demonstrate single-source portability across Aurora and Frontier with strong-scaling characteristics shown to 768 Aurora nodes and 9,216 Frontier nodes. In addition to showing results run to new scales on new systems, this paper also discusses lessons learned through efforts preparing Uintah for exascale systems.

10:00-10:30Coffee Break

10:30-12:30 Session 2A: FEDn - Session B

More information of the minisymposium program at https://sciml.se/euro-par-conference-2024-minisymposium/.

Location: -1.A.02

10:30-12:30 Session 2B: ABUMPIMP - Session B

More information of the minisymposium program at https://www.upmem.com/abumpimp-2024/.

Location: -1.A.06

10:30-12:30 Session 2C: DYNRESHPC - Session B

Location: -1.A.04

10:30	Rajat Bhattarai, Howard Pritchard and Sheikh Ghafoor Evaluation of a Dynamic Resource Management Strategy for Elastic Scientific Workflows PRESENTER: Sheikh Ghafoor ABSTRACT. As scientific workflows grow in complexity, often combining AI tasks with traditional high-performance computing (HPC) simulations, there's an urgent need for dynamic resource management and elastic execution for better utilization of resources in HPC supercomputers. Through this elasticity, it becomes possible to steer computational processes in real-time, improving the efficiency of both scientific workflows as well as resource management systems. This paper presents a performance assessment of a dynamic resource management strategy for scientific workflows in HPC systems, based on the elastic PMIx-enabled Parsl workflow manager and a custom hierarchical scheduler on top of Slurm, focusing on its ability to efficiently scale and manage resources in real-time. Using a series of controlled experiments and a case study involving real applications of domains such as bioinformatics, we analyze how this kind of resource management strategy with elasticity impacts the performance of scientific workflows and HPC systems. By integrating quantitative analysis with practical insights, this paper aims to inform future developments and optimizations in dynamic resource management and scheduling for elastic scientific computing.
11:00	Iker Martín Álvarez, José Ignacio Aliaga, Maribel Castillo and Sergio Iserte MaM: A User-Friendly Interface to Incorporate Malleability into MPI Applications PRESENTER: Iker Martín Álvarez ABSTRACT. Malleability can be defined as the capability of a distributed MPI parallel job to modify the number of processes without pausing their execution, by reallocating the computational resources originally allocated to the job. In general, malleability consists of four stages: reallocating resources, managing processes, redistributing data and resuming execution. MaM is a tool that allows the incorporation of malleability into parallel MPI-based applications. This work introduces the MaM interface, which allows the programmer to use its capabilities in a simple and transparent way. It also compares the cost of reconfigurations using different strategies offered in MaM and against an ideal time.
11:30	Jonas Posner The Impact of Evolving APGAS Programs on HPC Clusters ABSTRACT. High-performance computing (HPC) clusters are traditionally managed statically, i.e., user jobs maintain a fixed number of computing nodes for their entire execution. This approach becomes inefficient with the increasing prevalence of dynamic and irregular workloads, which have unpredictable computation patterns that result in fluctuating resource needs at runtime. For instance, nodes cannot be released when they are not needed, limiting the overall supercomputer performance. However, the realization of jobs that can grow and shrink their number of node allocations at runtime is hampered by a lack of support in both resource managers and programming environments. This work leverages evolving programs that grow and shrink autonomously through automated decision-making, making them well-suited for dynamic and irregular workloads. The Asynchronous Many-Task (AMT) programming model has recently shown promise in this context. In AMT, computations are decomposed into many fine-grained tasks, enabling the runtime system to transparently migrate these tasks across nodes. Our study builds on the APGAS-AMT runtime system, which supports evolving capabilities, i.e., handles process initialization and termination automatically requiring minimal additions to user code. We enable interactions between APGAS and a prototype resource manager as well as extend the Easy-Backfilling job scheduling algorithm to support evolving~jobs. We conduct real-world job batch executions on 10 nodes—involving a mix of rigid, moldable, and evolving programs—to evaluate the impact of evolving APGAS programs on supercomputers. Our experimental results demonstrate a 23% reduction in job batch makespan and a 29% reduction in job turnaround time for evolving jobs.
12:00	Sergio Iserte, Victor Lopez, Marta Garcia-Gasulla and Antonio J. Peña Parallel Efficiency-aware Standard MPI-based Malleability PRESENTER: Sergio Iserte ABSTRACT. This article presents the integration of the TALP performance metrics collector in the Dynamic Management of Resources Library (DMRlib) to let Slurm take performance-aware reconfiguration actions. Traditionally, scientific applications that make use of high-performance computing cannot reallocate resources once assigned, making it difficult to adapt workloads. Dynamic resource management, especially through the adaptability of MPI processes, is proposed as a solution to this limitation. Thanks to the integration of TALP, DMRlib will offer new reconfiguration policies based on performance metrics such as parallel efficiency, load balancing, or communication efficiency.

10:30-12:30 Session 2D: EuroQHPC - Session B

Location: -1.A.01

10:30	Ernesto Acosta Martín, Carlos Cano Gutiérrez, Guillermo Botella and Roberto Campos Adiabatic training for Variational Quantum Algorithms PRESENTER: Ernesto Acosta Martín ABSTRACT. This paper presents a new Quantum Machine Learning model where classical data gets processed on a quantum computer. We propose a hybrid Quantum-Classical model composed of three elements: a classical computer in charge of the data preparation and interpretation; a Gate-based Quantum Computer running the VQA (Variational Quantum Algorithm) representing the Quantum Neural Network; and an adiabatic Quantum Computer where the optimization function is executed to find the best parameters for the VQA. As of the moment of this writing, the majority of Quantum Neural Networks are being trained using gradient-based classical optimizers having to deal with the barren-plateau effect. Some gradient-free classical approaches such as Evolutionary Algorithms have also been proposed to overcome this effect. However, to the knowledge of the authors, adiabatic quantum models have not been used to train VQAs. The paper compares the results of gradient-based classical algorithms against adiabatic optimizers and shows the feasibility of integration for gate-based and adiabatic quantum computing models, avoiding the barren plateau effect and opening the door to modern hybrid quantum machine learning approaches for High Performance Computing.
10:55	Luis Sánchez Cano, Guillermo Botella Juan, Ginés Carrascal de Las Heras and Alberto Barrio García Factoring integers via Schnorr’s algorithm assisted with VQE PRESENTER: Luis Sánchez Cano ABSTRACT. Current asymmetric cryptography is based on the principle that while classical computers can efficiently multiply large integers,the inverse operation, factorization, is significantly more complex. For sufficiently large integers, this factorization process can take classical computers hundreds or even thousands of years to complete. However, there exist some quantum algorithms that might be able to factor integers theoretically and, for instance, Yan, B. et al. claim to have constructed a hybrid algorithm which could be able even to challenge RSA-2048 in the near future. This work analyses that article and replicates the experiments they carried out, but with a different quantum method (VQE), being able to factor the number 1961.
11:20	Océane Koska, Maxime Remaud and Arnaud Gazda Hybrid Quantum Computing: the Use Case of Shor’s Algorithm PRESENTER: Océane Koska ABSTRACT. Classical computing, initially focused on Central Processing Units (CPU) programming, has gradually evolved into hybrid computing. For example, some applications use co-processors (e.g., Graphical Processing Units, or Field Programmable Gate Array) to speed up some computations. This shift has not fundamentally changed the way classical applications are designed or developed, but involved the introduction of new tools to extend existing libraries or programming languages. Similarly, the adoption of quantum computing does not mean that the way hybrid applications are developed will profoundly change, as quantum computing is itself a new form of hybrid computing. In this paper, we show that implementing a classical-quantum algorithm can be done by reusing concepts introduced by classical computing, through the implementation of Shor’s algorithm in a hybrid version. Our implementation of Shor’s algorithm was written in C++ thanks to the Q-Pragma framework, which extends C++ to introduce new quantum directives.
11:45	Joanna Wiśniewska and Marek Sawerwain Variational Quantum Eigensolver for Classification in Distributed Data Sets PRESENTER: Joanna Wiśniewska ABSTRACT. In this work, we take into consideration a quantum circuit which is based on the Variational Quantum Eigensolver (VQE) and so-called SWAP-Test what allows us to solve a classification problem for distributed data -- there are only two classes, but samples form many clusters which directly neighbor to clusters of samples from another class. The classical data observations are converted into normalized quantum states. After this operation, samples may be processed by a circuit of quantum gates. The VQE approach allows training the parameters of a quantum circuit (so-called ansatz) to output pattern-states for each class. In the utilized data set, two classes may be observed, however, the VQE circuit differentiates more classes than two (introduces more detailed cases because the samples are distributed) and the final results are obtained with the use of aforementioned SWAP-Test. The combination of the VQE and the SWAP-Test allows for the construction of a flexible system where various data sets may be classified by changing parameters of the VQE circuit. The elaborated solution is compact and requires only logarithmically increasing number of qubits (due to the exponential capacity of quantum registers). All calculations, simulations, plots, and comparisons were implemented and conduced in the Python language environment. Source codes for each example of quantum classification can be found in the source code repository.

10:30-12:30 Session 2E: HiPES - Session B

Location: -1.A.07

10:30	Juan José Moreno Riado, Savíns Puertas Martín, Juana López Redondo, Pilar Martínez Ortigosa and Ester Martín Garzón Exploiting Multicore Servers to Optimize IMRT Radiotherapy Planning PRESENTER: Savíns Puertas Martín ABSTRACT. Intensity modulated radiation therapy (IMRT) is a highly effective cancer treatment technique that accurately delivers radiation to cancerous tissues while preserving the surrounding healthy organs. In this work, we present a method to exploit multicore servers to address IMRT Radiation Therapy plans (RP) problems. Our method uses a gradient descent algorithm to optimize the generalized Equivalent Uniform Dose parameters, and employs high-performance computing techniques such as parallelization and batching to speed up the computation. To evaluate our proposal, we conducted extensive benchmarking on three distinct multicore platforms with varying micro-architectures, assessed across different batch sizes and thread configurations. The results showcase that our method provides substantial computational speed improvements while consistently generating high-quality RP that conform to clinical constraints, albeit at a high computational cost. The parallelization schemes outlined in this work attain substantial speedups while still delivering clinically feasible plans, ultimately resulting in time savings and reduced workload for medical planners.
10:50	Alberto Mulone, Doriana Medic and Marco Aldinucci A Fault Tolerance Mechanism for Hybrid Scientific Workflows PRESENTER: Alberto Mulone ABSTRACT. In large distributed systems, failures are a daily event occurring frequently, especially with growing numbers of computation tasks and locations on which they are deployed. The advantage of representing an application as a workflow is possibility to utilize the Workflow Management Systems which are reliable systems guaranteeing the correct execution of the application and providing the features such as portability, scalability, and fault tolerance. Over recent years, the emergence of hybrid workflows has posed new and intriguing challenges by increasing the possibility of distributing computations involving heterogeneous and independent environments. As a consequence, the number of possible points of failure in the execution augmented, creating different important challenges interesting to study. This paper presents the implementation of a fault tolerance mechanism for hybrid workflows based on the recovery and rollback approach. A representation of the hybrid workflows with the formal framework is provided, together with the experiments demonstrating the functionality of implementing approach.
11:10	Dante Sánchez-Gallegos, Diana Carrizales, Catherine Alessandra Torres Charles, Alejandro De La Rosa Zequeira, Jose Luis Gonzalez-Compean and Jesus Carretero GeoNimbus: A serverless framework to build earth observation and environmental services PRESENTER: Dante Sánchez-Gallegos ABSTRACT. Cloud computing has become a popular solution for organizations to implement Earth Observation Systems (EOS). However, this produces a dependency on provider resources. Moreover, the management of the execution of tasks and data are challenges that commonly arise when building an EOS. This paper presents GeoNimbus, a serverless framework for composing and deploying spatio-temporal EOS on multiple infrastructures, e.g., any on-premise resources, and public or private clouds. This framework organizes EOS tasks as functions and automatically manages their deployment, invocation, scalability, and monitoring in the cloud. GeoNimbus framework enables organizations to reuse and share available functions to compose multiple EOS. We use this framework to implement EOS as a service for conducting a case study focused on measuring changes in water resources in a lake located in the south of Mexico. The experimental evaluation revealed the feasibility and efficiency of using GeoNimbus to build different earth observation studies.
11:30	Simone Perrotta, Ciro Giuseppe De Vita, Gennaro Mellone, Marco Edoardo Santimaria, Giuseppe Salvi, Marco Lapegna, Massimo Torquati and Angelo Ciaramella Extending a scientific workflow engine with streaming I/O capabilities: DAGonStar and CAPIO PRESENTER: Simone Perrotta ABSTRACT. The increasing complexity and scale of data-intensive scientific workflows necessitate advancements in workflow engines (WFEs) to handle real-time data streams and reduce input/output (I/O) bottlenecks. This paper introduces an innovative approach to enhancing the DAGonStar scientific workflow engine by integrating CAPIO, an in-RAM ad-hoc file system optimized for high-speed data access and low latency. By combining DAGonStar’s robust task orchestration and dependency management with CAPIO’s efficient streaming I/O capabilities, we aim to significantly improve the performance and scalability of scientific workflows. We present the design and implementation of this integration, detailing the architectural modifications required to enable seamless interaction between DAGonStar and CAPIO. The paper includes comprehensive benchmarks and performance evaluations demonstrating the impact of CAPIO on workflow execution times and data handling efficiency. Our findings indicate that the enhanced DAGonStar, equipped with CAPIO, offers a powerful solution for managing and processing large-scale, real-time data streams, thereby advancing the capabilities of scientific computing infrastructure.
11:50	Hanwen Dai, Changbo Chen and Yuxuan Song Accelerating GCN Inference on Small Graphs PRESENTER: Changbo Chen ABSTRACT. Graph convolutional networks (GCNs) have found wide applications through effectively learning node, edge or graph embedding. While many existing works focus on accelerating GCN inference on a single large graph, in this work, we propose an approach to accelerate GCN inference on a large number of small graphs. The main idea is to implement GCN inference fully relying on dense operators, which enables us to rearrange the order of basic operators and to leverage deep learning compilers like TVM to lift the performance for both single graph and batched graphs. Experimentation on typical small graph datasets shows that our approach achieves significant speedup over DGL. It also outperforms TVM on GCN inference for batched graphs.

10:30-12:30 Session 2F: AMTE - Session B

Location: -1.A.05

10:30	Mehdi Goli Invited talk: Expressing and Optimizing Task Graphs in Heterogeneous Programming through SYCL ABSTRACT. For many compute-intensive problems today, heterogeneous computing is inevitable to meet the demands of these applications. Recent heterogeneous systems often contain multiple different accelerators in addition to the host CPU and leveraging the full computational power of such systems requires the management of complex dependencies between the tasks to overlap computation of independent tasks where possible. Heterogeneous programming is not only about implementing and optimizing kernels - complex heterogeneous applications also require the careful orchestration of multiple computational tasks. Modern heterogeneous programming models such as SYCL therefore not only allow to program a diverse set of accelerators with a single, portable programming model, but through their API also provide powerful facilities to manage task dependencies and parallel execution on multiple accelerators. In SYCL’s case, these facilities include explicit event-based synchronization that can also be found in more low-level models such as CUDA or OpenCL. SYCL also comes with mechanisms for automatic dependency management by the runtime implementation. The SYCL buffer and accessor model, which I will introduce in the talk, allows users to easily declare access requirements for data, while the runtime implementation automatically constructs the directed-acyclic graph of task dependencies in the background. This automatic tracking of dependencies between tasks not only relieves the user from the error-prone tasks of manually inserting synchronization into their code, but also provides opportunity for optimization of the task graph. In particular when offloading a series of tasks to an accelerator, there is potential for optimization by reducing the launch overhead or by leveraging faster memories for data exchange between dependent tasks. Further, with SYCL graphs and SYCL kernel fusion, I will present two extensions for the SYCL programming model that have proven very effective to perform such optimization with an easy-to-use API.
11:10	Subhajit Sahu and Kishore Kothapalli GVEL: Fast Graph Loading in Edgelist and Compressed Sparse Row (CSR) formats PRESENTER: Subhajit Sahu ABSTRACT. Efficient IO techniques are crucial in high-performance graph processing frameworks like Gunrock and Hornet, as fast graph loading can help minimize processing time and reduce system/cloud usage charges. This research study presents approaches for efficiently reading an Edgelist from a text file and converting it to a Compressed Sparse Row (CSR) representation. On a server with dual 16-core Intel Xeon Gold 6226R processors and MegaRAID SAS-3 storage, our approach, which we term as GVEL, outperforms Hornet, Gunrock, and PIGO by significant margins in CSR reading, exhibiting an average speedup of 78x, 112x, and 1.8x, respectively. For Edgelist reading, GVEL is 2.6x faster than PIGO on average, and achieves a Edgelist read rate of 1.9 billion edges/s. For every doubling of threads, GVEL improves performance at an average rate of 1.9x and 1.7x for reading Edgelist and reading CSR respectively.
11:30	Lukas Reitz, Ben Gerhards, John Hundhausen and Claudia Fohry Investigating the Performance Difference of Task Communication via Futures or Side Effects PRESENTER: Lukas Reitz ABSTRACT. Asynchronous Many-Tasking (AMT) is a popular approach to program irregular parallel applications. In AMT, the programmer divides the computation into units, called tasks, and an AMT runtime dynamically maps the tasks to workers for processing. AMT runtimes can be classified by their way of task generation and task cooperation. One of the approaches is Future-based Cooperation (FBC). FBC environments may or may not allow side effects (SE), i.e., task communication through read / write accesses to global data. The addition of SE increases expressiveness but may lead to data races. This paper investigates the performance difference of pure FBC programs and FBC programs with SE in a cluster environment. For that, we use a pair of closely related AMT runtimes that support FBC with and without SE, respectively. The latter is introduced in this paper. In first experiments, we observed a similar performance of equivalent benchmark implementations on the two platforms, suggesting that a carefully implemented AMT runtime may make the usage of pure FBC practical.
12:00	Patrick Diehl, Nojoud Nader, Steven R. Brandt and Hartmut Kaiser Evaluating AI-generated code for C\texttt{++}, Fortran, Go, Java, Julia, Matlab, Python, R, and Rust PRESENTER: Patrick Diehl ABSTRACT. This study evaluates the capabilities of ChatGPT versions 3.5 and 4 in generating code across a diverse range of programming languages. Our objective is to assess the effectiveness of these AI models for generating scientific programs. To this end, we asked ChatGPT to generate three distinct codes: a simple numerical integration, a conjugate gradient solver, and a parallel 1D stencil-based heat equation solver. The focus of our analysis was on the compilation, runtime performance, and accuracy of the codes. While both versions of ChatGPT successfully created codes that compiled and ran (with some help), some languages were easier for the AI to use than others (possibly because of the size of the training sets used). Parallel codes---even the simple example we chose to study here---also difficult for the AI to generate correctly.

12:30-13:30Lunch Break

13:30-14:30 Session 3A: ABUMPIMP - Session C

More information of the minisymposium program at https://www.upmem.com/abumpimp-2024/.

Location: -1.A.06

13:30-15:00 Session 3B: DYNRESHPC - Session C

Location: -1.A.04

13:30

Dario Muñoz-Muñoz, Felix Garcia-Carballeria, Diego Camarmas-Alonso, Alejandro Calderon-Mateos and Jesus Carretero

Malleability in the Expand Ad-Hoc parallel file system

PRESENTER: Dario Muñoz-Muñoz

ABSTRACT. In recent years, I/O requirements have significantly increased in applications in the knowledge areas of big data and artificial intelligence.Therefore, there is a solid motivation to improve this I/O to avoid bottlenecks in data access. For this purpose, the Expand Ad-Hoc parallel file system is being designed and developed. Since the I/O workload can change throughout the application's execution, the file system used for those applications must be malleable enough to adjust accordingly. It may need to allocate additional resources or release existing ones to accommodate the workload necessities and not waste them. This work introduces the Expand Ad-Hoc parallel file system design to support malleability and the initial evaluation performed on the HPC4AI Laboratory supercomputer in Torino. The results show that the malleability operations scaled well when the resources allocated to the file system were modified.

14:00

Jean-Baptiste Besnard, Martin Schreiber and Allen D. Malony

Towards a Scale Invariant Syntax for Dynamic Job-Level Workflows

ABSTRACT. High-performance computing (HPC) is undergoing a significant transformation as workloads become increasingly complex. The proliferation of nested parallelism and collocated functionalities in scientific software has naturally increased software complexity, leading to challenges in expressing launch configurations in a portable manner. This paper proposes the concept of self-unfolding dynamic workflows, which aims to alleviate this bottleneck by defining a compact runtime and mapping syntax that enables compute locality and resource composition between multiple jobs. Our scale-agnostic syntax for resource composition allows for job expressivity and dynamic resource utilization, making it easier to deploy and dynamically manage resources for parallel applications. We present a prototype implementation of this syntax over Slurm and an online visualization tool, demonstrating the advantages of this approach in terms of job layout compactness and usability.

13:30-15:30 Session 3C: EuroQHPC - Session C

Location: -1.A.01

13:30	Elise Jennings, Martin Rüfenacht and Stefan Kister $<\mbox{w}\|\mbox{b}>$ : a Quantum Emulation Workbench for Benchmark Construction and Application Development PRESENTER: Martin Rüfenacht ABSTRACT. Quantum Computing is still currently an immature technology with fast evolving research areas around its implementation and application development. Progress, as in any active research area, is not linear and rapid advances are being reported as a result of deep expertise, experimentation and integration into HPC facilities. Quantum emulators are a crucial tool yet there exists a significant gap between algorithm development using quantum emulators and current NISQ devices when considering the system size and the inclusion of noise. We propose a novel emulation framework for quantum computing which (i) uses real time calibration data from devices to build targets instances, (ii) provides a digital twin of the device which can be augmented for continued development, (iii) provides a framework where third party software stacks for emulation and noise modelling can be easily exchanged for rapid experimentation, and (iv) provides a run repository where snapshots are maintained for reproducibility and tracking of workflows. The framework maintains a model repository for the evolving digital twin of the quantum device as it is continually updated and improved. This digital representation aids application development for NISQ devices at system sizes not yet attainable and allows research to continue during hardware downtime.
13:55	Diego García-Vega, Fernando Plou Llorente, Alejandro Leal Castaño, Elias F. Combaro and José Ranilla LazyQML: A Python library to benchmark Quantum Machine Learning models PRESENTER: Alejandro Leal Castaño ABSTRACT. The fast development of Quantum Computing (QC), with its innovations and advantages, proposes a challenge for the progress of Quantum Machine Learning models. This is due to the rapidly evolving frameworks such as Qiskit and PennyLane, in addition to the ad-hoc nature of creating quantum circuits. However, as far as we know, there is no framework that allows for the systematic, flexible, and straightforward comparison of QML models. Mindful of this, in this work, we present a novel Python library with the objective to compare and benchmark a great variety of models and characteristics based on different ansatzes and architectures from the literature.
14:20	Gonzalo Ferro, Oluwatosin Esther Odubanjo, Diego Andrade and Andrés Gómez TNBS: A Kernel-Based Benchmarking for Digital Quantum Computers PRESENTER: Oluwatosin Esther Odubanjo ABSTRACT. The systematic evaluation of the performance of Quantum Computers allows users to identify the best platform to execute a cer- tain class of workload, and it can also guide the future developments of hardware companies. The NEASQC Benchmark Suite (TNBS) is a methodology designed in the context of the NEASQC (NExtS Applica- tion of Quantum Computing) european project to perform such eval- uation across several benchmark cases identified as common in one or several domains of application of Quantum Computing. TNBS follows several design principles identified as relevant after a thorough evalu- ation of the existing benchmarking methodologies: (1) benchmarks are defined at high-level not being linked to any algorithmic approach or im- plementation, (2) benchmarks are scalable (in qubits) and their output is classically verifiable, (3) performance metrics may be defined per case to fit the nature of the outputs of each case, and (4) the benchmark report also keeps record of all the relevant components of the execution stack. Finally, in the future, the reports may be submitted to a centralized repository, which is equipped with a web interface allowing a systematic and objective comparison of different platforms across the different cases

13:30-15:00 Session 3D: HiPES - Session C

Location: -1.A.07

13:30	Melesio Crespo-Sanchez, Hugo German Reyes-Anastacio, Jose Luis Gonzalez-Compean, Jaqueline Calderon-Hernandez and Ignacio Castillo-Barrios Towards the implementation of ONCA: A Generic, Scalable, and Massive Data Processing Platform for Information Discovery and Analytics PRESENTER: Melesio Crespo-Sanchez ABSTRACT. Analysis and visualization are fundamental components of the data-driven decision-making process. In the health and environment fields, various platforms exist for massive data processing to generate information products that assist decision-makers in crafting public policies. These policies are informed by observed data trends to mitigate potential epidemiological impacts on the population. However, most existing solutions focus on either storage, processing, or visualization of data separately, complicating the implementation of comprehensive analysis depending on the data domain. In this work, we present ONCA, a generic, scalable platform for massive data processing, designed to facilitate data analysis using a microservices architecture. ONCA integrates mechanisms for data processing, the creation of observatories, the publication of information products, and user queries, seamlessly automating the interconnection of these components. Designed as a distributed system for deployment on cloud environments, ONCA enables collaboration and information sharing among organizations. This paper details the proposal of the ONCA platform and presents preliminary results from generating information products using environmental data from the Mexican territory.
13:50	Lorenzo Di Rocco, Riccardo Ceccaroni, Umberto Ferraro Petrillo and Pierpaolo Brutti A Distributed Workflow for Long Reads Self-Correction ABSTRACT. Third-Generation Sequencing (TGS) technologies have enabled the extraction of longer nucleotide sequences than NGS technologies, allowing for a deeper understanding of genome structure. Despite becoming pivotal for untangling genetic complexity, such long reads are prone to high sequencing error rates (ranging from 10% to 30%), making their correction a practical challenge in many computational genomics pipelines. This paper proposes a workflow (referred to as HyperC) designed for performing long reads self-correction on a distributed memory system. Leveraging on Message Passing Interface (MPI), our workflow introduces an efficient controller-worker communication pattern to coordinate multiple processes scattered across different computing nodes, and speed up the polishing process on a large input sample of long reads. We also presents the results of an experimental analysis, conducted on an HPC infrastructure, assessing the ability of our workflow to exploit the resources of a distributed system while allowing for much shorter execution times. These results suggest that HyperC is a promising solution for effectively scaling up the analysis of the nowadays growing volume of sequencing data, contributing significantly to the field of eScience.
14:10	Subhajit Sahu, Kishore Kothapalli and Dip Sankar Banerjee GVE-LPA: Fast Label Propagation Algorithm (LPA) for Community Detection in the Shared Memory Setting PRESENTER: Subhajit Sahu ABSTRACT. Community detection is the problem of identifying natural divisions in networks. Efficient parallel algorithms for this purpose are crucial in various applications, particularly as datasets grow to substantial scales. This technical report presents an optimized parallel implementation of the Label Propagation Algorithm (LPA), a high speed community detection method, for shared memory multicore systems. On a server equipped with dual 16-core Intel Xeon Gold 6226R processors, our LPA, which we term as GVE-LPA, outperforms FLPA, igraph LPA, and NetworKit LPA by 139x, 97, 000x, and 40x respectively - achieving a processing rate of 1.4B edges/s on a 3.8B edge graph. In addition, GVE-LPA scales at a rate of 1.7x every doubling of threads.
14:30	Iacopo Colonnelli Real-world high-performance eScience Applications

14:30-15:00 Session 4: DYNRESHPC - Workshop hackathon

Location: -1.A.04