Program for Tuesday, August 26th

Lise Jolicoeur (University of Bordeaux & CEA, France)
Vanessa Sochat (Lawrence Livermore National Laboratory, United States)
Daniel Milroy (Lawrence Livermore National Laboratory, United States)
François Diakhaté (CEA DAM Ile de France, France, France)

Enabling RDMA and GPUs in Rootless Kubernetes for Accelerated HPC and AI Applications

ABSTRACT. HPC workflows increasingly integrate diverse technologies such as artificial intelligence (AI), machine learning (ML), data analytics, databases, and web services. Orchestrators like Kubernetes have been de-signed to facilitate deploying these heterogeneous workloads in cloud environments. Allowing Kubernetes to be launched and managed as aresource on HPC clusters would facilitate the deployment of modernworkflows in HPC environments. To enable the deployment of Kuber-netes by unprivileged HPC users, we evaluate the usability of a rootlessversion of Kubernetes, Usernetes. We analyze synthetic benchmarks as well as HPC and ML proxy apps to evaluate the overhead of Usernetes for HPC/ML workloads deployed on high performance networks and GPUs. While the results show that applications running in Usernetes can take advantage of InfiniBand networks and NVIDA GPUs, some benchmarks incur measurable overheads at scale which warrant further investigation.

10:00-10:30 Session 24: HeteroPar.1

Chair:

José Cano (University of Glasgow, UK)

Location: BAR 205

10:00

Ivan Donchev Kabadzhov (EURECOM, France)
Jose Mordgado (INESC-ID, Instituto Superior Tecnico, Universidade de Lisboa, Portugal)
Aleksandar Ilic (INESC-ID, Instituto Superior Tecnico, Universidade de Lisboa, Portugal)
Raja Appuswamy (EURECOM, France)

Open, cross-architecture acceleration of data analytics with SYCL and RISC-V

ABSTRACT. The past few years have witnessed the growth in popularity of two standards for accelerating AI and analytics. On the hardware front, the advent of RISC-V, an open instruction set architecture, has ushered in a new era in standards-based design and customization of microprocessors. On the programming front, SYCL has emerged as a cross-vendor, cross-architecture, data parallel programming model for all types of accelerators. In this work, we take the first steps towards bringing these two standards together to enable a new line of work on fully-open, vendor-neutral, cross-architecture-accelerated database engines by developing SYCLDB--a SYCL-based library of key relational operations that works together with the oneAPI Construction Kit (OCK) to target multi-vendor CPU and accelerator backends. Using SYCLDB, we perform a comparative evaluation with micro and macrobenchmarks to show that SYCLDB can (i) exploit vectorization provided by RVV accelerators, (ii) provide performance on-par with CUDA counterparts on NVIDIA GPUs, and (iii) exploit multithreading in x64 and RISC-V CPUs, all the while using a single code base.

10:30-11:00Coffee Break

11:00-12:30 Session 25A: PECS.1

Chair:

Romolo Marotta (Università degli Studi Roma Tre, Italy)

Location: BAR 106

11:00	Meven Mognol (CNRS, France) Florestan De Moor (CNRS, France) Erwan Drezen (Pasteur Institute, France) Yann Falevoz (UPMEM, France) Dominique Lavenier (CNRS, France) Evaluating Energy Efficiency of Genomics Algorithms on Processing-in-Memory Architectures ABSTRACT. Processing-in-Memory (PiM) is a novel computing paradigm for reducing data movements between memory and processing units, and thus minimizing energy consumption. PiMs are particularly well-suited to data-intensive applications, where traditional systems are often limited by memory bandwidth. Genomics is a representative example of such a domain, involving massive datasets and repetitive access patterns. PiM architectures are inherently massively parallel, offering the potential for significant performance and energy gains. However, fully exploiting these architectures requires fine-grained parallelism and efficient load balancing across thousands of processing units. In this paper, we evaluate the energy efficiency improvements achieved by running several genomic algorithms on a PiM-based system. Our experiments focus on realistic workloads and highlight the challenges and opportunities of parallelizing genomic tasks for PiM. The most significant gains are observed in large-scale database search applications, which naturally map to the parallel structure of PiM and benefit greatly from reduced data movement.
11:30	Salvatore Cielo (Leibniz Supercomputing Centre, Germany) Alexander Pöppl (Intel Deutschland GmbH, Germany) Ivan Pribec (Leibniz Supercomputing Centre, Germany) SYCL for Energy-Efficient Computational Astrophysics: the case of DPEcho ABSTRACT. Energy awareness and efficiency policies are gaining more attention, over pure performance (time-to-solution) Key Performance Indicators (KPIs) when comparing the possibilities offered by accelerated systems. But in a field such as numerical astrophysics, which is struggling with code refactorings for GPUs, viable porting paths have to be shown before first. After summarizing the status and recurring problems of astrophysical code accelerations, we highlight how the field would benefit from portable, vendor-agnostic GPU portings. We then employ the DPEcho SYCL benchmark to compare raw performance and energy efficiency for heterogeneous hardware on a realistic application, with the goal of helping computational astrophysicists and HPC providers make informed decisions on the most suitable hardware. Aside from GPUs showing higher efficiency, we argue on the more informative nature of energy-aware KPIs, in that they convey the specific device performance in a data-driven way. We also present a convenient, flexible and cross-platform energy-measuring pipeline. Finally, we contextualize our results through measures with different compilers, presenting device ("at the cores") versus node ("at the plug") energy and comparing DPEcho with the High-Performance Linpack (HPL) benchmark.
12:00	Guillaume Raffin (LIG, Univ. Grenoble Alpes, France) Denis Trystram (Grenoble Alpes university, France) Olivier Richard (LIG Laboratory Grenoble, France) Alumet: a modular framework to standardize the measurement of energy consumption ABSTRACT. The increasing energy consumption of ICT (Information and Communication Technology) has become an important concern driven by the growing demand for computational resources. Measuring the energy consumption is becoming essential to make users aware of their impact, comply with new regulations, compare different solutions and optimize systems. This need has led to the development of various software tools that aim to measure or estimate the energy consumed by a piece of hardware or software. However, existing tools are often very limited in scope, offer too few options to the users, can consume too many resources, may return erroneous data, and have architectures that hinder their progress over time. This paper introduces Alumet, a novel framework designed to offer an extensible standard for energy measurement in a wide range of scenarios. With Alumet, it becomes possible to build made-to-measure measurement tools without the need to redevelop everything from scratch. A generic pipeline enables the gathering, transformation and export of any type of measurement. A plugin system allows to support new environments and new models without modifying the framework, thus making it ready for future innovations. With multiple experiments, we show that Alumet can be deployed in varying contexts, on different hardware and software stacks. The benchmarks reveal that our new tool can facilitate the development of estimation models while reducing the overhead of energy monitoring and supporting higher acquisition frequencies.

11:00-12:30 Session 25B: HeteroPar.2

Chair:

José Cano (University of Glasgow, UK)

Location: BAR 205

11:00	Loris Belcastro (University of Calabria, Italy) Nicola Gabriele (University of Calabria, Italy) Fabrizio Marozzo (University of Calabria, Italy) Alessio Orsino (University of Calabria, Italy) Domenico Talia (University of Calabria, Italy) Paolo Trunfio (University of Calabria, Italy) Rosa María Badia (Barcelona Supercomputing Center, Spain) Francesc Lordan (Barcelona Supercomputing Center, Spain) Federated Learning in the Edge-Cloud Continuum: A Task-Based Approach with Colony PRESENTER: Alessio Orsino ABSTRACT. The edge-cloud continuum enables distributed machine learning by leveraging the complementary strengths of edge devices and centralized cloud resources. Federated Learning (FL) has emerged as a key paradigm in this context, allowing collaborative model training across multiple parties without sharing raw data, thus preserving privacy, reducing communication costs, and supporting compliance with data protection regulations. However, orchestrating FL workflows across heterogeneous edge-cloud environments introduces significant challenges related to task coordination, resource management, and scalability. In this paper, we propose using the Colony framework to address these challenges through a task-based approach to FL. Colony allows developers to define an FL workflow as parallel tasks automatically scheduled across heterogeneous resources. We show how this task-based model supports core FL operations—such as local training, model aggregation, and synchronization—within a unified execution framework. Experiments on a medical imaging use case demonstrate that Colony enables scalable and efficient orchestration of FL tasks across heterogeneous environments while ensuring that sensitive data remain local. These results highlight the applicability and advantages of task-based programming models for privacy-preserving machine learning across the compute continuum.
11:30	Juan José Ropero (Universidad de Valladolid, Spain) Manuel de Castro (Universidad de Valladolid, Spain) Diego R. Llanos (Universidad de Valladolid, Spain) OpenDwarfs 2025: Modernizing the OpenDwarfs Benchmark Suite for Heterogeneous Computing ABSTRACT. As the era of heterogeneous computing evolves, benchmarking tools are vital for measuring performance across diverse architectures. We present OpenDwarfs 2025, a reengineered and modernized version of the OpenDwarfs benchmark suite, originally developed to evaluate the performance of heterogeneous systems using OpenCL. Our comprehensive reengineering process involved addressing compatibility issues with modern compilers, resolving bugs, and enhancing usability to align the suite with the latest hardware advancements. Key updates include improved scalability, standardized configurations, and enriched documentation, making the suite more accessible to the research community. Experimental results highlight the enhanced performance and portability of OpenDwarfs 2025 across diverse platforms, offering valuable insights into the evaluation of parallel computing systems amidst rapidly advancing architectures.
12:00	Måns I. Andersson (KTH, Sweden) Hugo Martin Christian Karp (KTH Royal Institute of Technology, Sweden) Niclas Jansson (PDC, Sweden) Stefano Markidis (KTH Royal Institute of Technology, Sweden) Portable High-Performance Kernel Generation for a Computational Fluid Dynamics Code with DaCe ABSTRACT. With the emergence of new high-performance computing (HPC) accelerators, such as Nvidia and AMD GPUs, efficiently targeting diverse hardware architectures has become a major challenge for HPC application developers. The increasing hardware diversity in HPC systems often necessitates the development of architecture-specific code, hindering the sustainability of large-scale scientific applications. In this work, we leverage DaCe, a data-centric parallel programming framework, to automate the generation of high-performance kernels. DaCe enables automatic code generation for multicore processors and various accelerators, reducing the burden on developers who would otherwise need to rewrite code for each new architecture. Our study demonstrates DaCe's capabilities by applying its automatic code generation to a critical computational kernel used in Computational Fluid Dynamics (CFD). Specifically, we focus on Neko, a Fortran-based solver that employs the spectral-element method, which relies on small tensor operations. We detail the formulation of this computational kernel using DaCe's Stateful Dataflow Multigraph (SDFG) representation and discuss how this approach facilitates high-performance code generation. Additionally, we outline the workflow for seamlessly integrating DaCe's generated code into the Neko solver. Our results highlight the portability and performance of the generated code across multiple platforms, including Nvidia GH200, Nvidia A100, and AMD MI250X GPUs, with competitive performance results. By demonstrating the potential of automatic code generation, we emphasize the feasibility of using portable solutions to ensure the long-term sustainability of large-scale scientific applications.

11:00-12:00 Session 25C: VHPC.2

Chair:

Michael Alexander (Austrian Academy of Sciences, Austria)

Location: BAR 218

11:00

Manoj Patra (LNMIIT, Jaipur & National Institute of Technology, Rourkela, India)
Vijayalakshmi Saravanan (University of Texas at Tyler & BNL, United States)
Khaled Ibrahim (Lawrence Berkeley National Labratory, United States)

Performance Analysis of Container-in-VM Architectures: A Study on Hypervisor Isolation and Lightweight OS Integration

ABSTRACT. Container-in-VM architectures have become essential in modern cloud-native infrastructure, offering a secure yet flexible approach to deploying containerized workloads. This paper comprehensively reviews the architectural and performance trade-offs associated with deploying containers inside virtual machines, focusing on hypervisor-level isolation and lightweight operating system integration. We systematically examine a decade of research and industry innovations, synthesizing findings from benchmarking studies and architectural implementations. The paper introduces a comparative framework covering isolation strength, runtime overhead, boot latency, OS footprint, and orchestration compatibility, highlighting how solutions like Firecracker, Kata Containers, and unikernels deliver varying balances of efficiency and security. A significant contribution is the analytical contrast between general-purpose and minimalistic operating systems when used as VM guests, revealing substantial gains in performance and attack surface reduction with specialized OS choices. Similarly, hypervisor technologies are evaluated interms of their ability to support microVMs and fast container lifecycle management, with observations drawn from real-world deployments and empirical studies. The findings demonstrate that combining lightweight OS stacks with hardware-assisted virtualization can significantly enhance container performance, though challenges persist in orchestration, observability, and standardization. This review identifies critical gaps andoffers direction for future research in building secure, efficient, and scalable container-in-VM platforms suitable for cloud, edge, and serverless environments.

11:30

Enrico Fiasco (University of Pisa, Italy)
Marco Danelutto (University of Pisa, Italy)
Patrizio Dazzi (University of Pisa, Italy)

WebAssembly and Unikernels: A Comparative Study for Serverless at the Edge

ABSTRACT. Serverless computing at the edge requires lightweight execution environments to minimize cold start latency, especially in Urgent Edge Computing (UEC). This paper compares WebAssembly and unikernel-based MicroVMs for serverless workloads. We present Limes, a WebAssembly runtime built on Wasmtime, and evaluate it against the Firecracker-based environment used in SPARE. Results show that WebAssembly offers lower cold start times for lightweight functions but suffers with complex work-loads, while Firecracker provides higher, but stable, cold starts and better execution performance, particularly for I/O-heavy tasks.

12:30-14:00Lunch Break

14:00-15:30 Session 26A: PECS.2

Chair:

Romolo Marotta (Università degli Studi Roma Tre, Italy)

Location: BAR 106

14:00	Marcelo Augusto Sudo (Federal University of Sao Paulo (UNIFESP), Brazil) Alvaro Luiz Fazenda (Federal University of Sao Paulo (UNIFESP), Brazil) Roberto Pinto Souto (National Scientific Computing Laboratory (LNCC), Brazil) Mixed precision over GPU applied to a Microphysics model ABSTRACT. In high-performance computing, mixed-precision approaches optimize computational efficiency while maintaining acceptable accuracy. This study evaluates mixed-precision arithmetic (FP64, FP32, FP16) in the MPAS atmospheric model, focusing on the WSM6 microphysics scheme accelerated via GPU. We analyze trade-offs between precision, performance, and energy efficiency, proposing methods to control accu- racy loss. Experimental results demonstrate that reducing precision from FP64 to FP16 yields a 2.87× speedup and 70.72% improvement in energy efficiency (3.20 GFLOPS/W for FP16 vs. 0.94 for FP64). Accuracy, as- sessed via M SE, RM SE and M SEnorm for hydrometeor variables (e.g., water vapor qv), remains within acceptable limits, with FP16 M SEnorm of 4×10−2 for the variable exhibiting the most significant difference com- pared to FP64 benchmarks. While FP32 offers the best balance, FP16 proves viable for scenarios tolerant to minor precision loss. This work confirms that mixed precision can enhance large-scale climate simula- tions without compromising meteorological validity, providing actionable insights for energy-aware HPC deployments.
14:30	Botond Szirtes (Eötvös Loránd University, Hungary) Melinda Tóth (Eotvos Lorand University, Budapest, Hungary) Comparative Analysis of Energy Efficiency in Actor-Based Applications in Distributed Environments ABSTRACT. As energy efficiency becomes vital in scalable software design, understanding the energy behavior of concurrent and distributed programming models is increasingly important. Actor-based frameworks are widely used for building fault-tolerant distributed applications, but their energy characteristics remain underexplored in real-world deployment scenarios. This paper presents a comparative analysis of actor-based applications written in Erlang and the C++ Actor Framework (CAF), focusing on both native and distributed execution. Using a reproducible benchmarking framework with power telemetry, this study evaluates energy and runtime performance under varying workloads. Results show that CAF is more efficient in native, compute-heavy tasks, while Erlang outperforms in distributed scenarios due to its integrated support for concurrency and inter-node communication. These findings highlight the role of deployment context in shaping energy usage.
15:00	Max Lübke (University of Potsdam, Germany) Dorian Stoll (University of Potsdam, Germany) Bettina Schnor (University of Potsdam, Germany) Stefan Petri (Potsdam Institute for Climate Impact Research (PIK), Germany) HPC Benchmark Game: Comparing Programming Languages Regarding Energy-Efficiency for Applications from the HPC Field ABSTRACT. This paper presents a benchmark suite for the HPC field, called the HPC Benchmark Game which allows comparing programming languages and compilers regarding their runtime performance and energy-efficiency. We started with 3 compiled languages (C, C++ and Fortran) and Julia which is a just-in-time compiled language. Julia has native support for threads, distributed computing, and GPU offloading, which makes it a promising candidate for HPC. For each language, we picked one benchmark as reference and re-implemented it for the other languages. This paper describes our guidelines for the re-implementation. Further, we demonstrate the benefit of the Benchmark Suite through measurements on a 128-core node. The results help an HPC programmer to decide which languages and compilers are recommended on a system for energy-efficiency. The presented results show that HPC developers still have to invest some effort to find an energy-efficient implementation of their algorithm on modern manycore architectures.

14:00-15:30 Session 26B: HeteroPar.3

Chair:

José Cano (University of Glasgow, UK)

Location: BAR 205

14:00	Martin Rose (University of Stuttgart, Germany) Simon Homes (Technische Universität Berlin, Germany) Lukas Ramsperger (University of Stuttgart, Germany) Jose Gracia (High Performance Computing Center Stuttgart, Germany) Christoph Niethammer (HRLS, Universität Stuttgart, Germany) Jadran Vrabec (Thermodynamics and Process Engineering, Technical University of Berlin, Germany) Cyclic Data Streaming on GPUs for Short Range Stencils Applied to Molecular Dynamics PRESENTER: Martin Rose ABSTRACT. In the quest for highest performance in scientific computing, we present a novel framework that relies on high-bandwidth communication between GPUs in a compute cluster. The framework offers linear scaling of performance for explicit algorithms that is only limited by the size of the dataset and the number of GPUs. Slices of the dataset propagate in a ring of processes (GPUs) from one GPU, where they are processed, to the next, which results in a parallel-in-time parallelization. The user of the framework has to write GPU kernels that implement the algorithm and provide slices of the dataset. Knowledge about the underlying parallelization strategy is not required because the communication between processes is carried out by the framework. As a case study, molecular dynamics simulation based on the Lennard-Jones potential is implemented to measure the performance for a homogeneous fluid. Single node performance and strong scaling behavior of this framework is compared to LAMMPS, which is outperformed in the strong scaling case.
14:30	Ivan Tagliaferro de Oliveira Tezoto (CNRS/CRIStAL UMR 9189, Centre Inria de l’Université de Lille, France; University of Luxembourg, SnT, Luxembourg, France) Guillaume Helbecque (Université de Lille, CNRS/CRIStAL UMR 9189, Centre Inria de l’Université de Lille, France, France) Ezhilmathi Krishnasamy (University of Luxembourg, FSTM-DCS, Luxembourg, Luxembourg) Nouredine Melab (Université de Lille, CNRS/CRIStAL UMR 9189, Centre Inria de l’Université de Lille, France, France) Grégoire Danoy (University of Luxembourg, FSTM-DCS, Luxembourg; University of Luxembourg, SnT, Luxembourg, Luxembourg) A Portable Branch-and-Bound Algorithm for Cross-Architecture Multi-GPU Systems ABSTRACT. Modern supercomputers are often heterogeneous, typically using GPUs from either Nvidia or AMD, which constrains implementation choices for parallel applications. This investigation considers two implementation approaches within the context of parallel tree-based exact optimization, using the Branch-and-Bound (B&B) algorithm as a test case. The first combines OpenMP with GPU programming APIs, such as CUDA and HIP, to exploit multiple GPUs within one compute node for different vendor architectures. The second one is based on the PGAS-based Chapel language, which treats threaded and GPU programming in a unified and portable way. We revisit the design of a portable multi-GPU-accelerated Chapel implementation of the B&B algorithm and propose a C-based counterpart. Our contribution involves a low-level implementation of its multi-pool data structure equipped with a dynamic load balancing mechanism, and a GPU optimization specific to a low-level setup. The two approaches are applied to the Permutation Flowshop Scheduling Problem using up to 8 GPUs, and tested on AMD MI250x and Nvidia A100 GPU architectures. Our results show substantially better performance and scalability on both architectures compared to Chapel implementation. An in-depth study of the load balancing mechanism in a multi-GPU setup provides an analysis of the algorithm’s design choices. These findings highlight the performance advantages of low-level implementations using CUDA and HIP over Chapel, while maintaining portability across different GPU architectures.
15:00	Joachim Jenke (RWTH Aachen University, Germany) Ben Thärigen (RWTH Aachen University, Germany) Kaloyan Ignatov (RWTH Aachen University, Germany) Tobias Dollenbacher (RWTH Aachen University, Germany) Simon Schwitanski (RWTH Aachen University, Germany) Tracking the Critical Path of Execution for GPU Offloading Applications ABSTRACT. The critical path of execution determines the lower bound of the parallel execution time of a program. Tracking the critical path of execution for GPU offloading applications involves analyzing the sequence of tasks and dependencies that dictate how data is processed on a GPU. Such critical path analysis is essential for optimizing performance and ensuring the workload is balanced effectively between the CPU and GPU or multiple GPUs. Key steps include identifying bottlenecks, measuring execution time for each task, and adjusting the workflow to minimize delays. Techniques such as profiling tools and execution graphs can help visualize the critical path, providing insights into optimizing GPU utilization and overall application efficiency. By focusing on the critical path, developers can enhance the performance of applications that leverage GPU acceleration. In this work, we identify the key tool entry points to track the critical path for applications using GPU offloading based on CUDA or OpenMP target offloading. While CUDA offloading might provide better performance, OpenMP offloading provides more control to define dependencies between compute tasks. We extend the on-the-fly critical path tool (OTF-CPT) to support GPU offloading. With this integration, we can follow the critical path of execution through GPU kernels, MPI communication, and all of the OpenMP synchronization features.

15:30-16:00Coffee Break

16:00-17:00 Session 27A: PECS.3

Chair:

Romolo Marotta (Università degli Studi Roma Tre, Italy)

Location: BAR 106

16:00

Abdessalam Benhari (Université de Grenoble Alpes, France)
Yves Denneulin (Université de Grenoble Alpes, France)
Frédéric Desprez (INRIA, France)
Fanny Dufossé (INRIA, France)
Denis Trystram (Université de Grenoble Alpes, France)

Analysis of the carbon footprint of HPC

ABSTRACT. The demand in computing power has never stopped growing over the years. Today, the performance of the most powerful systems exceeds the exascale. Unfortunately, this growth also comes with ever-increasing energy costs, leading to a high carbon footprint. This paper investigates the evolution of high performance systems in terms of carbon emissions. A lot of studies focus on Top500 (and Green500) as the tip of an iceberg to identify trends in the domain in terms of computing performance. We propose here to go further in considering the whole span life of several large scale systems and to link the evolution with trajectory toward 2030. More precisely, we introduce the energy mix in the analysis of Top500 systems and we derive a predictive model for estimating the weight of the HPC domain for the next 5 years.

16:30

Miray Ozcan (Minerva University, United States)
Philipp Wiesner (Technical University Berlin, Germany)
Philipp Jan Weiß (Technical University Berlin, Germany)
Odej Kao (Technical University Berlin, Germany)

Quantifying the Energy Consumption and Carbon Emissions of LLM Inference via Simulations

ABSTRACT. The environmental impact of Large Language Models (LLMs) is rising significantly, with inference now accounting for more than half of their total lifecycle carbon emissions. However, existing simulation frameworks, which are increasingly used to determine efficient LLM deployments, lack any concept of power and, therefore, cannot accurately estimate inference-related emissions. We present a simulation framework to assess the energy and carbon implications of LLM inference under varying deployment setups. First, we extend a high-fidelity LLM inference simulator with a GPU power model that estimates power consumption based on utilization metrics, enabling analysis across configurations like batch size, sequence length, and model parallelism. Second, we integrate simulation outputs into an energy system co-simulation environment to quantify carbon emissions under specific grid conditions and explore the potential of carbon-aware scheduling. Through scenario-based analysis, our framework reveals how inference parameters affect energy demand and carbon footprint, demonstrates a renewable offset potential of up to 69.2% in an illustrative deployment case, and provides a foundation for future carbon-aware inference infrastructure design.

16:00-17:30 Session 27B: HeteroPar.4

Chair:

José Cano (University of Glasgow, UK)

16:00	Allen Malony (University of Oregon, United States) Michael Dushkoff (University of Oregon, United States) Grace McLewee (University of Oregon, United States) Kevin Huck (AMD, United States) SIMON: A Simple Monitoring Framework for Heterogeneous Application Observability ABSTRACT. Sophisticated observability solutions have been developed in cloud computing environments to address the complexity of scaled-out distributed enterprise systems and understand the factors affecting application execution and performance in otherwise opaque cloud operations. In contrast, observability in high-performance computing has focused more on detailed measurements of application execution on processors (CPUs, GPUs), memory, interconnection networks, and I/O for purposes of performance optimization. With increasing heterogeneity in HPC hardware, software, and application types, monitoring to observe application measurements at runtime and enable telemetry for in situ processing is gaining in importance for purposes of adaptive execution and dynamic resource management. The paper presents a simple monitoring framework for heterogeneous HPC application observability called SIMON. By ``simple'' we mean that SIMON should offer functionality that is easy to use, works out-of-the-box with heterogeneous applications and systems, is programmable and extensible, and can be configured to meet a range of observability requirements. A SIMON prototype is presented and examples shown for different heterogeneous monitoring scenarios that demonstrate its capabilities.
16:30	Manuel de Castro Caballero (Universidad de Valladolid, Spain) Sergio Alonso Pascual (Universidad de Valladolid, Spain) Rubén Gran Tejero (Universidad de Zaragoaza, Spain) Yuri Torres (Universidad de Valladolid, Spain) Arturo Gonzalez-Escribano (Universidad de Valladolid, Spain) Exploiting highly heterogenous systems with stencil applications ABSTRACT. While CPUs, GPUs, and FPGAs present particular advantages for different classes of High Performance Computing Applications, Iterative Stencil Loop (ISL) is a class of parallel applications with efficient implementations for all of them. Thus, this class is an interesting case of use to test the potential of using simultaneously different classes of heterogeneous devices. EPSILOD is a parallel skeleton framework designed to easily program and deploy ISL applications on heterogeneous platforms. It provides a programming abstraction and a transparent coordination system to work with different types of devices. In this work, we improve EPSILOD and its performance portability layer to support key features for developing and operating optimized kernels on FPGAs. The new EPSILOD version allows the execution of efficient stencil programs simultaneously on CPUs, GPUs, and FPGA accelerators. We discuss how to test the computing power of each different device and use this information in an EPSILOD data-partition policy to obtain a balanced load distribution across them. We present an experimental study using a classical heat-transfer stencil example in a highly heterogeneous system to show the efficiency of EPSILOD programs when using a CPU, a GPU, and an FPGA simultaneously.
17:00	Marcelo Torres Do Ó (Escola de Artes, Ciências e Humanidades, Universidade de São Paulo, Brazil) Daniel Cordeiro (Escola de Artes, Ciências e Humanidades, Universidade de São Paulo, Brazil) Veronika Rehn-Sonigo (Université Marie et Louis Pasteur, CNRS, FEMTO-ST institute, France) Green Energy Aware Scheduling of Scientific Workflows with Flexible Deadlines ABSTRACT. Scientific workflows require significant computational power, resulting in considerable energy consumption and carbon emissions. While traditional scheduling techniques reduce energy usage through machine heterogeneity, they remain constrained by hardware attributes. Renewable energy sources are an alternative to minimize environmental impact. Solar energy, though intermittent, creates temporal energy heterogeneity that can be leveraged even across homogeneous machines. However, the intermittency may lead to task delay, increasing the workflow’s finish time (makespan), a key user concern. We propose a renewable energy-aware scheduling algorithm to minimize non-renewable energy usage and makespan. The user must provide the algorithm with a deadline, which the algorithm uses to delay the tasks to use more renewable energy. Our evaluation using real and synthetic workflows demonstrates that, depending on the user’s flexibility, the usage of brown energy can be reduced drastically. The algorithm can save 99.98% of non-renewable energy with a makespan increase of 137.37%. Under limited renewable energy availability, non-renewable energy usage is still reduced by 12.50%, with only a 10.28% increase in makespan.

17:00-17:10 Session 28: PECS: Closing Session

Location: BAR 106

17:30-17:35 Session 29: HeteroPar: Closing Session

Location: BAR 205

19:00-22:30 Welcome Reception