EURO-PAR 2025: 31ST INTERNATIONAL EUROPEAN CONFERENCE ON PARALLEL AND DISTRIBUTED COMPUTING
PROGRAM FOR MONDAY, AUGUST 25TH

10:00-10:30 Session 6A: HiPES.1
10:00
Francesco De Micco (University of Naples "Parthenope", Italy)
Francesca Formisano (University of Naples "Parthenope", Italy)
Diana Di Luccio (University of Naples Parthenope, Italy)
Ciro Giuseppe De Vita (University of Naples "Parthenope", Italy)
Gennaro Mellone (University of Naples "Parthenope", Italy)
Dante Sánchez-Gallegos (Universidad Carlos III de Madrid, Spain)
Pasquale De Luca (University of Naples "Parthenope", Italy)
Emanuel Di Nardo (University of Naples "Parthenope", Italy)
Vincenzo Capozzi (University of Naples "Parthenope", Italy)
Angelo Ciaramella (University of Naples "Parthenope", Italy)
A framework for flooding early warning leveraging AI, HPC, and computing continuum

ABSTRACT. The need to obtain timely short-term weather forecasts is an increasingly pressing and relevant topic. Although groundbreaking improvements in weather forecast computational modeling on the latest high-performance computing architectures have been made, intense rain events and the related flooding effects in inhabited areas are sub-grid phenomena that are still hard to detect using physics-based simulations. In this scenario, coupling weather numerical models with run-off and flooding models to generate early warnings is technologically possible but prone to inconsistent results. However, using weather radar data acquired in real time to generate synthetic nowcasted data with an outlook of a few hours could be a feasible solution, leveraging the paradigm of the computing continuum. In this work, we present a scientific workflow that produces high-resolution flooding early warning maps by coupling weather radar acquisition, synthetic weather radar data, and a parallel flooding model. The early results are promising, demonstrating that the system prototype can produce high-resolution flooding risk nowcasts with a finer level of detail than currently available systems.

10:10-10:30 Session 7: Workshop WSCC.1
10:10
Cuong Pham-Quoc (HCMUT, Viet Nam)
Minh-Thu Le-Ngoc (Ho Chi Minh City University of Technology (HCMUT), Viet Nam)
Nhat Huynh-Trung (Ho Chi Minh City University of Technology (HCMUT), Viet Nam)
Thanh-Thien Do-Huu (Ho Chi Minh City University of Technology (HCMUT), Viet Nam)
Efficient FPGA-based GAN Accelerator Core for Edge-AI Platforms

ABSTRACT. Generative Adversarial Networks (GANs) have shown remarkable success in image generation and related tasks; however, their high computational complexity poses challenges for deployment in edge computing environments. This paper presents an efficient hardware architecture for accelerating the generator component of GANs using Field Programmable Gate Arrays (FPGAs), with a focus on real-time performance and energy efficiency. The proposed system employs a fully pipelined and parallelized design incorporating four Deconvolution Multi-Kernel Processors (DCMKPs), optimized for spatial processing of image segments. The design is implemented in Verilog and evaluated on the Xilinx Kria KV260 and ZCU106 platforms; benchmarking with the MNIST and CelebA datasets demonstrates its effectiveness. The design achieves up to a 3.17× speed-up over the ARM Cortex-A53 and more than 10× higher energy efficiency than an Intel Core i3-1005G1 CPU, reaching 149.7 GOPS while maintaining low power consumption (4.2 W). The results confirm the feasibility of deploying high-performance GANs on resource-constrained edge devices, establishing a foundation for scalable and adaptive edge-AI applications.

10:30-11:00 Coffee Break
11:00-12:20 Session 8A: WSCC.2
11:00
Klaus Nölp (University of Hagen, Germany)
Lena Oden (University of Hagen, Germany)
Simplifying distributed workflows: A portable approach for Cloud and HPC

ABSTRACT. Managing computational workflows often presents challenges due to complex systems and specialized languages, hindering adoption and portability. This work introduces a simplified, portable approach for executing distributed workflows in Python, leveraging the Parsl library for task definition and Globus Compute for remote execution. By combining these tools, we enable streamlined workflow deployment across Cloud and HPC environments, including Kubernetes and Slurm. We demonstrate the approach’s flexibility by executing both an RNA sequencing workflow in Python and a Common Workflow Language (CWL) demonstrator, highlighting support for heterogeneous tools and resource managers. Our results demonstrate a significant reduction in deployment complexity and improved portability.
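
As a rough illustration of the remote-execution half of this approach (not the authors' code), the Globus Compute SDK lets a plain Python function be submitted to a registered endpoint on a Cloud or HPC system; the endpoint UUID and the task function below are placeholders, and in the paper such calls are combined with Parsl task definitions:

    # Minimal sketch: submit a Python function to a (placeholder) Globus Compute endpoint.
    from globus_compute_sdk import Executor

    def count_gc_content(seq: str) -> float:
        """Toy task, e.g. one step of an RNA-sequencing workflow."""
        return (seq.count("G") + seq.count("C")) / len(seq)

    ENDPOINT_ID = "00000000-0000-0000-0000-000000000000"  # placeholder endpoint UUID

    with Executor(endpoint_id=ENDPOINT_ID) as gce:
        future = gce.submit(count_gc_content, "AUGGCCGCUA")  # runs on the remote system
        print(future.result())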

11:20
Laurent Morin (Université Rennes, France)
François Bodin (Université Rennes, France)
Germaine Nyastikor (Université Rennes, France)
HPC Software as a Service: A Flexible Approach to Data Logistics

ABSTRACT. The use of High-Performance Computing (HPC) facilities is rapidly evolving. These facilities are now being leveraged to process the vast amounts of data generated by scientific instruments. While cloud facilities are designed to provide scalable processing capacity, the operational constraints of HPC centers prevent deploying applications in the same way.

The need to support cross-facility workflows, which involve heterogeneous tasks (AI, compute, HPDA) distributed across multiple infrastructures (the digital continuum), is another critical aspect of this evolution. These workflows can be seen as a collaborative system of systems, each system having its own modus operandi.

In this paper, we propose an approach to deploy scientific applications as a service (SaaS) on HPC systems. This proposal aims to design a trade-off that can be deployed on the majority of systems. To do so, we rely on task runners extended in a way to provide two capabilities: the execution of jobs on the host system (application service), and a mechanism for implementing data logistics.

The proposed architecture is designed to enable scientific application developers to deploy their code as a SaaS, making it available to the community. We address how to build task runners compatible with a workflow language and introduce the concept of Ephemeral Buffers to manage data logistics between tasks. Within a collaborative system-of-systems context, we minimize the need for strong inter-infrastructure collaboration, in compliance with cybersecurity rules and HPC center regulations.
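
The Ephemeral Buffer mechanism is specific to this paper; purely as an illustration of the general idea of short-lived staging areas between tasks (not the authors' interface), a task runner could stage inputs into a temporary buffer that exists only for the duration of a task:

    import shutil
    import subprocess
    import tempfile
    from pathlib import Path

    def run_with_ephemeral_buffer(command, inputs):
        """Stage inputs into a short-lived buffer, run the task, then discard the buffer."""
        with tempfile.TemporaryDirectory(prefix="ephemeral-buffer-") as buf:
            buf_path = Path(buf)
            for src in inputs:
                shutil.copy2(src, buf_path / src.name)          # data logistics: stage-in
            subprocess.run(command, cwd=buf_path, check=True)   # application service runs here
            # A real implementation would also stage selected outputs out before cleanup.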

11:40
Carlos Barrios Hernandez (SC3UIS-CAGE, LIG/INRIA-DataMove, CITI/INRIA-Sindy, Colombia)
Yves Denneulin (Grenoble-INP - LIG/INRIA, France)
Frederic Le Mouel (INSA - CITILAB, France)
A Holistic Approach to Complexity Management and Multidimensional Analysis in Computing Continuum

ABSTRACT. Computing Continuum systems feature various levels with vertical and horizontal scales. As a collaborative platform, they integrate diverse processing resources, including a wide range of computational environments and services aimed at enhancing performance and efficiency. Analyzing their components highlights focus areas that may introduce complexity, complicating the evaluation of technical elements against requirements, of application efficiency, and of service performance with respect to energy use. This paper presents a novel multidimensional analysis approach that utilizes key metrics to clarify system characteristics and behaviors. This framework combines technical performance management through Quality of Service (QoS), expectation definitions via Service Level Agreements (SLAs), and workload satisfaction through Quality of Experience (QoE) across two tiers of a Computing Continuum system. This perspective helps identify vital traits of multiscale systems, improving the assessment and enhancement of Computing Continuum services.

12:00
Mohsen Seyedkazemi Ardebili (University of Bologna, Italy)
Andrea Bartolini (University of Bologna, Italy)
Light Weight Scalable DevOps for Cloud Robotics

ABSTRACT. Cloud robotics applications leverage the scalability, computational power, and advanced machine learning capabilities of cloud infrastructures, offering transformative potential across industries. However, their development is hindered by challenges such as dynamic resource allocation, seamless inter-process communication, and scalable orchestration of distributed systems. This paper presents a lightweight and scalable DevOps framework that integrates Robot Operating System 2 (ROS2) with Kubernetes, automating deployment, monitoring, and resource management. The framework utilizes Vagrant for virtual machine provisioning, Docker for containerization, Helm for package management, and Prometheus-Grafana for real-time monitoring, achieving a full deployment in just 15 minutes while ensuring iterative prototyping and resource optimization on resource-constrained devices.

This work advances the state of the art by enabling robust inter-pod communication and dynamic scaling of ROS2 applications. Our framework demonstrates remarkable scalability, expanding from 1 to 326 pods in just 370 seconds—even with limited hardware resources. Through seamless CI/CD integration, the framework reduces barriers for researchers and practitioners, accelerating innovation in cloud robotics. It lays the foundation for developing accessible, scalable, and cloud-native robotic systems, facilitating advancements in fields such as autonomous vehicles, industrial automation, and telemedicine.
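
As a hedged illustration of the dynamic-scaling aspect (the deployment name and namespace are placeholders, not the paper's Helm charts), the Kubernetes Python client can adjust the replica count of a Deployment running ROS2 worker pods:

    from kubernetes import client, config

    def scale_deployment(name, namespace, replicas):
        """Patch the replica count of a Deployment (e.g. ROS2 worker pods)."""
        config.load_kube_config()          # or config.load_incluster_config() inside a pod
        apps = client.AppsV1Api()
        apps.patch_namespaced_deployment_scale(name, namespace, {"spec": {"replicas": replicas}})

    # Placeholder names for illustration only:
    # scale_deployment("ros2-workers", "robotics", 326)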

11:00-12:30 Session 8B: DynResHPC.1
11:00
Dominik Huber (Technical University Munich, Germany)
Martin Schreiber (Université Grenoble Alpes, France)
Martin Schulz (Technical University Munich, Germany)
Howard Pritchard (Los Alamos National Laboratory, United States)
Daniel Holmes (Intel Corporation, UK)
Design Principles of Dynamic Resource Management for High-Performance Parallel Programming Models

ABSTRACT. With Dynamic Resource Management (DRM), the resources assigned to a job can be changed dynamically during its execution. From the system's perspective, DRM opens a new level of flexibility in resource allocation and job scheduling. It therefore has the potential to improve system efficiency metrics such as utilization rate, job throughput, energy efficiency, and responsiveness, and it is key to efficiency in scenarios such as urgent computing. From the application perspective, users can tailor the resources they request to their needs, offering potential optimizations in queuing time or charged costs.

Despite these well-known advantages and many attempts over the last decade to establish DRM in HPC, it remains a concept discussed in academia rather than being successfully deployed on production systems. This stems from the fact that support for DRM requires complex changes across all layers of the HPC system software stack, including applications, programming models, process managers, and resource management software, as well as an extensive and holistic co-design process to establish new techniques and policies for scheduling and resource optimization. Currently, there is no formal definition of a dynamic resource management design, leading to incompatibilities, a lack of flexibility, and lost optimization potential.

In this work, we formally define a set of generic design principles for DRM in HPC, aiming to prevent the specialization lock-in of past approaches and pave the way for common standards. We also explore how these principles can be integrated into the HPC software stack. We view this work as a crucial foundation for developing DRM approaches suitable for production HPC systems.

11:30
Aleksei Fedotov (Intel, Germany)
Michael Voss (Intel, United States)
Ilya Isaev (Intel, Germany)
A Case Study for Resolving Composability Issues Using a Shared CPU Resource Coordinator

ABSTRACT. Modern computer systems have an increasingly large number of physical CPU cores and software applications that run on these systems are increasingly modular, using a diverse set of software components, frameworks, and plugins. The developers of these software components introduce parallelism into their code to make use of the power that exists on these modern computers. If all layers and components target a common thread pool, such as Windows Thread Pool or Grand Central Dispatch, composability is achieved but often by sacrificing control over parallel performance. More performance-oriented options such as Threading Building Blocks (TBB) or OpenMP provide more user control and performance hooks but maintain their own thread pools. This paper introduces a novel permit-based arbitrator, Thread Composability Manager (TCM), that serves as a shared CPU resource coordinator among parallel runtimes that do not share a common thread pool, helping them to negotiate for CPU resources to avoid performance composability issues, while retaining the performance hooks that make these models attractive in the first place. In this paper, we give an overview of performance composability issues that can arise when mixing threading runtimes and demonstrate relevance to current applications. We then propose our novel TCM arbitrator and contrast it with existing approaches to resolving these issues. We provide an evaluation of TCM on three workloads, demonstrating how it performs in comparison with known workarounds. We then present plans for future work and our conclusions.
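
TCM's actual interface is not described in the abstract; as a conceptual sketch only (not TCM's API), permit-based arbitration can be pictured as a shared coordinator from which competing runtimes acquire CPU permits before sizing their thread pools:

    import os
    import threading

    class PermitCoordinator:
        """Toy shared coordinator: hands out at most `total` CPU permits across runtimes."""
        def __init__(self, total=None):
            self._sem = threading.BoundedSemaphore(total or os.cpu_count())

        def acquire(self, n):
            """Grant up to n permits without blocking; return how many were granted."""
            granted = 0
            for _ in range(n):
                if self._sem.acquire(blocking=False):
                    granted += 1
                else:
                    break
            return granted

        def release(self, n):
            for _ in range(n):
                self._sem.release()

    # Two "runtimes" negotiating: each sizes its pool to what it was actually granted.
    coordinator = PermitCoordinator()
    runtime_a_permits = coordinator.acquire(8)
    runtime_b_permits = coordinator.acquire(8)
    print(f"runtime A got {runtime_a_permits} permits, runtime B got {runtime_b_permits}")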

12:00
Tiberiu Rotaru (Fraunhofer Institute For Industrial Mathematics ITWM, Germany)
Rui Machado (Fraunhofer Institute for Industrial Mathematics ITWM, Germany)
Experimental Evaluation of Scheduling Strategies for Evolving Workflow-Based Applications

ABSTRACT. Dynamic resource allocation for parallel applications is regarded as a promising feature that can help to improve the efficient use of modern high-performance computing systems, which are becoming increasingly complex and costly to run. The current job management systems do not fully exploit this capability, preventing applications from taking full advantage of it. At the same time, adding support for it entails significant changes to the system's software stack. In this context, the development of scheduling strategies capable of supporting dynamic resource allocation is a key aspect to address. Ideally, such strategies should aim to optimize multiple objectives and should be extensively tested before deployment in production, to avoid deteriorating the quality of services that users are accustomed to. However, the lack of adaptive workloads produced by real systems that can be reused in testing scheduling heuristics, along with the absence of a standardized format to describe them, represents a serious limiting factor in the design of appropriate strategies. This paper presents a methodology for modeling experimental evolving workloads, demonstrating their use in designing and evaluating scheduling heuristics that support dynamic on-demand resource allocation.

11:00-12:30 Session 8C: HiPES.2
11:00
Marco Edoardo Santimaria (University of Torino, Italy)
Adriano Marques Garcia (University of Torino, Italy)
Giulio Malenza (University of Torino, Italy)
Stefano Monaldi (University of Torino, Italy)
Marco Aldinucci (University of Torino, Italy)
Robert Birke (University of Torino, Italy)
Thread Monitoring Tool: transparent characterization of threading patterns with eBPF

ABSTRACT. Controlling the number of threads in parallel computation is a key tuning factor for performance. If too few, the program does not leverage all the available hardware resources. If too many, thread management overhead and bottleneck-resource congestion slow down execution. A good rule-of-thumb starting value is often one software thread for each hardware core. However, complex software may use parallelism at multiple (nested) levels, making it difficult to predict which threads are created during execution. Moreover, classic monitoring tools only sample the process tree at given time instants, which misses any thread not yet created or already destroyed before sampling happens. Here, we present a thread monitoring tool that traces kernel events for continuous tracking of threads spawned and joined by a program during its execution with little to no overhead. An evaluation of simple applications running with different parallel backends, as well as of PyTorch with the Llama-3.2-1B model and of Bzip2 running with two different parallel backends, shows the tool's capabilities.
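
As a minimal sketch of the underlying idea, tracing kernel scheduler events instead of sampling the process tree, a bcc/eBPF program can hook the sched_process_fork tracepoint, which fires for every new task, including threads (this is an illustration, not the authors' tool; it requires root privileges and the bcc toolkit):

    from bcc import BPF

    PROG = r"""
    TRACEPOINT_PROBE(sched, sched_process_fork) {
        // Fires for every clone/fork, i.e. for threads as well as processes.
        bpf_trace_printk("spawn parent=%d child=%d\n", args->parent_pid, args->child_pid);
        return 0;
    }
    """

    b = BPF(text=PROG)
    print("Tracing task creation (Ctrl-C to stop)...")
    b.trace_print()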

11:30
Tommaso Foglio Bonda (University of Turin, Italy)
Doriana Medic (University of Turin, Italy)
Alberto Mulone (University of Turin, Italy)
Marco Aldinucci (University of Turin, Italy)
Accelerating SWIRL Workflows: A High-Performance Rust Backend for Distributed Execution

ABSTRACT. As digital technologies advance, the scientific community is increasingly focused on running complex workflows while aiming to reduce computational costs by optimizing hardware usage, data movement, and task orchestration. This work presents the implementation of a Rust compilation target for SWIRL, an intermediate representation language for distributed scientific workflows. The main goal is to enhance the execution of distributed workflows by leveraging Rust's system programming capabilities. The architecture of the Rust target proposed in this work is built around two main components: the Orchestra module, which handles low-level communications, and SWIRL-rs, which integrates with SWIRL.

Experiments on a synthetic workflow and communication tests showed a performance improvement of approximately 30% compared to the Python version and nearly 400% when evaluating the broadcast implementation. Further experiments on the 1000 Genomes workflow demonstrate a consistent improvement in performance.

12:00
Eugenio Cesario (University of Calabria, Italy)
Salvatore Giampà (Relatech Group, Italy)
Domenico Talia (University of Calabria, Italy)
Building Parallel Machine Learning Workflows in PyCOMPSs: The Case Study of Tsunami Forecasting

ABSTRACT. Workflows are an effective and widely used formalism to represent the data and execution flows associated with complex data analysis and learning tasks in scientific applications. In this paper we present and discuss a real-world use case of a scientific application based on parallel workflows implemented with the PyCOMPSs programming framework. We describe the design and implementation of a parallel machine learning approach based on the workflow formalism for discovering tsunami predictive models from simulation data. The experimental evaluation has been performed on real data related to the Zemmouri-Boumerdes earthquake and the consequent tsunami that occurred in the Western Mediterranean. The paper discusses how the proposed solution is effective and scalable.
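
A hedged PyCOMPSs skeleton of such a workflow might look as follows (task bodies and names are placeholders, not the paper's tsunami models; running it requires the COMPSs runtime):

    from pycompss.api.task import task
    from pycompss.api.api import compss_wait_on

    @task(returns=1)
    def train_partial_model(simulation_chunk):
        """Fit a predictive model on one partition of the simulation data (stand-in)."""
        return {"n": len(simulation_chunk)}

    @task(returns=1)
    def merge_models(model_a, model_b):
        """Combine two partial models, e.g. by ensembling (stand-in)."""
        return {"n": model_a["n"] + model_b["n"]}

    def workflow(chunks):
        models = [train_partial_model(c) for c in chunks]   # tasks may run in parallel
        merged = models[0]
        for m in models[1:]:
            merged = merge_models(merged, m)
        return compss_wait_on(merged)                        # synchronize on the final result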

11:00-11:50 Session 8D: GraphSys.1
11:00
Shaoshuai Du (University of Amsterdam, Netherlands)
Joze Rozanec (University of Twente, Netherlands)
Ana Lucia Varbanescu (University of Twente, Netherlands)
Andy D. Pimentel (University of Amsterdam, Netherlands)
A Comparative Study of Streaming Graph Processing Systems

ABSTRACT. Streaming Graph Processing Systems (SGPSs) are essential for real-time analytics on dynamic graphs in domains such as social networks and knowledge graphs. Despite increasing interest and a growing number of SGPSs, practical performance comparisons remain limited due to architectural heterogeneity and inconsistent evaluation practices.

This work presents a unified benchmarking workflow for empirically evaluating SGPSs across latency, resource usage, and energy efficiency. We select three representative systems—GraphStream (GS), GraphBolt (GB), and RisGraph (RG)—to reflect classical and incremental design philosophies, and evaluate them using two common graph algorithms (BFS and SSSP) across diverse real-world datasets.

Our results reveal distinct trade-offs: RG excels in low-latency tasks but supports only monotonic algorithms; GB achieves strong batch performance and broader algorithm support; GS maintains stable latency at the cost of higher memory and energy consumption. We also conduct the first empirical comparison of SGPS energy efficiency.

Our findings offer practical guidance for system selection and provide a reproducible foundation for future SGPS benchmarking and optimization.

11:25
Junaid Ahmed Khan (University of Bologna, Italy)
Andrea Bartolini (University of Bologna, Italy)
A Unified Ontology for Scalable Knowledge Graph–Driven Operational Data Analytics in High-Performance Computing Systems

ABSTRACT. Modern high-performance computing (HPC) systems generate massive volumes of heterogeneous telemetry data from millions of sensors monitoring compute, memory, power, cooling, and storage subsystems. As HPC infrastructures scale to support increasingly complex workloads—including generative AI—the need for efficient, reliable, and interoperable telemetry analysis becomes critical. Operational Data Analytics (ODA) has emerged to address these demands; however, the reliance on schema-less storage solutions limits data accessibility and semantic integration. Ontologies and knowledge graphs (KGs) provide an effective way to enable efficient and expressive data querying by capturing domain semantics, but they face challenges such as significant storage overhead and the limited applicability of existing ontologies, which are often tailored to specific HPC systems only. In this paper, we present the first unified ontology for ODA in HPC systems, designed to enable semantic interoperability across heterogeneous data centers. Our ontology models telemetry data from the two largest publicly available ODA datasets—M100 (CINECA, Italy) and F-DATA (Fugaku, Japan)—within a single data model. The ontology is validated through 36 competency questions reflecting real-world stakeholder requirements, and we introduce modeling optimizations that reduce KG storage overhead by up to 38.84% compared to a previous approach, with an additional 26.82% reduction depending on the desired deployment configuration. This work paves the way for scalable ODA KGs and supports not only analysis within individual systems, but also cross-system analysis across heterogeneous HPC systems.
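
As a toy illustration of the knowledge-graph approach (the namespace, classes, and properties below are invented for the example and are not the proposed ontology), telemetry can be modeled as RDF triples and queried with a competency-question-style SPARQL query using rdflib:

    from rdflib import Graph, Literal, Namespace, RDF
    from rdflib.namespace import XSD

    ODA = Namespace("http://example.org/oda#")   # placeholder namespace
    g = Graph()
    g.bind("oda", ODA)

    g.add((ODA.node42, RDF.type, ODA.ComputeNode))
    g.add((ODA.reading1, RDF.type, ODA.PowerReading))
    g.add((ODA.reading1, ODA.observedOn, ODA.node42))
    g.add((ODA.reading1, ODA.watts, Literal(212.5, datatype=XSD.double)))

    # Competency-question style: which nodes have power readings above 200 W?
    q = """
    SELECT ?node ?w WHERE {
      ?r a oda:PowerReading ; oda:observedOn ?node ; oda:watts ?w .
      FILTER (?w > 200)
    }
    """
    for node, watts in g.query(q, initNs={"oda": ODA}):
        print(node, watts)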

12:30-14:00 Lunch Break
14:00-15:30 Session 11A: DynResHPC.2
14:00
Paula Sánchez-Checa (Universidad Carlos III de Madrid, Spain)
Genaro Sánchez-Gallegos (Universidad Carlos III de Madrid, Spain)
Javier Garcia-Blas (Universidad Carlos III de Madrid, Spain)
Jesus Carretero (Universidad Carlos III de Madrid, Spain)
David E. Singh (Universidad Carlos III de Madrid, Spain)
Comparative Analysis of Algorithms for Malleability Decision-Making in Applications and File Systems

ABSTRACT. In this work, we present a dynamic resource management framework that uses malleability both at the application and the parallel file system levels, using EpiGraph and Hercules as use cases. The former is an agent-based, parallel, data-intensive epidemiological simulator. The latter is an ad-hoc parallel in-memory file system. In addition, two optimization algorithms, based on dynamic programming and greedy search, determine the most efficient number of compute and I/O nodes during the application execution. Both optimization algorithms are compared in terms of performance and quality of the solution. Results show that, by means of malleability, the execution time of EpiGraph can be reduced by up to 62%, while reducing the operational time by 30% compared with a static base case, demonstrating the effectiveness of the methodology.
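
As a purely illustrative sketch of the greedy flavor of such decision-making (the cost model and efficiency threshold are invented and are not the paper's algorithms), one can keep adding nodes while the estimated parallel efficiency stays above a floor:

    def greedy_node_choice(runtime_model, max_nodes, efficiency_floor=0.7):
        """Greedily add nodes while estimated parallel efficiency stays above a threshold.

        runtime_model(n) returns the estimated execution time on n nodes (illustrative).
        """
        best = 1
        for n in range(2, max_nodes + 1):
            speedup = runtime_model(1) / runtime_model(n)
            if speedup / n >= efficiency_floor:
                best = n
            else:
                break
        return best

    def model(n):
        return 0.1 + 0.9 / n   # synthetic Amdahl-style runtime (90% parallel fraction)

    print(greedy_node_choice(model, max_nodes=32))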

14:30
Zafer Bora Yılmazer (Technical University of Munich, Germany)
Dominik Huber (Technical University of Munich, Germany)
Arjun Parab (Leibniz Supercomputing Centre, Germany)
Amir Raoofy (Leibniz Supercomputing Centre, Germany)
Josef Weidendorfer (Leibniz Supercomputing Centre, Germany)
Malleability in LAIK with MPI Dynamic Processes and PSets

ABSTRACT. Dynamic resource allocation in High-Performance Computing (HPC) facilitates the adaptation of resource allocation during job runtime. This allows applications to adjust to varying workloads across different execution stages for increased throughput, and to react to system-side job resize requests for improved scheduling. Recent work in dynamic resource management focused either on process-level resource management or on data management across multiple processes, without integrating these approaches. LAIK is a communication library for HPC applications enabling automatic data management and partitioning. This paper proposes to enhance LAIK through the use of the Dynamic PSets Library (DynPSets), based on a dynamic MPI implementation using MPI Sessions. We evaluate the overhead of data redistribution and MPI reconfiguration for two example codes from LAIK, both for expansion and shrinking.

15:00
Ahmad Tarraf (TU Darmstadt, Germany)
Glib Grozin (TU Darmstadt, Germany)
Felix Wolf (TU Darmstadt, Germany)
Dynamic Data Redistribution for Malleable MPI Frameworks through Virtual Topologies

ABSTRACT. High-Performance Computing (HPC) systems play a pivotal role in solving intricate problems efficiently. However, the inherent lack of malleability in these systems, particularly the inability to dynamically adjust resource allocation during job runtime, poses a challenge to optimal resource utilization. To address this limitation, various frameworks have been developed on top of the Message Passing Interface (MPI) to introduce job malleability. MPI's virtual topologies, a powerful feature for expressing communication patterns, face a challenge when interfacing with malleability frameworks, yet the concept behind them could simplify data redistribution during malleable reconfigurations. To simplify the process of developing malleable applications, we created a library that aims to streamline the integration of malleable virtual topologies, making dynamic resource allocation more accessible to a broader user base. In this concept paper, we describe our work-in-progress approach, highlighting how it forges the path towards automated data redistribution for malleable applications.

15:15
Iker Martín Álvarez (Universidad Jaume I, Spain)
José Ignacio Aliaga (Computer Science and Engineering Department, University Jaume I, Spain)
Mª Isabel Castillo (Universitat Jaume I, Spain)
Dynamic reconfiguration for malleable applications using RMA

ABSTRACT. This paper investigates novel one-sided communication methods based on remote memory access (RMA) operations in MPI for the dynamic resizing of malleable applications, enabling data redistribution with minimal impact on application execution. After their integration into the MaM library, these methods are compared with traditional collective-based approaches. In addition, the existing Wait Drains strategy is extended to support efficient background reconfiguration. Results show comparable performance, though high initialization costs currently limit their advantage.
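
As a minimal mpi4py illustration of the kind of one-sided operation involved (between two existing ranks only; the paper applies RMA to redistribution during dynamic resizing within the MaM library), one rank can expose a buffer in an MPI window and another can fetch a slice of it with Get:

    # Run with, e.g.: mpirun -n 2 python rma_demo.py
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    local = np.arange(8, dtype="d") if rank == 0 else np.zeros(8, dtype="d")
    win = MPI.Win.Create(local, local.itemsize, comm=comm)   # expose local buffer for RMA

    win.Fence()
    if rank == 1:
        chunk = np.empty(4, dtype="d")
        win.Get(chunk, 0, target=(4, 4, MPI.DOUBLE))   # fetch elements 4..7 from rank 0
    win.Fence()

    if rank == 1:
        print("fetched:", chunk)
    win.Free()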

14:00-14:30 Session 11B: HiPES.3
14:00
Maximo Rodriguez (Universidad Carlos III de Madrid, Spain)
Dante Sánchez-Gallegos (Universidad Carlos III de Madrid, Spain)
Marco Nuñez (Instituto Nacional de Rehabilitacion "Luis Guillermo Ibarra Ibarra", Mexico)
Heriberto Aguirre-Meneses (Instituto Nacional de Rehabilitacion "Luis Guillermo Ibarra Ibarra", Mexico)
Luis Villalvazo-Gutiérrez (Instituto Nacional de Rehabilitacion "Luis Guillermo Ibarra Ibarra", Mexico)
Mario Ibrahin Gutiérrez Velasco (SECIHTI - Departamento de Sistemas Médicos INRLGII, Mexico)
Jose Luis Gonzalez-Compean (Cinvestav-Tamps, Mexico)
Jesus Carretero (Universidad Carlos III de Madrid, Spain)
A Computer-aided Framework for Detecting Osteosarcoma in Computed Tomography Scans

ABSTRACT. Osteosarcoma is the most common primary bone cancer, mainly affecting the youngest and oldest populations. Its detection at early stages is crucial to reduce the probability of developing bone metastasis. In this context, an accurate and fast diagnosis is essential to help physicians during the prognosis process. The research goal is to automate the diagnosis of osteosarcoma through a pipeline that includes the pre-processing, detection, post-processing, and visualization of computed tomography (CT) scans. Thus, this paper presents a machine learning and visualization framework for classifying CT scans using different convolutional neural network (CNN) models. Pre-processing includes data augmentation and identification of the region of interest in scans. Post-processing includes data visualization to render a 3D bone model that highlights the affected area. An evaluation on 12 patients revealed the effectiveness of our framework, obtaining an area under the curve of 94.8% and a specificity of 94.6%.
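
A hedged sketch of the classification stage (the model choice, class count, and hyperparameters are placeholders, not the paper's configuration) could fine-tune a pretrained CNN on pre-processed CT slices:

    import torch
    import torch.nn as nn
    from torchvision import models

    # Placeholder setup: a pretrained ResNet-18 adapted to a binary decision
    # (tumor vs. healthy tissue) on pre-processed CT slices.
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    model.fc = nn.Linear(model.fc.in_features, 2)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = nn.CrossEntropyLoss()

    def train_step(images, labels):
        """One optimization step on a batch of (N, 3, H, W) image tensors."""
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        return loss.item()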

14:40-15:30 Session 13: GraphSys.2
14:40
Aristeidis Mastoras (Computing Systems Laboratory, Huawei Zurich Research Center, Switzerland)
Albert-Jan N. Yzelman (Computing Systems Laboratory, Huawei Zurich Research Center, Switzerland)
Efficient handling of sparse vectors for parallel nonblocking execution in GraphBLAS

ABSTRACT. GraphBLAS allows the expression of algorithms in the language of linear algebra, and it aims at performing automatic optimization and parallelization. Recent work shows that parallel nonblocking execution outperforms the corresponding blocking execution by up to 4.11× by reusing data in cache. However, the presented design, implementation, and evaluation focus on dense vectors. In this work, we present design and implementation extensions for efficient handling of sparse vectors, and to reduce the overhead, we propose compile-time and run-time optimizations. The evaluation shows mixed results, with promising speedups of up to 1.72× for nonblocking over blocking execution.
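
As a rough SciPy analogue of the kind of masked, algebra-style operation such systems optimize (this is not GraphBLAS code, and a real sparse-vector implementation keeps the frontier sparse rather than dense), one BFS level is a vector-matrix product followed by masking out already-visited vertices:

    import numpy as np
    import scipy.sparse as sp

    # Tiny directed graph: 0->1, 0->2, 1->2, 2->3.
    A = sp.csr_matrix(np.array([[0, 1, 1, 0],
                                [0, 0, 1, 0],
                                [0, 0, 0, 1],
                                [0, 0, 0, 0]], dtype=np.int8))

    frontier = np.array([1, 0, 0, 0], dtype=np.int8)   # BFS frontier as a vector
    visited = frontier.astype(bool)

    while frontier.any():
        reached = A.T.dot(frontier) > 0        # one vector-matrix product = one BFS level
        new = reached & ~visited               # "mask": drop already-visited vertices
        visited |= new
        frontier = new.astype(np.int8)

    print("vertices reachable from 0:", np.flatnonzero(visited))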

15:05
Duncan Bart (University of Twente, Netherlands)
Kuan-Hsun Chen (University of Twente, Netherlands)
Ana-Lucia Varbanescu (University of Twente, Netherlands)
Millibenchmarking: Using Graph Sampling for Ranking GPU PageRank Implementations

ABSTRACT. When executing graph algorithms on the GPU, multiple implementations often exist for the same algorithm, with varying methods of storing and accessing the graph data. However, depending on the input graph, the performance of these implementations can vary by multiple orders of magnitude. Selecting the optimal kernel for a given graph is critical for minimizing wasted resources and compute time. Previous work, which tried to achieve this using analytical modeling, has proven unsuccessful. This work introduces millibenchmarking, a method that benchmarks the implementations on a small sample of the input data and uses the benchmarking results to select the optimal kernel for a given workload. We evaluate our method via seven common sampling strategies on the PageRank algorithm with a diverse set of input graphs. We find that Random Edge Sampling is most effective for selecting the optimal PageRank kernel. Nevertheless, the prediction overhead is not yet negligible and may be reduced in future work.
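
A toy version of the millibenchmarking idea (with networkx PageRank variants standing in for GPU kernels, and the sampling fraction chosen arbitrarily) might look like this:

    import random
    import time
    import networkx as nx

    def random_edge_sample(G, fraction=0.01):
        """Build a small subgraph from a random sample of the edges."""
        k = max(1, int(fraction * G.number_of_edges()))
        return nx.DiGraph(random.sample(list(G.edges()), k))

    def millibenchmark(G, candidates, fraction=0.01):
        """Time each candidate on a sample and return the name of the fastest one."""
        sample = random_edge_sample(G, fraction)
        timings = {}
        for name, kernel in candidates.items():
            start = time.perf_counter()
            kernel(sample)
            timings[name] = time.perf_counter() - start
        return min(timings, key=timings.get), timings

    # Two illustrative "implementations" (in practice: different GPU kernels).
    candidates = {
        "power_iteration": lambda g: nx.pagerank(g, max_iter=100),
        "few_iterations": lambda g: nx.pagerank(g, max_iter=20, tol=1e-4),
    }
    G = nx.gnm_random_graph(5000, 40000, directed=True)
    best, times = millibenchmark(G, candidates)
    print(best, times)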

15:30-16:00 Coffee Break