Program for Friday, August 30th

PROGRAM FOR FRIDAY, AUGUST 30TH

Days:

09:00-10:00 Session 10: Keynote: İlkay Altıntaş

Bridging the Data Gaps to Democratize AI in Science, Education and Society

The democratization of Artificial Intelligence (AI) necessitates an ecosystem where data and research infrastructure are seamlessly integrated and universally accessible. This talk overviews the imperative of bridging the gaps between these components through robust services, facilitating an inclusive AI landscape that empowers diverse research communities and domains. The National Data Platform (NDP) aims to lower the barriers to entry for AI research and applications through an integrated services approach to streamline AI workflows, from data acquisition to model deployment. This approach underscores the importance of open, extensible, and equitable systems in driving forward the capabilities of AI, ultimately contributing to the resolution of grand scientific and societal challenges. Through examining real case studies leveraging open data platforms and scalable research infrastructure, the talk will highlight the role of composable systems and services in NDP to catalyze a platform to empower users from all backgrounds to engage in meaningful research, learning, and discovery.

Chair:

Sameer Shende

Location: Auditorium

10:00-10:30Coffee Break

10:30-12:30 Session 11A: WHPC Session

Chair:

Marta Garcia

Location: Auditorium

10:30	Rosa Badía Making easier the life-cycle management of complex application workflows ABSTRACT. With Exaflop systems already here, high-performance computing (HPC) involves larger and more complex supercomputers. At the same time, the user community is aware of the underlying performance and eager to leverage it by providing more complex application workflows to leverage them. Moreover, current application trends aim to use data analytics and artificial intelligence combined with HPC modelling and simulation. However, the programming models and tools are different in these fields, and there is a need for methodologies that enable the development of workflows that combine HPC software, data analytics, and artificial intelligence. PyCOMPSs is a parallel task-based programming in Python. Based on simple annotations, it can execute sequential Python programs in parallel in HPC clusters and other distributed infrastructures. PyCOMPSs has been extended to support tasks that invoke HPC applications and to combine them with Artificial Intelligence and Data analytics frameworks. In addition, to help with the overall workflow lifecycle management, we have defined the HPC Workflows as a Service (HPCWaaS) methodology that aims to provide tools to simplify the development, deployment, execution, and reuse of workflows. In particular, we will describe the Container Image Creation service, which automates the creation of container images tailored to a specific HPC platform. The talk will present these components as well as how they have been applied in multiple scientific and engineering areas and how we foresee its application in the development of digital twins for science and technology.
11:10	Serena Curzel Pre-Scheduling of Affine Loops for HLS Pipelining ABSTRACT. Loop transformations are essential to improve the quality of results of accelerators generated through High-Level Synthesis (HLS). Loop pipelining is usually performed on a low-level intermediate representation (IR) of the code, which includes the notion of time and has access to information about available resources. In this paper, we introduce loop pipelining as a pre-optimization outside the HLS tool, applying scheduling and code generation to transform affine loops in a high-level frontend based on MLIR. Working on such an abstract level simplifies the analysis of dependencies and the implementation of code generation steps, it does not require access to low-level architectural details, and nevertheless, it can achieve comparable accelerator performance to state-of-practice HLS loop pipelining. The proposed approach does not depend on a specific HLS backend, and it can be easily integrated with existing and future high-level optimizations.
11:40	Marta Bertran Ferrer Evaluation of CPU constraining mechanisms in the LHC ALICE experiment Grid ABSTRACT. Grid workflows across applications show significant differences in resource usage patterns. Sharing resources concurrently comes with the challenge of fluctuating allocations and varying loads on the execution machines. To achieve fair behaviour with other workflows and prevent job terminations due to resource overconsumption, controlled and predictable CPU usage is crucial. This paper evaluates the impact of incorporating CPU constraining mechanisms within the Grid middleware itself, specifically targeting a heterogeneous and dynamic environment like the LHC ALICE experiment Grid. Leveraging existing tools in the operating systems of the execution machines, CPU limitations can be configured at various levels, ranging from pinning the execution on specific CPU cores to setting a dedicated CPU bandwidth. We present performance results demonstrating that CPU-constrained execution environments lead to highly predictable CPU efficiency for jobs. Additionally, the effects of NUMA-aware scheduling for processes spawned by these jobs are explored, evaluating how it impacts performance on different execution environments.

10:30-12:30 Session 11B: Industrial Session

Location: -1.A.01

10:30	Helena Vela Supporting HPC Centers: challenges, horror stories and best practices ABSTRACT. Despite HPC centers being the bearers of cutting-edge technologies and lead by example in excellence, we often find that most of the HPC systems performance is not squeezed to its maximum. Real world problems found in these clusters such as the use of unoptimized software, lack of understanding of the hardware underneath or machines as black boxes, novel users, poor scheduler configuration and lack of scheduler knowledge, inefficient filesystems, missing monitoring and similar issues harm the full potential of HPC systems. User support teams are crucial in ensuring the optimal utilization of HPC clusters. By proactively identifying, resolving, and preventing issues, they play a pivotal role in maintaining system reliability and enhancing user experience. In this talk, real examples of these problems and the solutions provided will be covered to enforce the best practices across many HPC research centers and institutions.
11:00	Sameer Shende ParaTools Pro for E4S ABSTRACT. How do you configure, manage and launch multi-node, multi-user clusters on AWS for HPC and AI applications? How do you manage the complexity of the software stack to deploy your applications? ParaTools Pro for E4S™- the Extreme-scale Scientific Software Stack E4S - hardened for commercial clouds and supported by ParaTools, Inc. provides a platform for developing and deploying HPC and AI/ML applications. It features a performant remote desktop environment (based on VNC) on the login node and compute nodes interconnected by a low-latency, high bandwidth network adapter based on AWS Elastic Fabric Adapter (EFA). ParaTools Pro for E4S™ features a suite of over 100 HPC tools built using the Spack package manager and the proprietary MVAPICH MPI tuned for high speed network interface cards on commercial cloud platforms. It features ready to use HPC applications (such as OpenFOAM, LAMMPS, Xyce, Quantum Espresso) as well as AI/ML tools based on Python (such as NVIDIA NeMo™, TensorFlow, PyTorch, JAX, Horovod, Keras, OpenCV, matplotlib and supports Jupyter notebooks) and the Codium IDE. New packages can be easily installed using Spack and pip and are accessible on the cluster compute and login nodes. It may be used for developing the next generation of generative AI applications using a suite of Python tools and interfaces. For more information: https://paratoolspro.com”
11:30	Elisabetta Boella E4 at the forefront of European HPC ABSTRACT. E4 Computer Engineering is an Italian solution provider for High Performance Computing, Artificial Intelligence and Cloud Computing. Since 2012, we have been involved in several national and European projects, either collaborating on co-design tasks or on the development of prototypes and platforms. In this talk, I will present the company and some of its recent success stories. I will then discuss E4’s role within the Centres of Excellence MaX, SPACE, and EoCoE.

12:30-13:30Lunch Break

13:30-14:50 Session 12A: Programming, Compilers and Performance (I)

Chair:

Javier Fernandez Muñoz

Location: -1.A.05

13:30	Suren Harutyunyan Gevorgyan, Anna Sikora, Eduardo Cesar, Jiří Filipovič, Akash Dutta, Ali Jannesari and Jordi Alcaraz Efficient Code Region Characterization through Automatic Performance Counters Reduction using Machine Learning Techniques PRESENTER: Suren Harutyunyan Gevorgyan ABSTRACT. Leveraging hardware performance counters provides valuable insights into system resource utilization, aiding performance analysis and tuning for parallel applications. The available counters vary with architecture and are collected at execution time. Their abundance and the limited number of registers for measurement make gathering laborious and costly. Efficient characterization of parallel regions necessitates a dimension reduction strategy. While recent efforts have focused on manually reducing the number of counters for specific architectures, this paper introduces a novel approach: an automatic dimension reduction technique for efficiently characterizing parallel code regions across diverse architectures. The methodology is based on Machine Learning ensembles because of their precision and ability at capturing different relationships between the input features and the target variables. Evaluation results show that ensembles can successfully reduce the number of hardware performance counters that characterize a code region. We validate our approach on CPUs using a comprehensive dataset of OpenMP regions, showing that any region can be accurately characterized by 8 relevant hardware performance counters. In addition, we also apply the proposed methodology on GPUs using a reduced set of kernels, demonstrating its effectiveness across various hardware configurations and workloads.
13:50	Anju Mongandampulath Akathoott and Rupesh Nasre. FlexiGran: Flexible Granularity Locking in Hierarchies PRESENTER: Anju Mongandampulath Akathoott ABSTRACT. Locking continues to be a primary technique used to achieve thread synchronization. Especially in the case of rooted hierarchies, semantic locking can be achieved with physical locks at various granularity levels. Such a multi-granularity locking (MGL) provides interesting trade-off between the locking cost and the size of the locked sub-hierarchy. At one extreme, fine-grained locking precisely locks the nodes of interest, but incurs a high locking cost. In contrast, a coarse-grained lock may lock the root of the hierarchy minimizing the locking cost, but locking many more nodes than required. Existing approaches to MGL (i) do not work well with non-tree hierarchies such as DAGs, (ii) disallow structural updates to the hierarchy, (iii) do not support the co-existence of fine-grained and coarse-grained locks, or (iv) are rigid towards their underlying memory usage. In this work, we propose a versatile technique named FlexiGran, which does not have any of these issues. It allows co-existence of hierarchical and fine-grained locks in an arbitrarily shaped hierarchy which can undergo structural alterations at run time, while allowing a user to control its memory usage by adding optional approximations. We illustrate the effectiveness of FlexiGran using STMBench7, and compare it empirically with two recent locking techniques, DomLock and HiFi. On a static hierarchy with more than 1 million nodes, FlexiGran shows an improvement in throughput of around 159\% and 374\% on an average, compared to HiFi and DomLock respectively.
14:10	Mohammad Zubair and Christoph Bauinger ESIMD GPU implementations of Deep Learning Sparse Matrix Kernels PRESENTER: Christoph Bauinger ABSTRACT. We demonstrate that explicit SIMD programming on GPUs can outperform traditional programming environments such as CUDA and SYCL for three sparse matrix computations found in deep learning applications. Intel oneAPI's Explicit SIMD (ESIMD) SYCL extension API allows for simpler vectorization of arithmetic and memory operations which is critical in achieving good performance. We explore sparse matrix operations relevant to deep learning applications, namely the sparse-dense matrix multiplication (SPMM), the sampled dense-dense matrix multiplication (SDDMM), and the composition of the SDDMM with SPMM (FusedMM). Our ESIMD optimizations target the Intel Data Center GPU Max 1550, a device similar to the Intel GPU used in the nascent Aurora exascale system being deployed at Argonne National Laboratory. We evaluated performance on the test data set used by previous work, and our implementation outperforms state-of-the-art CUDA implementations on the latest NVIDIA hardware by up to a factor of 6.14. Additionally, our proposed implementation outperforms Intel's oneMKL implementation on Intel's GPU.

13:30-14:50 Session 12B: Multidisciplinary, Domain-Specific and Applied Parallel and Distributed Computing (IV)

Chair:

Manuel Capel

Location: -1.A.01

13:30	Xianlong Zhou, Pei Li, Jiageng Chen and Shixiong Yao Accelerating Stencil Computation with Fully Homomorphic Encryption Using GPU PRESENTER: Pei Li ABSTRACT. Stencil computations with fully homomorphic encryption (FHE) is an emerging area with significant potential to address the challenges of sensitive data processing and secure outsourcing computing. However, the computational overhead introduced by FHE can drastically reduce the performance of stencil computations compared to unencrypted implementations. This paper proposes two optimized algorithms for stencil computation with FHE tailored to GPU platforms:Matrix Overlap Processing (MOP) and Matrix Fixed-point Processing(MFP). MOP divides the input matrix into multiple slices and then encrypts the elements located at the same positions across different slices into a single ciphertext which is subsequently processed using an identical computing pattern.MFP directly encrypts the neighboring elements into ciphertexts and stores them in a table which will be processed in parallel on GPU. The experimental results show that our proposed methods achieve significant speedups compared to the corresponding OpenMP implementations on CPU. Specifically, the MOP implementation achieves a speedup of 8.7x, while the MFP implementation achieves a speedup of 10.3x on GPU. These results demonstrate the potential of our proposed methods for enabling secure and efficient stencil computations on sensitive data.
13:50	Helena Schubert da Incarnacao Lima da Silva, Maria Clicia Stelling de Castro, Fabricio Alves Barbosa da Silva and Alba Cristina Magalhaes Alves de Melo A Framework for Automated Parallel Execution of Scientific Multi-Workflow Applications in the Cloud with Work Stealing PRESENTER: Helena Schubert da Incarnacao Lima da Silva ABSTRACT. In this paper, we propose and evaluate an MPI/OpenMP framework to execute cloud applications composed of scientific linear multi-workflows with unknown task execution times, and substantial I/O activity. In order to achieve load balancing, our framework incorporates a two-level work stealing strategy, with intra-node and inter-node stealing. The framework was evaluated in a cluster of 16 virtual machine (VM) instances (4 vCPUs), deployed on AWS Parallel Cluster. The results show that, for a real Bioinformatics application, composed of 400 workflows, we are able to reduce the execution time from 1 hour and 57 minutes (sequential) to 2 minutes and 52 seconds (16 instances), achieving a speedup of 40.89x, with 64 threads.
14:10	Guofeng Feng, Hongyu Wang, Zhuoqiang Guo, Mingzhen Li, Tong Zhao, Zhou Jin, Weile Jia, Guangming Tan and Ninghui Sun Accelerating Large-Scale Sparse LU Factorization for RF Circuit Simulation PRESENTER: Guofeng Feng ABSTRACT. Sparse LU factorization is the indispensable building block of the circuit simulation, and dominates the simulation time, especially when dealing with large-scale circuits. RF circuits have been increasingly emphasized with the evolution of ubiquitous wireless communication (i.e., 5G and WiFi). The RF simulation matrices show a distinctive pattern of structured dense blocks, and this pattern has been inadvertently overlooked by prior works, leading to the underutilization of computational resources. In this paper, by exploiting the block structure, we propose a novel blocked format for L and U factors and re-design the large-scale sparse LU factorization accordingly, which leverages the data locality inherent in RF matrices. The data format transformation is streamlined, strategically eliminating the redundant data movement and costly indirect memory access. Moreover, the vector operations are converted into matrix operations, enabling efficient data reuse and enhancing data-level parallelism. The experiment results show that our method achieves superior performance to state-of-the-art implementation.

13:30-14:50 Session 12C: Scheduling, Resource Management, Cloud, Edge Computing, and Workflows (IV)

Chair:

Dominik Hubert

Location: -1.A.06

13:30	Zechun Zhou, Jingwei Sun, Hengquan Mei, Peng Sun and Guangzhong Sun DProbe: Profiling and Predicting Multi-Tenant Deep Learning Workloads for GPU Resource Scaling PRESENTER: Zechun Zhou ABSTRACT. The surge in deep learning services has precipitated the development of modern large-scale GPU datacenters, which cater to the computational demands of multi-tenant deep learning workloads. These facilities implement virtual cluster partitioning to maintain isolation across product groups. Scaling resource allocation within virtual clusters is crucial for enhancing resource utilization. However, effective resource scaling hinges on accurately forecasting resource demand trends, which is a task complicated by significant variations in GPU utilization among diverse deep learning instances. For this issue, we propose DProbe, a system designed to predict resource demand trends within virtual clusters, employing fine-grained profiling of multi-tenant deep learning workloads. Initially, DProbe employs a job profiler that integrates model-specific attributes with runtime hardware metrics to perform performance modeling for deep learning instances. Resource demands are then estimated through a multi-level approach, considering the distribution of instances across varying levels of GPU utilization. Additionally, DProbe incorporates a multi-task trend predictor to anticipate future resource demand trends, based on historical traces. DProbe's predictions enable precise resource scaling across virtual clusters. We evaluate DProbe using production traces across five scheduling policies and effectively reduce the average job queuing delay by 22.4% to 50.7%.
13:50	Haibo Tang, Huan Zhang, Zhenyu Zhang, Zhao Zhang, Cheqing Jin and Aoying Zhou Towards High-Performance Transactions via Hierarchical Blockchain Sharding PRESENTER: Haibo Tang ABSTRACT. Blockchain sharding, a promising approach to improve system performance, divides the network into several small parallel working shards. However, the performance of existing sharded blockchain systems may degrade seriously due to the existence of cross-shard transactions. To overcome such drawbacks, we propose a blockchain system called HieraChain to process transactions with robust cross-shard transaction tolerance, based on a novel hierarchical sharding architecture. The upper-layer shards order the cross-shard transactions and the participants process them asynchronously to pipeline the transactions ordering. Furthermore, HieraChain proposes an optimized locality-aware protocol to trade off the local access patterns and the induced remote access events. Extensive experimental results demonstrate that HieraChain outperforms the state-of-the-art approaches significantly in the presence of cross-shard transactions, achieving up to 3x and 2x higher throughput than Saguaro and SharPer under general workload respectively. Moreover, our locality-aware approach further reduces transaction latency by 68% and 51% compared to our basic approach and traditional baselines, respectively.
14:10	Tingkai Liu, Huili Tao, Yicheng Lu, Zhongbo Zhu, Marquita Ellis, Sara Kokkila-Schumacher and Volodymyr Kindratenko Automated Data Management and Learning-based Scheduling for Ray-based Hybrid HPC-Cloud Systems PRESENTER: Tingkai Liu ABSTRACT. HPC-Cloud hybrid systems are gaining popularity among scientists for their ability to manage sudden demand spikes, resulting in faster turnaround times for HPC workloads. However, deploying workloads on such systems currently requires complicated configuration, particularly for data migration across HPC clusters and Cloud. Additionally, existing schedulers lack support for workload scheduling on such hybrid systems. To address these issues, we have designed and implemented an HPC-Cloud bursting system based on Ray, an open-source distributed framework. Our system integrates automated data management with learning-based scheduling at the function level, using a dynamic label-based design. It automatically prefetches data files based on demand and detects data movement and execution patterns for future scheduling decisions. The developed framework is evaluated with two workloads: machine learning model training and image processing. We compare its performance against naive data fetching under various network speeds and storage locations. Results indicate the effectiveness of our system across all scenarios. The system is open-sourced. The source code and replication packages for reproducing experimental results are provided.

13:30-14:50 Session 12D: Architectures and Accelerators (IV)

Chair:

Raffaele Montella

Location: -1.A.04

13:30	Mohammad Hafezan, Reza Jahadi and Ehsan Atoofian PCTC: Hardware and Software Co-Design for Pruned Capsule Networks on Tensor Cores PRESENTER: Ehsan Atoofian ABSTRACT. Capsule Networks (CapsNets) are a generation of image classifiers that have tak-en the spotlight compared to convolutional neural networks (CNNs). Unlike CNNs, CapsNets are robust to affine transformation and are able to learn spatial relationships between features of an image. Since CapsNets require significant computing horsepower due to intensive matrix operations, GPUs have become the primary hardware platforms for the execution of CapsNets. In particular, GPUs equipped with tensor cores (TCs) are an attractive solution to address the computational requirements of CapsNets as TCs are designed to accelerate matrix operations. However, CapsNets deployed into TCs underutilize computational units as TCs are designed to run dense matrix operations. We propose pruned capsule TC (PCTC) which is a software/hardware co-design approach and avoids underutilization of TC resources. In particular, PCTC changes the sequence of matrix operations for capsule layers so that sparse operations are eliminated. PCTC further enhances the execution of CapsNets on TCs by eliminating those matrix operations that are not necessary to maintain the accuracy of the network. Quite often, CapsNets are designed with large capsules to increase accuracy. By pruning individual capsules, it is feasible to reduce the over-provisioned parame-ter space and reduce energy consumption in CapsNets. Evaluation results reveal that PCTC can achieve 31% energy saving for CapsNet inference, with negligible accuracy loss.
13:50	Chuhui Wang, Zewen Ye, Haibin Shen and Kejie Huang A Folded Computation-in-Memory Accelerator for Fast Polynomial Multiplication in BIKE PRESENTER: Zewen Ye ABSTRACT. Bit Flipping Key Encapsulation (BIKE) is a code-based key encapsulation mechanism that utilizes Quasi-Cyclic Medium Density Parity-Check (QC-MDPC) codes, which is a promising candidate in the National Institute of Standards and Technology (NIST) Post-Quantum Cryptography (PQC) standardization process. Polynomial multiplication calculation is the most critical operation in BIKE, which limits the speed of key generation and encapsulation. The high degree of the polynomial not only requires millions of computations but also necessitates complex memory access. To address this issue, we propose a Computation-in-Memory (CIM) based accelerator architecture for polynomial multiplication operations in BIKE. To minimize the size of the CIM core while maintaining high computational throughput, we propose a folded mapping strategy and a one-memory-multiple-NAND architecture. As a result, a 128x128 array is sufficient for the bike1 parameters, with zero padding at the higher part of the multiplicand. Furthermore, we introduce a data flow scheme that integrates carry-less multiplication and polynomial reduction operations to improve computational efficiency. The post-layout simulation results in 28 nm CMOS technology show that our fastest configuration design occupies an area of approximately 1.37 mm² with a low power consumption of around 14.17 mW. Compared with state-of-the-art hardware implementations, our proposed design improves the speed of polynomial multiplication by approximately 2.5 times.
14:10	Leandro Fiorin and Cristina Silvano MEPAD: A Memory-efficient Parallelized Direct Convolution Algorithm for Deep Neural Networks PRESENTER: Leandro Fiorin ABSTRACT. Deep Convolutional Neural Networks (CNNs) have been successfully used for processing images, videos, sounds, and more generic sensor data for detecting objects, patterns, and events. In this work, we propose MEPAD, a memory-efficient parallelized direct convolution algorithm for CNNs. We compare MEPAD with several approaches for implementing the convolution that have been proposed in the literature, by optimally mapping them on two implementations of the same SIMD target architectures. By taking as use cases the VGG-16 and TinyYOLOv2 CNNs, we focus on optimizing the memory behavior and energy consumption of the algorithm in each layer of the CNNs, and show that MEPAD can achieve a reduction of up to 85% in the energy-delay product (EDP) when compared to alternative approaches.

14:50-15:00 Session 13: Conference Closing

Chairs:

Jesus Carretero and Javier Garcia Blas

Location: Auditorium