Bridging the Data Gaps to Democratize AI in Science, Education and Society
The democratization of Artificial Intelligence (AI) necessitates an ecosystem where data and research infrastructure are seamlessly integrated and universally accessible. This talk gives an overview of the imperative to bridge the gaps between these components through robust services, fostering an inclusive AI landscape that empowers diverse research communities and domains. The National Data Platform (NDP) aims to lower the barriers to entry for AI research and applications through an integrated services approach that streamlines AI workflows, from data acquisition to model deployment. This approach underscores the importance of open, extensible, and equitable systems in driving forward the capabilities of AI, ultimately contributing to the resolution of grand scientific and societal challenges. Through real case studies leveraging open data platforms and scalable research infrastructure, the talk will highlight the role of composable systems and services in NDP in catalyzing a platform that empowers users from all backgrounds to engage in meaningful research, learning, and discovery.
13:30 | Efficient Code Region Characterization through Automatic Performance Counters Reduction using Machine Learning Techniques PRESENTER: Suren Harutyunyan Gevorgyan ABSTRACT. Leveraging hardware performance counters provides valuable insights into system resource utilization, aiding performance analysis and tuning for parallel applications. The available counters vary with architecture and are collected at execution time. Their abundance, combined with the limited number of registers available for measurement, makes gathering them laborious and costly. Efficient characterization of parallel regions therefore necessitates a dimension reduction strategy. While recent efforts have focused on manually reducing the number of counters for specific architectures, this paper introduces a novel approach: an automatic dimension reduction technique for efficiently characterizing parallel code regions across diverse architectures. The methodology is based on Machine Learning ensembles because of their precision and their ability to capture diverse relationships between the input features and the target variables. Evaluation results show that ensembles can successfully reduce the number of hardware performance counters that characterize a code region. We validate our approach on CPUs using a comprehensive dataset of OpenMP regions, showing that any region can be accurately characterized by 8 relevant hardware performance counters. In addition, we apply the proposed methodology on GPUs using a reduced set of kernels, demonstrating its effectiveness across various hardware configurations and workloads.
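As a rough illustration of the idea (not the authors' pipeline), an ensemble model's feature importances can be used to shrink a large counter set to a small relevant subset. The counter names, the target metric, and the data below are hypothetical; this is a minimal sketch assuming scikit-learn is available.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def select_counters(X, y, counter_names, k=8):
    """Rank hardware counters by ensemble feature importance and keep the top k.

    X : (n_regions, n_counters) counter readings per code region
    y : (n_regions,) target metric characterizing the region (e.g., runtime)
    """
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X, y)
    ranking = np.argsort(model.feature_importances_)[::-1]
    return [counter_names[i] for i in ranking[:k]]

# Hypothetical usage with synthetic data and made-up counter names.
rng = np.random.default_rng(0)
X = rng.random((100, 30))
y = 2.0 * X[:, 0] + X[:, 5] + rng.normal(scale=0.01, size=100)
names = [f"PAPI_CTR_{i}" for i in range(30)]
print(select_counters(X, y, names, k=8))
```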
13:50 | FlexiGran: Flexible Granularity Locking in Hierarchies PRESENTER: Anju Mongandampulath Akathoott ABSTRACT. Locking continues to be a primary technique used to achieve thread synchronization. Especially in the case of rooted hierarchies, semantic locking can be achieved with physical locks at various granularity levels. Such multi-granularity locking (MGL) provides an interesting trade-off between the locking cost and the size of the locked sub-hierarchy. At one extreme, fine-grained locking precisely locks the nodes of interest, but incurs a high locking cost. In contrast, a coarse-grained lock may lock the root of the hierarchy, minimizing the locking cost but locking many more nodes than required. Existing approaches to MGL (i) do not work well with non-tree hierarchies such as DAGs, (ii) disallow structural updates to the hierarchy, (iii) do not support the co-existence of fine-grained and coarse-grained locks, or (iv) are rigid in their underlying memory usage. In this work, we propose a versatile technique named FlexiGran, which has none of these issues. It allows the co-existence of hierarchical and fine-grained locks in an arbitrarily shaped hierarchy that can undergo structural alterations at run time, while allowing a user to control its memory usage by adding optional approximations. We illustrate the effectiveness of FlexiGran using STMBench7, and compare it empirically with two recent locking techniques, DomLock and HiFi. On a static hierarchy with more than 1 million nodes, FlexiGran shows an improvement in throughput of around 159% and 374% on average, compared to HiFi and DomLock respectively.
14:10 | ESIMD GPU implementations of Deep Learning Sparse Matrix Kernels PRESENTER: Christoph Bauinger ABSTRACT. We demonstrate that explicit SIMD programming on GPUs can outperform traditional programming environments such as CUDA and SYCL for three sparse matrix computations found in deep learning applications. Intel oneAPI's Explicit SIMD (ESIMD) SYCL extension API allows for simpler vectorization of arithmetic and memory operations, which is critical to achieving good performance. We explore sparse matrix operations relevant to deep learning applications, namely sparse-dense matrix multiplication (SPMM), sampled dense-dense matrix multiplication (SDDMM), and the composition of SDDMM with SPMM (FusedMM). Our ESIMD optimizations target the Intel Data Center GPU Max 1550, a device similar to the Intel GPU used in the nascent Aurora exascale system being deployed at Argonne National Laboratory. We evaluate performance on the test dataset used by previous work; our implementation outperforms state-of-the-art CUDA implementations on the latest NVIDIA hardware by up to a factor of 6.14. Additionally, our proposed implementation outperforms Intel's oneMKL implementation on Intel's GPU.
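For readers unfamiliar with the three kernels, here is a minimal NumPy/SciPy reference of SPMM, SDDMM, and FusedMM. The shapes and the convention of scaling SDDMM by the values of the sparse operand are common but assumed here; the paper's contribution is hand-vectorized ESIMD GPU kernels, not this reference code.

```python
import numpy as np
import scipy.sparse as sp

def spmm(S, D):
    """SPMM: sparse S times dense D."""
    return S @ D

def sddmm(S, A, B):
    """SDDMM: compute (A @ B) only at the nonzero positions of S,
    scaled by the corresponding values of S."""
    rows, cols = S.nonzero()
    prods = np.einsum("ij,ji->i", A[rows, :], B[:, cols])  # per-nonzero dot products
    vals = prods * np.asarray(S[rows, cols]).ravel()
    return sp.csr_matrix((vals, (rows, cols)), shape=S.shape)

def fusedmm(S, A, B, D):
    """FusedMM: the composition of SDDMM with SPMM."""
    return spmm(sddmm(S, A, B), D)

# Hypothetical shapes loosely resembling a graph-learning workload.
S = sp.random(256, 256, density=0.05, format="csr", random_state=0)
A = np.random.rand(256, 64)
B = np.random.rand(64, 256)
D = np.random.rand(256, 64)
print(fusedmm(S, A, B, D).shape)  # (256, 64)
```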
13:30 | Accelerating Stencil Computation with Fully Homomorphic Encryption Using GPU PRESENTER: Pei Li ABSTRACT. Stencil computation with fully homomorphic encryption (FHE) is an emerging area with significant potential to address the challenges of sensitive data processing and secure outsourced computing. However, the computational overhead introduced by FHE can drastically reduce the performance of stencil computations compared to unencrypted implementations. This paper proposes two optimized algorithms for stencil computation with FHE tailored to GPU platforms: Matrix Overlap Processing (MOP) and Matrix Fixed-point Processing (MFP). MOP divides the input matrix into multiple slices and then encrypts the elements located at the same positions across different slices into a single ciphertext, which is subsequently processed using an identical computing pattern. MFP directly encrypts the neighboring elements into ciphertexts and stores them in a table, which is then processed in parallel on the GPU. The experimental results show that our proposed methods achieve significant speedups compared to the corresponding OpenMP implementations on CPU. Specifically, the MOP implementation achieves a speedup of 8.7x, while the MFP implementation achieves a speedup of 10.3x on the GPU. These results demonstrate the potential of our proposed methods for enabling secure and efficient stencil computations on sensitive data.
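For context, the sketch below is the plain, unencrypted 5-point stencil sweep that such work starts from; it is not the paper's method. Under FHE, each addition and multiplication in this loop would become a homomorphic operation on packed ciphertexts, which is exactly the overhead MOP and MFP are designed to amortize.

```python
import numpy as np

def jacobi_step(grid):
    """One unencrypted 5-point stencil sweep over a 2D grid (interior only)."""
    out = grid.copy()
    out[1:-1, 1:-1] = 0.25 * (grid[:-2, 1:-1] + grid[2:, 1:-1]
                              + grid[1:-1, :-2] + grid[1:-1, 2:])
    return out

# Tiny example on a random grid.
grid = np.random.rand(8, 8)
print(jacobi_step(grid).shape)  # (8, 8)
```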
13:50 | A Framework for Automated Parallel Execution of Scientific Multi-Workflow Applications in the Cloud with Work Stealing PRESENTER: Helena Schubert da Incarnacao Lima da Silva ABSTRACT. In this paper, we propose and evaluate an MPI/OpenMP framework to execute cloud applications composed of linear scientific multi-workflows with unknown task execution times and substantial I/O activity. In order to achieve load balancing, our framework incorporates a two-level work stealing strategy, with intra-node and inter-node stealing. The framework was evaluated on a cluster of 16 virtual machine (VM) instances (4 vCPUs each), deployed on AWS ParallelCluster. The results show that, for a real Bioinformatics application composed of 400 workflows, we reduce the execution time from 1 hour and 57 minutes (sequential) to 2 minutes and 52 seconds (16 instances), achieving a speedup of 40.89x with 64 threads.
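A toy illustration of the intra-node half of work stealing, not the paper's MPI/OpenMP framework: each worker owns a deque of tasks, pops from its own tail, and steals from a random victim's head when it runs empty. Task contents and worker counts are made up; the inter-node (MPI) level and I/O-aware handling are omitted.

```python
import collections
import random
import threading

class Worker(threading.Thread):
    """Worker thread that executes tasks from its own deque and steals otherwise."""
    def __init__(self, wid, deques, results):
        super().__init__()
        self.wid, self.deques, self.results = wid, deques, results

    def run(self):
        while True:
            task = self.take_or_steal()
            if task is None:
                return                      # no queued work anywhere: stop
            self.results.append(task())     # execute the task

    def take_or_steal(self):
        try:
            return self.deques[self.wid].pop()          # own tail
        except IndexError:
            victims = [d for i, d in enumerate(self.deques) if i != self.wid]
            random.shuffle(victims)
            for v in victims:
                try:
                    return v.popleft()                   # steal from victim's head
                except IndexError:
                    continue
            return None

# Hypothetical usage: 4 workers, 40 tasks of unknown (here trivial) cost.
deques = [collections.deque() for _ in range(4)]
for i in range(40):
    deques[i % 4].append(lambda i=i: i * i)
results = []
workers = [Worker(w, deques, results) for w in range(4)]
for w in workers: w.start()
for w in workers: w.join()
print(len(results), "tasks completed")
```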
14:10 | Accelerating Large-Scale Sparse LU Factorization for RF Circuit Simulation PRESENTER: Guofeng Feng ABSTRACT. Sparse LU factorization is the indispensable building block of circuit simulation and dominates the simulation time, especially when dealing with large-scale circuits. RF circuits have been increasingly emphasized with the evolution of ubiquitous wireless communication (e.g., 5G and WiFi). The RF simulation matrices show a distinctive pattern of structured dense blocks, and this pattern has been inadvertently overlooked by prior works, leading to the underutilization of computational resources. In this paper, by exploiting the block structure, we propose a novel blocked format for the L and U factors and re-design the large-scale sparse LU factorization accordingly, leveraging the data locality inherent in RF matrices. The data format transformation is streamlined, strategically eliminating redundant data movement and costly indirect memory access. Moreover, vector operations are converted into matrix operations, enabling efficient data reuse and enhancing data-level parallelism. The experimental results show that our method achieves superior performance to state-of-the-art implementations.
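To make the baseline operation concrete, the snippet below solves a toy sparse linear system with an off-the-shelf sparse LU factorization; in circuit simulation this factorization is the dominant cost. The matrix here is random and purely illustrative, and the paper's blocked L/U format for structured dense blocks is not reproduced.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

# Toy system G x = b solved via sparse LU (a stand-in for a circuit matrix).
n = 1000
G = sp.random(n, n, density=0.002, format="csr", random_state=0) + 4.0 * sp.eye(n)
lu = splu(G.tocsc())        # sparse LU factorization
b = np.ones(n)
x = lu.solve(b)
print("residual:", np.linalg.norm(G @ x - b))
```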
13:30 | DProbe: Profiling and Predicting Multi-Tenant Deep Learning Workloads for GPU Resource Scaling PRESENTER: Zechun Zhou ABSTRACT. The surge in deep learning services has precipitated the development of modern large-scale GPU datacenters, which cater to the computational demands of multi-tenant deep learning workloads. These facilities implement virtual cluster partitioning to maintain isolation across product groups. Scaling resource allocation within virtual clusters is crucial for enhancing resource utilization. However, effective resource scaling hinges on accurately forecasting resource demand trends, a task complicated by significant variations in GPU utilization among diverse deep learning instances. To address this issue, we propose DProbe, a system that predicts resource demand trends within virtual clusters using fine-grained profiling of multi-tenant deep learning workloads. First, DProbe employs a job profiler that integrates model-specific attributes with runtime hardware metrics to perform performance modeling for deep learning instances. Resource demands are then estimated through a multi-level approach, considering the distribution of instances across varying levels of GPU utilization. Additionally, DProbe incorporates a multi-task trend predictor to anticipate future resource demand trends based on historical traces. DProbe's predictions enable precise resource scaling across virtual clusters. We evaluate DProbe using production traces across five scheduling policies and show that it reduces the average job queuing delay by 22.4% to 50.7%.
13:50 | Towards High-Performance Transactions via Hierarchical Blockchain Sharding PRESENTER: Haibo Tang ABSTRACT. Blockchain sharding, a promising approach to improve system performance, divides the network into several small parallel working shards. However, the performance of existing sharded blockchain systems may degrade seriously due to the existence of cross-shard transactions. To overcome such drawbacks, we propose a blockchain system called HieraChain that processes transactions with robust cross-shard transaction tolerance, based on a novel hierarchical sharding architecture. The upper-layer shards order the cross-shard transactions and the participants process them asynchronously to pipeline transaction ordering. Furthermore, HieraChain employs an optimized locality-aware protocol to trade off local access patterns against induced remote access events. Extensive experimental results demonstrate that HieraChain significantly outperforms state-of-the-art approaches in the presence of cross-shard transactions, achieving up to 3x and 2x higher throughput than Saguaro and SharPer under a general workload, respectively. Moreover, our locality-aware approach further reduces transaction latency by 68% and 51% compared to our basic approach and traditional baselines, respectively.
14:10 | Automated Data Management and Learning-based Scheduling for Ray-based Hybrid HPC-Cloud Systems PRESENTER: Tingkai Liu ABSTRACT. HPC-Cloud hybrid systems are gaining popularity among scientists for their ability to manage sudden demand spikes, resulting in faster turnaround times for HPC workloads. However, deploying workloads on such systems currently requires complicated configuration, particularly for data migration across HPC clusters and the Cloud. Additionally, existing schedulers lack support for workload scheduling on such hybrid systems. To address these issues, we have designed and implemented an HPC-Cloud bursting system based on Ray, an open-source distributed framework. Our system integrates automated data management with learning-based scheduling at the function level, using a dynamic label-based design. It automatically prefetches data files based on demand and detects data movement and execution patterns to inform future scheduling decisions. The developed framework is evaluated with two workloads: machine learning model training and image processing. We compare its performance against naive data fetching under various network speeds and storage locations. Results indicate the effectiveness of our system across all scenarios. The system is open source, and the source code and replication packages for reproducing the experimental results are provided.
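As a rough sketch of label-style placement with Ray's custom resources (not the authors' system, which layers automated prefetching and learned scheduling on top of mechanisms like this): the function names and the "hpc"/"cloud" labels below are hypothetical, and a single local node is started with both labels so the example runs standalone.

```python
import ray

# Start a single-node Ray instance that advertises two custom resources; in a
# real hybrid deployment, each HPC or Cloud node would advertise its own label.
ray.init(resources={"hpc": 2, "cloud": 2})

@ray.remote(resources={"hpc": 1})
def simulate(chunk):
    # placeholder for a compute-heavy function pinned to HPC-labeled nodes
    return sum(x * x for x in chunk)

@ray.remote(resources={"cloud": 1})
def aggregate(partials):
    # placeholder for an aggregation function pinned to Cloud-labeled nodes
    return sum(partials)

partials = [simulate.remote(range(i * 1000, (i + 1) * 1000)) for i in range(4)]
print(ray.get(aggregate.remote(ray.get(partials))))
```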
13:30 | PCTC: Hardware and Software Co-Design for Pruned Capsule Networks on Tensor Cores PRESENTER: Ehsan Atoofian ABSTRACT. Capsule Networks (CapsNets) are a generation of image classifiers that have taken the spotlight from convolutional neural networks (CNNs). Unlike CNNs, CapsNets are robust to affine transformation and are able to learn spatial relationships between features of an image. Since CapsNets require significant computing horsepower due to intensive matrix operations, GPUs have become the primary hardware platform for executing them. In particular, GPUs equipped with tensor cores (TCs) are an attractive solution to address the computational requirements of CapsNets, as TCs are designed to accelerate matrix operations. However, CapsNets deployed on TCs underutilize computational units because TCs are designed to run dense matrix operations. We propose Pruned Capsule TC (PCTC), a software/hardware co-design approach that avoids underutilization of TC resources. In particular, PCTC changes the sequence of matrix operations for capsule layers so that sparse operations are eliminated. PCTC further enhances the execution of CapsNets on TCs by eliminating those matrix operations that are not necessary to maintain the accuracy of the network. Quite often, CapsNets are designed with large capsules to increase accuracy. By pruning individual capsules, it is feasible to reduce the over-provisioned parameter space and reduce energy consumption in CapsNets. Evaluation results reveal that PCTC can achieve 31% energy saving for CapsNet inference, with negligible accuracy loss.
13:50 | PRESENTER: Zewen Ye ABSTRACT. Bit Flipping Key Encapsulation (BIKE) is a code-based key encapsulation mechanism that utilizes Quasi-Cyclic Medium Density Parity-Check (QC-MDPC) codes and is a promising candidate in the National Institute of Standards and Technology (NIST) Post-Quantum Cryptography (PQC) standardization process. Polynomial multiplication is the most critical operation in BIKE and limits the speed of key generation and encapsulation. The high degree of the polynomial not only requires millions of computations but also necessitates complex memory access. To address this issue, we propose a Computation-in-Memory (CIM) based accelerator architecture for polynomial multiplication operations in BIKE. To minimize the size of the CIM core while maintaining high computational throughput, we propose a folded mapping strategy and a one-memory-multiple-NAND architecture. As a result, a 128x128 array is sufficient for the bike1 parameters, with zero padding at the higher part of the multiplicand. Furthermore, we introduce a data flow scheme that integrates carry-less multiplication and polynomial reduction operations to improve computational efficiency. The post-layout simulation results in 28 nm CMOS technology show that our fastest configuration design occupies an area of approximately 1.37 mm² with a low power consumption of around 14.17 mW. Compared with state-of-the-art hardware implementations, our proposed design improves the speed of polynomial multiplication by approximately 2.5 times.
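For readers unfamiliar with the underlying arithmetic, here is a tiny software reference of carry-less (GF(2)) polynomial multiplication followed by cyclic reduction in F2[x]/(x^r - 1), the ring BIKE works in. This only shows what the operation computes, not the CIM accelerator's folded mapping or data flow; the small r and operands are purely illustrative (real BIKE uses r in the tens of thousands).

```python
def clmul_mod(a, b, r):
    """Carry-less multiplication of bit-vectors a, b (integers whose bits are
    GF(2) polynomial coefficients), reduced mod x^r - 1 (so x^r == 1)."""
    prod = 0
    while b:
        if b & 1:
            prod ^= a        # XOR is addition over GF(2)
        a <<= 1
        b >>= 1
    # Cyclic reduction: fold the high part back onto the low part.
    lo = prod & ((1 << r) - 1)
    hi = prod >> r
    while hi:
        lo ^= hi & ((1 << r) - 1)
        hi >>= r
    return lo

# Hypothetical small example: (x^3 + x + 1) * (x^2 + x) mod (x^7 - 1).
print(bin(clmul_mod(0b1011, 0b0110, 7)))
```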
14:10 | MEPAD: A Memory-efficient Parallelized Direct Convolution Algorithm for Deep Neural Networks PRESENTER: Leandro Fiorin ABSTRACT. Deep Convolutional Neural Networks (CNNs) have been successfully used to process images, videos, sounds, and more generic sensor data for detecting objects, patterns, and events. In this work, we propose MEPAD, a memory-efficient parallelized direct convolution algorithm for CNNs. We compare MEPAD with several approaches for implementing convolution proposed in the literature, by optimally mapping them onto two implementations of the same SIMD target architecture. Taking the VGG-16 and TinyYOLOv2 CNNs as use cases, we focus on optimizing the memory behavior and energy consumption of the algorithm in each layer of the CNNs, and show that MEPAD can achieve a reduction of up to 85% in the energy-delay product (EDP) compared to alternative approaches.
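For context, "direct" convolution computes each output element from the input patch without materializing an im2col buffer. The naive NumPy sketch below shows that baseline computation only; the loop ordering, tiling, and SIMD mapping that make MEPAD memory-efficient are the paper's contribution and are not reproduced here. Layouts (NCHW input, OIHW weights) are assumptions for illustration.

```python
import numpy as np

def direct_conv(x, w, stride=1):
    """Naive direct convolution: NCHW input x, OIHW weights w, no padding."""
    n, c, h, wd = x.shape
    o, _, kh, kw = w.shape
    oh, ow = (h - kh) // stride + 1, (wd - kw) // stride + 1
    y = np.zeros((n, o, oh, ow), dtype=x.dtype)
    for b in range(n):
        for f in range(o):
            for i in range(oh):
                for j in range(ow):
                    patch = x[b, :, i*stride:i*stride+kh, j*stride:j*stride+kw]
                    y[b, f, i, j] = np.sum(patch * w[f])
    return y

# Tiny sanity check with random data.
x = np.random.rand(1, 3, 8, 8).astype(np.float32)
w = np.random.rand(4, 3, 3, 3).astype(np.float32)
print(direct_conv(x, w).shape)  # (1, 4, 6, 6)
```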