PROGRAM
Monday, August 25th
10:30-11:00 Coffee Break
12:30-14:00 Lunch Break
15:30-16:00 Coffee Break
Tuesday, August 26th
10:30-11:00 Coffee Break
12:30-14:00 Lunch Break
15:30-16:00 Coffee Break
Wednesday, August 27th
10:30-11:00 Coffee Break
11:00-12:30 Session 11A: Track 2.1: Scheduling, Resource Management, Cloud, Edge Computing, and Workflows
11:00 | ARC-V: Vertical Resource Adaptivity for HPC Workloads in Containerized Environments
11:20 | An Autonomy Loop for Dynamic HPC Job Time Limit Adjustment
11:40 | Enabling Elasticity in Scientific Workflows for High Performance Computing Systems
12:00 | WAPA: A Workload-Agnostic CPI-Based Thread-to-Core Allocation Policy
11:00-12:30 Session 11B: Track 3.1: Neural Network Acceleration and Optimization
11:00 | FDHA: Fusion-Driven Heterogeneous Accelerator for Efficient Diffusion Model Inference
11:20 | CoQMoE: Co-Designed Quantization and Computation Orchestration for Mixture-of-Experts Vision Transformer on FPGA
11:40 | SkipNZ: Non-Zero Value Skipping for Efficient CNN Acceleration
12:00 | BATCH-DNN: Adaptive and Dynamic Batching for Multi-DNN Accelerators
11:00-12:30 Session 11C: Track 6.1: Memory and I/O Systems
11:00 | NetSenseML: Network-Adaptive Compression for Efficient Distributed Machine Learning
11:20 | Breaking the I/O Barrier: 1.2 Tb/s Ethernet Packet Processing on a GPU
11:40 | GECKO: A Write-Optimized Hybrid Index Based on Disaggregated Memory
12:00 | Scalable OpenMP Remote Offloading via Asynchronous MPI and Coroutine-Driven Communication
12:30-14:00 Lunch Break
14:00-15:00 Session 12A: Track 1.1: Performance Analysis and Simulation
14:00 | Making MPI Collective Operations Visible: Understanding Their Utility and Algorithmic Insights
14:20 | TSim4CXL: Trace-Driven Simulation Framework for CXL-Based High-Performance Computing Systems
14:40 | THAPI: Tracing Heterogeneous APIs
14:00-15:00 Session 12B: Track 6.2: Learning Systems
14:00 | SQ-DeAR: Sparsified and Quantized Gradient Compression for Distributed Training
14:20 | Accelerating Independent Multi-Agent Reinforcement Learning on Multi-GPU Platforms
14:40 | ScheInfer: Efficient Inference of Large Language Models with Task Scheduling on Moderate GPUs
15:00-16:00 Coffee Break and Poster Session
16:00-17:30 Session 13A: Track 2.2: Scheduling, Resource Management, Cloud, Edge Computing, and Workflows
16:00 | HAS-GPU: Efficient Hybrid Auto-Scaling with Fine-Grained GPU Allocation for SLO-Aware Serverless Inferences
16:20 | CGP-Graphless: Towards Efficient Serverless Graph Processing via CPU-GPU Pipelined Collaboration
16:40 | Design and Operation of Elastic GPU-Pooling on Campus
17:00 | ServerlessRec: Fast Serverless Inference for Embedding-Based Recommender Systems with Disaggregated Memory
16:00-17:30 Session 13B: Track 6.3: Stream, Image and Sequence Processing
16:00 | SProBench: Stream Processing Benchmark for High Performance Computing Infrastructure
16:20 | SWBWA: A Highly Efficient NGS Aligner on the New Sunway Architecture
16:40 | Efficient Pyramidal Analysis of Gigapixel Images on a Decentralized Modest Computer Cluster
Thursday, August 28th
10:00-10:30 Coffee Break
10:30-12:30 Session 15: Best Paper Session
10:30 | Noise Injection for Performance Bottleneck Analysis
10:50 | Approximation Bounds for SLACK on Identical Parallel Machines
11:10 | SimPoint+: More Stable, Accurate and Efficient Program Analysis
11:30 | AlphaSparseTensor: Discovering Faster Sparse Matrix Multiplication Algorithms on GPUs for LLM Inference
11:50 | Wedge-Parallel Triangle Counting for GPUs
12:10 | External GPU Biconnected Components
12:30-14:00 Lunch Break
14:00-15:30 Session 16A: Track 1.2: Compilers, Optimizations, and Scheduling
14:00 | CoSF: A Co-Optimization Framework for Operator Splitting and Fusion
14:20 | Scalable Code Generation for RTL Simulation of Deep Learning Accelerators with MLIR
14:40 | Scheduling Task and Data Parallelism in Array Languages with Work Assisting
15:00 | Polymorphic Higher-Order GPU Kernels
14:00-15:30 Session 16B: Track 4.1: Scalable AI Optimization and Parallel Training
14:00 | Saving Memory via Residual Reduction for DNN Training with Compressed Communication
14:20 | Interval-Asynchrony: Delimited Intervals of Localised Asynchrony for Fast Parallel SGD
14:40 | Robustness of Deep Learning Classification to Adversarial Input on GPUs: Asynchronous Parallel Accumulation Is a Source of Vulnerability
15:00 | Tutoring LLM into a Better CUDA Optimizer
14:00-15:30 Session 16C: Track 3.2: Architecture
14:00 | ParTEE: A Framework for Secure Parallel Computing of RISC-V TEEs
14:20 | ARM SVE Unleashed: Performance and Insights Across HPC Applications on Nvidia Grace
14:40 | CSGC: Collaborative File System Garbage Collection with Computational Storage
15:00 | SONet: Towards Practical Online Neural Network for Enhancing Hard-To-Predict Branches
15:30-16:00 Coffee Break
16:00-17:30 Session 17A: Track 3.3: Caching and Memory for ML
16:00 | CacheC: LLM-Based GPU Cache Management to Enhance Kernel Concurrency
16:20 | Cocache: An Accurate and Low-Overhead Dynamic Caching Method for GNNs
16:40 | DCI: An Efficient Workload-Aware Dual-Cache Allocation GNN Inference Acceleration System
17:00 | ReSpike: A Co-Design Framework for Evaluating SNNs on ReRAM-Based Neuromorphic Processors
16:00-17:30 Session 17B: Track 2.3: Scheduling, Resource Management, Cloud, Edge Computing, and Workflows
16:00 | MPLS: Stacking Diverse Layers into One Model for Decentralized Federated Learning
16:20 | Federated Learning within Global Energy Budget over Heterogeneous Edge Accelerators
16:40 | Auction-Based Placement of Functions in the Fog at Scale
17:00 | Bifröst: Peer-to-Peer Load-Balancing for Function Execution in Agentic AI Systems
16:00-17:30 Session 17C: Track 4.2: Efficient AI Inference and Model Serving at Scale
16:00 | TopServe: Task-Operator Co-Scheduling for Efficient Multi-DNN Inference Serving on GPUs
16:20 | EFIM: Efficient Serving of LLMs for Infilling Tasks with Improved KV Cache Reuse
16:40 | 2:4 Pruning on Edge Devices: Performance, Energy Efficiency and Accuracy
17:00 | Light-DiT: An Importance-Aware Dynamic Compression Framework for Diffusion Transformers
Friday, August 29th
10:00-10:30 Coffee Break
10:30-12:00 Session 19A: Track 2.4: Scheduling, Resource Management, Cloud, Edge Computing, and Workflows
10:30 | DynoInfer: Adaptive Resource Orchestration for LLM Inference on Resource-Constrained PCs
10:50 | Container Workload Prediction Using Deep Domain Adaptation in Transfer Learning
11:10 | KarmaPM: Reward-Driven Power Manager
11:30 | A Sparsity Predicting Approach for General Large Language Models via Activation Pattern Clustering
10:30-12:00 Session 19B: Track 4.3: Distributed systems, Compression, and Federated Applications
10:30 | DiffNO: Neural Operator Learning Using Physically Structured Constrained Diffusion Model
10:50 | Scalable Compression of Massive Data Collections on HPC Systems
11:10 | On-Device Federated Learning for Remote Alpine Livestock Monitoring
11:30 | IAUG: Accelerating Augmentation with Importance Sampling in Deep Neural Network Training
10:30-12:00 Session 19C: Track 5.1: Theory and Algorithms
10:30 | Cache Management for Mixture-of-Experts LLMs
10:50 | Near-Optimal Contraction Strategies for the Scalar Product in the Tensor-Train Format
11:10 | Supervised Distributed Computing
11:30 | Partial Detectors Versus Replication to Cope with Silent Errors
10:30-12:00 Session 19D: Track 6.4: Graph Algorithms and Linear Algebra
10:30 | Uniform Dense Blocking for Efficient Sparse LU Factorization in First-Principles Materials Simulation
10:50 | Efficient Task Graph Scheduling for Parallel QR Factorization in SLSQP
11:10 | ScaleRunner: A Fast MPI-Based Random Walk Engine for Multi-CPU Systems
12:00-13:30 Lunch Break
13:30-14:30 Session 20A: Track 2.5: Scheduling, Resource Management, Cloud, Edge Computing, and Workflows
13:30 | Leveraging Expert Usage to Speed up LLM Inference with Expert Parallelism
13:50 | Priority-BF: A Task Manager for Priority-Based Scheduling
14:10 | Green Scheduling on the Edge
13:30-14:30 Session 20B: Track 5.2: Theory and Algorithms
13:30 | Byzantine-Tolerant Consensus in GPU-Inspired Shared Memory
13:50 | Partitioning In-Place on Massively Parallel Systems
13:30-14:30 Session 20C: Track 6.5: GPU and Quantum Systems
13:30 | Disaggregated Design for GPU-Based Volumetric Data Structures
13:50 | Quantum Delta Encoding: Optimizing Data Storage on Quantum Computers with Resource Efficiency
14:10 | SimPart: A Simple Yet Effective Replication-Aided Partitioning Algorithm for Logic Simulation on GPU