Talk Keyword Index

This page contains an index consisting of author-provided keywords.

3
3D reconstruction	EAGER: Energy-Aware 3D Gaussian Splatting on Embedded Parallel Heterogeneous Systems
A
Activation Sparsity	A Sparsity Predicting Approach for General Large Language Models via Activation Pattern Clustering
actor-based concurrency	Comparative Analysis of Energy Efficiency in Actor-Based Applications in Distributed Environments
Adaptive scheduling	Enabling Elasticity in Scientific Workflows for High Performance Computing Systems
agentic-ai	Bifröst: Peer-to-peer Load-balancing for Function Execution in Agentic AI Systems
AI	Tutoring LLM into a Better CUDA Optimizer
	From Reactive Debugging to Proactive Detection: AI for Performance-Aware Software Development
AI Accelerators	SkipNZ: Non-Zero Value Skipping for Efficient CNN Acceleration
AIPC	DynoInfer: Adaptive Resource Orchestration for LLM Inference on Resource-Constrained PCs
Algorithmic Skeletons	Polymorphic Higher-Order GPU Kernels
AlphaTensor	AlphaSparseTensor: Discovering Faster Sparse Matrix Multiplication Algorithms on GPUs for LLM Inference
AMR	Efficient Anisotropic Mesh Refinement with Omnitrees ...or How to Get Cat GIFs Into Your Paper
Analog in-memory computing	ReSpike: A Co-Design Framework for Evaluating SNNs on ReRAM-based Neuromorphic Processors
analytics	Open, cross-architecture acceleration of data analytics with SYCL and RISC-V
Apache Spark	Optimized Parallel Metaheuristics for Big Data Processing on GPUs with Apache Spark
Application malleability	Comparative Analysis of Algorithms for Malleability Decision-Making in Applications and File Systems
Approximate Computing	Mixed precision over GPU applied to a Microphysics model
Approximation ratio	Approximation Bounds for SLACK on Identical Parallel Machines
ARM SVE	ARM SVE Unleashed: Performance and Insights Across HPC Applications on Nvidia Grace
Articulation points	External GPU Biconnected Components
Asynchronous	Fault-Tolerant Distributed Federated Learning with Adaptive Termination Detection
Asynchronous Data Processing	Interval-Asynchrony: Delimited Intervals of Localised Asynchrony for Fast Parallel SGD
Asynchronous Programming	On-the-fly Performance Analysis of Asynchronous Parallel Execution
auction	Auction-based Placement of Functions in the Fog at Scale
Auto-scaling	HAS-GPU: Efficient Hybrid Auto-scaling with Fine-grained GPU Allocation for SLO-aware Serverless Inferences
auto-scheduling	Boosting Performance of Counting Queries in Machine Learning Applications with a ccNUMA-aware Implementation
Autonomy Loops	An Autonomy Loop for Dynamic HPC Job Time Limit Adjustment
Autoscaling	ServerlessRec: Fast Serverless Inference for Embedding-based Recommender Systems with Disaggregated Memory
B
Batch processing	External GPU Biconnected Components
Benchmark Suite	SProBench: Stream Processing Benchmark for High Performance Computing Infrastructure
benchmark suite optimization	OpenDwarfs 2025: Modernizing the OpenDwarfs Benchmark Suite for Heterogeneous Computing
Benchmarking	A Comparative Study of Streaming Graph Processing Systems
Benchmarks	Heterogeneous computing, storage and network infrastructures for medical applications
Biconnected components	External GPU Biconnected Components
Big Data	Scalable Compression of Massive Data Collections on HPC Systems
	Optimized Parallel Metaheuristics for Big Data Processing on GPUs with Apache Spark
Bioinformatics	SWBWA: A Highly Efficient NGS Aligner on the New Sunway Architecture
bottleneck detection	Noise injection for performance bottleneck analysis
Branch prediction	SONet: Towards Practical Online Neural Network for Enhancing Hard-To-Predict Branches
Byzantine failures	Byzantine-Tolerant Consensus in GPU-Inspired Shared Memory
C
Cache	A Hybrid DMA-Cache Mechanism to Leverage Memory Bandwidth in Massive-Parallel Processors
Cache management	Cache Management for Mixture-of-Experts LLMs
Caching / Paging	Cache Management for Mixture-of-Experts LLMs
Caching update	Cocache: An Accurate And Low-overhead Dynamic Caching Method for GNNs
carbon cost	Green Scheduling on the Edge
Carbon emissions	Analysis of the carbon footprint of HPC
CAS	Byzantine-Tolerant Consensus in GPU-Inspired Shared Memory
CCLs	Targeted data movement optimizations for emerging heterogeneous supercomputers
ccNUMA	Boosting Performance of Counting Queries in Machine Learning Applications with a ccNUMA-aware Implementation
CFD	Portable High-Performance Kernel Generation for a Computational Fluid Dynamics Code with DaCe
Chapel	A Portable Branch-and-Bound Algorithm for Cross-Architecture Multi-GPU Systems
Checkpointing	An Autonomy Loop for Dynamic HPC Job Time Limit Adjustment
cloud	Green Scheduling on the Edge
	SCOPE: Accelerating ML data pipeline using cloud-based computational storage
cloud computing	ARC-V: Vertical Resource Adaptivity for HPC Workloads in Containerized Environments
	CoreWaterfall: a Virtual-Core-Focused Scheduling and Allocation Algorithm for Oversubscribed Virtual Machines
Cloud computing	Alumet: a modular framework to standardize the measurement of energy consumption
Cloud Continuum	A Holistic Approach to Complexity Management and Multidimensional Analysis in Computing Continuum
Cloud Robotics	Light Weight Scalable DevOps for Cloud Robotics
cloud-to-thing	Auction-based Placement of Functions in the Fog at Scale
Clouds	Heterogeneous computing, storage and network infrastructures for medical applications
CloudSim	CoreWaterfall: a Virtual-Core-Focused Scheduling and Allocation Algorithm for Oversubscribed Virtual Machines
Clustering	A Sparsity Predicting Approach for General Large Language Models via Activation Pattern Clustering
CNN Inference	Portable and Scalable FPGA Emulation of a Massive-Parallel Vector Processor
Co-Design Framework	ReSpike: A Co-Design Framework for Evaluating SNNs on ReRAM-based Neuromorphic Processors
co-running	KarmaPM: Reward-Driven Power Manager Power Scheduling on Multicore Multiprocessor Systems for Maximizing Throughput and Fairness
Code generation	Scheduling Task and Data Parallelism in Array Languages with Work Assisting
	Advanced Techniques in Polyhedral Model-Based Compilers for Efficient and Cross-Platform Code Generation on Multicore Processors
collaborative system of systems	HPC Software as a Service: A Flexible Approach to Data Logistics
Collective Algorithms	Making MPI Collective Operations Visible: Understanding Their Utility and Algorithmic Insights
Collective communication	SQ-DeAR: Sparsified and Quantized Gradient Compression for Distributed Training
Competitive analysis	Cache Management for Mixture-of-Experts LLMs
compilation	Noise injection for performance bottleneck analysis
Compiler Optimization	CoSF: A Co-Optimization Framework for Operator Splitting and Fusion
Composability	A Case Study for Resolving Composability Issues Using a Shared CPU Resource Coordinator
Compressed Communication	Saving Memory via Residual Reduction for DNN Training with Compressed Communication
Computational Efficiency	SkipNZ: Non-Zero Value Skipping for Efficient CNN Acceleration
Computational Fluid Dynamics	Exploring Flow Fields at Scale: GPU-Accelerated Scientific Visualization for Exascale CFD
computational storage device	CSGC: Collaborative File System Garbage Collection with Computational Storage
computational workflows	Simplifying distributed workflows: A portable approach for Cloud and HPC
compute	Auction-based Placement of Functions in the Fog at Scale
Compute Express Link (CXL)	TSim4CXL: Trace-driven Simulation Framework for CXL-based High-Performance Computing Systems
Computed Tomography	A Computer-aided Framework for Detecting Osteosarcoma in Computed Tomography Scans
Computer architecture	SONet: Towards Practical Online Neural Network for Enhancing Hard-To-Predict Branches
Computing Continuum	A Holistic Approach to Complexity Management and Multidimensional Analysis in Computing Continuum
Concurrent kernel execution	CacheC: LLM-based GPU Cache Management to Enhance Kernel Concurrency
Consensus	Byzantine-Tolerant Consensus in GPU-Inspired Shared Memory
Continuous Profiling	Thread Monitoring Tool: transparent characterization of threading patterns with eBPF
continuum	Auction-based Placement of Functions in the Fog at Scale
Convolutional Neural Networks	A Hybrid DMA-Cache Mechanism to Leverage Memory Bandwidth in Massive-Parallel Processors
Convolutional Neural Networks (CNN)	SkipNZ: Non-Zero Value Skipping for Efficient CNN Acceleration
counting queries	Boosting Performance of Counting Queries in Machine Learning Applications with a ccNUMA-aware Implementation
CPU	A Case Study for Resolving Composability Issues Using a Shared CPU Resource Coordinator
CPU utilization	KarmaPM: Reward-Driven Power Manager Power Scheduling on Multicore Multiprocessor Systems for Maximizing Throughput and Fairness
Critical Path	Tracking the Critical Path of Execution for GPU Offloading Applications
cross-facility workflows	HPC Software as a Service: A Flexible Approach to Data Logistics
CUDA	Tutoring LLM into a Better CUDA Optimizer
	AskLLVM: LLVM Code Generation for GPUs for Graph Algorithms
CUDA/HIP	A Portable Branch-and-Bound Algorithm for Cross-Architecture Multi-GPU Systems
Cut vertices	External GPU Biconnected Components
D
DAG	ParSolGen (Parallel Solvers Generator) - an automated numerical parallel programs generator for distributed memory parallel computers
Data Analysis of Scientific Computing	DiffNO: Neural Operator Learning using Physically Structured Constrained Diffusion Model
Data Augmentation	IAUG: Accelerating Augmentation with Importance Sampling in Deep Neural Network Training
Data Center	Towards Digital Twins of HPC Data Centres Modelling Infrastructure and HPC Systems for IT-Zauber
Data Classification	A Computer-aided Framework for Detecting Osteosarcoma in Computed Tomography Scans
Data Compression	Scalable Compression of Massive Data Collections on HPC Systems
data logistic	HPC Software as a Service: A Flexible Approach to Data Logistics
Data preprocessing	SCOPE: Accelerating ML data pipeline using cloud-based computational storage
data redistribution	Dynamic Data Redistribution for Malleable MPI Frameworks through Virtual Topologies
data streaming	Cyclic Data Streaming on GPUs for Short Range Stencils Applied to Molecular Dynamics
Data structure	Disaggregated Design for GPU-Based Volumetric Data Structures
data structures	Efficient Anisotropic Mesh Refinement with Omnitrees ...or How to Get Cat GIFs Into Your Paper
Data-intensive applications	Comparative Analysis of Algorithms for Malleability Decision-Making in Applications and File Systems
database	Open, cross-architecture acceleration of data analytics with SYCL and RISC-V
datacenter	Design and Operation of Elastic GPU-pooling on Campus
Dataflow Optimization	FDHA: Fusion-Driven Heterogeneous Accelerator for Efficient Diffusion Model Inference
dataflow programming	Simplifying distributed workflows: A portable approach for Cloud and HPC
Debugging	THAPI: Tracing Heterogeneous APIs
	From Reactive Debugging to Proactive Detection: AI for Performance-Aware Software Development
Decentralized	Fault-Tolerant Distributed Federated Learning with Adaptive Termination Detection
Decentralized Federated Learning	MPLS: Stacking Diverse Layers into One Model for Decentralized Federated Learning
Decentralized Systems	Efficient Pyramidal Analysis of Gigapixel Images on a Decentralized Modest Computer Cluster
Decoupled AllReduce	SQ-DeAR: Sparsified and Quantized Gradient Compression for Distributed Training
deep domain adaptation	Container Workload Prediction Using Deep Domain Adaptation in Transfer Learning
Deep Learning	IAUG: Accelerating Augmentation with Importance Sampling in Deep Neural Network Training
	2:4 Pruning on Edge Devices: Performance, Energy Efficiency and Accuracy
	Robustness of deep learning classification to adversarial input on GPUs: asynchronous parallel accumulation is a source of vulnerability
	A Computer-aided Framework for Detecting Osteosarcoma in Computed Tomography Scans
	SCOPE: Accelerating ML data pipeline using cloud-based computational storage
Deep Learning Serving Systems	TopServe: Task-Operator Co-Scheduling for Efficient Multi-DNN Inference Serving on GPUs
Deep Neural Network	CoSF: A Co-Optimization Framework for Operator Splitting and Fusion
dense vectors	Efficient handling of sparse vectors for parallel nonblocking execution in GraphBLAS
Dependency Flagging System	Modifying the HyperLedger Fabric Blockchain Architecture to increase throughput and decrease transaction rejections
Dependency-aware Transaction Processing	Modifying the HyperLedger Fabric Blockchain Architecture to increase throughput and decrease transaction rejections
Device Heterogeneity	MPLS: Stacking Diverse Layers into One Model for Decentralized Federated Learning
DevOps	Light Weight Scalable DevOps for Cloud Robotics
Diffusion Model	DiffNO: Neural Operator Learning using Physically Structured Constrained Diffusion Model
Diffusion Model Accelerator	FDHA: Fusion-Driven Heterogeneous Accelerator for Efficient Diffusion Model Inference
Diffusion Transformers	Light-DiT: An Importance-Aware Dynamic Compression Framework for Diffusion Transformers
Digital Twin	Towards Digital Twins of HPC Data Centres Modelling Infrastructure and HPC Systems for IT-Zauber
Directed Acyclic Task Graph (DATG) scheduling QR factorization	Efficient Task Graph Scheduling for Parallel QR Factorization in SLSQP
Disaggregated memory	ServerlessRec: Fast Serverless Inference for Embedding-based Recommender Systems with Disaggregated Memory
	GECKO: A Write-optimized Hybrid Index based on Disaggregated Memory
Distributed Computing	Supervised Distributed Computing
Distributed computing	Alumet: a modular framework to standardize the measurement of energy consumption
Distributed deep learning	H2O: Holistic Hyper-Parameter Optimization for Large-Scale Deep Neural Network Training
Distributed Dense Linear Algebra	ParSolGen (Parallel Solvers Generator) - an automated numerical parallel programs generator for distributed memory parallel computers
Distributed Sparse Linear Algebra	ParSolGen (Parallel Solvers Generator) - an automated numerical parallel programs generator for distributed memory parallel computers
Distributed Systems	SProBench: Stream Processing Benchmark for High Performance Computing Infrastructure
	NetSenseML: Network-Adaptive Compression for Efficient Distributed Machine Learning
	Comparative Analysis of Energy Efficiency in Actor-Based Applications in Distributed Environments
	Optimized Parallel Metaheuristics for Big Data Processing on GPUs with Apache Spark
Distributed training	SQ-DeAR: Sparsified and Quantized Gradient Compression for Distributed Training
Distributed Training and Inference	TH-Pulse: A Study on Hardware-Software Co-Designed Framework for LLM Training and Inference on the Tianhe new-generation supercomputer
distributed workflows	Simplifying distributed workflows: A portable approach for Cloud and HPC
distributed-computing	Bifröst: Peer-to-peer Load-balancing for Function Execution in Agentic AI Systems
DMA	A Hybrid DMA-Cache Mechanism to Leverage Memory Bandwidth in Massive-Parallel Processors
DNN Training	Saving Memory via Residual Reduction for DNN Training with Compressed Communication
	NetSenseML: Network-Adaptive Compression for Efficient Distributed Machine Learning
Docker containers	Container Workload Prediction Using Deep Domain Adaptation in Transfer Learning
DPDK	Breaking the I/O Barrier: 1.2 Tb/s Ethernet Packet Processing on a GPU
DSL	Polymorphic Higher-Order GPU Kernels
Dual-Cache	DCI: An Efficient Workload-Aware Dual-Cache Allocation GNN Inference Acceleration System
Dynamic caching method	Cocache: An Accurate And Low-overhead Dynamic Caching Method for GNNs
Dynamic Graphs	A Comparative Study of Streaming Graph Processing Systems
Dynamic programming	Near-optimal contraction strategies for the scalar product in the tensor-train format
	Advanced Techniques in Polyhedral Model-Based Compilers for Efficient and Cross-Platform Code Generation on Multicore Processors
Dynamic Resource Allocation	DynoInfer: Adaptive Resource Orchestration for LLM Inference on Resource-Constrained PCs
	Malleability in LAIK with MPI Dynamic Processes and PSets
dynamic resource allocation	Experimental Evaluation of Scheduling Strategies for Evolving Workflow-Based Applications
Dynamic Resource Management	Enabling Elasticity in Scientific Workflows for High Performance Computing Systems
	Design Principles of Dynamic Resource Management for High-Performance Parallel Programming Models
Dynamic resource management	Comparative Analysis of Algorithms for Malleability Decision-Making in Applications and File Systems
Dynamic Resources	Dynamic reconfiguration for malleable applications using RMA
E
eBPF	Thread Monitoring Tool: transparent characterization of threading patterns with eBPF
Edge Accelerators	Federated Learning within Global Energy Budget over Heterogeneous Edge Accelerators
Edge AI	A Sparsity Predicting Approach for General Large Language Models via Activation Pattern Clustering
	On-Device Federated Learning for Remote Alpine Livestock Monitoring
Edge computing	2:4 Pruning on Edge Devices: Performance, Energy Efficiency and Accuracy
Edge Network	MPLS: Stacking Diverse Layers into One Model for Decentralized Federated Learning
edge platform	Green Scheduling on the Edge
Edge-AI	Efficient FPGA-based GAN Accelerator Core for Edge-AI Platforms
Edge-Cloud Continuum	Federated Learning in the Edge-Cloud Continuum: A Task-Based Approach with Colony
Education	Making MPI Collective Operations Visible: Understanding Their Utility and Algorithmic Insights
Efficiency	A Sparsity Predicting Approach for General Large Language Models via Activation Pattern Clustering
Efficient Inference on Local platforms	ScheInfer: Efficient Inference of Large Language Models with Task Scheduling on Moderate GPUs.
Elastic Computing	Malleability in LAIK with MPI Dynamic Processes and PSets
Elastic HPC	Enabling Elasticity in Scientific Workflows for High Performance Computing Systems
Electronic Design Automation	Accelerating Gate Sizing using GPU
Elixir	Polymorphic Higher-Order GPU Kernels
Embedding Table	ServerlessRec: Fast Serverless Inference for Embedding-based Recommender Systems with Disaggregated Memory
Emerging Memory System	ReSpike: A Co-Design Framework for Evaluating SNNs on ReRAM-based Neuromorphic Processors
Empirical Comparison	A Comparative Study of Streaming Graph Processing Systems
energy awareness	SYCL for Energy-Efficient Computational Astrophysics: the case of DPEcho
Energy consumption	Alumet: a modular framework to standardize the measurement of energy consumption
Energy Efficiency	Quantifying the Energy Consumption and Carbon Emissions of LLM Inference via Simulations
	Analysis of the carbon footprint of HPC
	HPC Benchmark Game: Comparing Programming Languages Regarding Energy-Efficiency for Applications from the HPC Field
	Comparative Analysis of Energy Efficiency in Actor-Based Applications in Distributed Environments
	Towards Digital Twins of HPC Data Centres Modelling Infrastructure and HPC Systems for IT-Zauber
Energy measurement	Alumet: a modular framework to standardize the measurement of energy consumption
energy performance	SYCL for Energy-Efficient Computational Astrophysics: the case of DPEcho
Energy-Aware 3D Gaussian Splatting	EAGER: Energy-Aware 3D Gaussian Splatting on Embedded Parallel Heterogeneous Systems
energy-aware algorithms	Green Scheduling on the Edge
Energy-Aware Scheduling	Green Energy Aware Scheduling of Scientific Workflows with Flexible Deadlines
Energy-aware software engineering	Time-related effects in the measurement of energy consumption in evolutionary algorithms
Ethernet	Breaking the I/O Barrier: 1.2 Tb/s Ethernet Packet Processing on a GPU
Evolutionary computation	Time-related effects in the measurement of energy consumption in evolutionary algorithms
evolving applications	Experimental Evaluation of Scheduling Strategies for Evolving Workflow-Based Applications
F
FaaS	Auction-based Placement of Functions in the Fog at Scale
Fault Tolerance	Supervised Distributed Computing
Fault-Free	Fault-Tolerant Distributed Federated Learning with Adaptive Termination Detection
Federated Learning	Federated Learning within Global Energy Budget over Heterogeneous Edge Accelerators
	On-Device Federated Learning for Remote Alpine Livestock Monitoring
	Fault-Tolerant Distributed Federated Learning with Adaptive Termination Detection
	Federated Learning in the Edge-Cloud Continuum: A Task-Based Approach with Colony
FIM	EFIM: Efficient Serving of LLMs for Infilling Tasks with Improved KV Cache Reuse
First-principles materials simulation	Uniform Dense Blocking for Efficient Sparse LU Factorization in First-principles Materials Simulation
Floating-Point Non-Associativity	Robustness of deep learning classification to adversarial input on GPUs: asynchronous parallel accumulation is a source of vulnerability
Flooding	A framework for flooding early warning leveraging AI, HPC, and computing continuum
Flowshop Scheduling	A Portable Branch-and-Bound Algorithm for Cross-Architecture Multi-GPU Systems
fog	Auction-based Placement of Functions in the Fog at Scale
FPGA	Efficient FPGA-based GAN Accelerator Core for Edge-AI Platforms
FPGA Accelerator	CoQMoE: Co-Designed Quantization and Computation Orchestration for Mixture-of-Experts Vision Transformer on FPGA
FPGA Demonstrator	Portable and Scalable FPGA Emulation of a Massive-Parallel Vector Processor
FPGAs	Exploiting highly heterogenous systems with stencil applications
function-as-a-service	Bifröst: Peer-to-peer Load-balancing for Function Execution in Agentic AI Systems
	Auction-based Placement of Functions in the Fog at Scale
Functional array languages	Scheduling Task and Data Parallelism in Array Languages with Work Assisting
G
GANs	Efficient FPGA-based GAN Accelerator Core for Edge-AI Platforms
garbage collection	CSGC: Collaborative File System Garbage Collection with Computational Storage
Gate sizing	Accelerating Gate Sizing using GPU
GENE	In-Situ Techniques for the Efficient Coupling of Complex Plasma Turbulence Simulations: GENE and GENE-X
GENE-X	In-Situ Techniques for the Efficient Coupling of Complex Plasma Turbulence Simulations: GENE and GENE-X
Generate code	Tutoring LLM into a Better CUDA Optimizer
Generative Adversarial Networks	Efficient FPGA-based GAN Accelerator Core for Edge-AI Platforms
Genomics	Evaluating Energy Efficiency of Genomics Algorithms on Processing-in-Memory Architectures
Gigapixel Images	Efficient Pyramidal Analysis of Gigapixel Images on a Decentralized Modest Computer Cluster
GPU	Partitioning In-Place on Massively Parallel Systems
	Byzantine-Tolerant Consensus in GPU-Inspired Shared Memory
	Disaggregated Design for GPU-Based Volumetric Data Structures
	Breaking the I/O Barrier: 1.2 Tb/s Ethernet Packet Processing on a GPU
	External GPU Biconnected Components
	Portable High-Performance Kernel Generation for a Computational Fluid Dynamics Code with DaCe
	Cyclic Data Streaming on GPUs for Short Range Stencils Applied to Molecular Dynamics
	Performance optimization of GROMACS on modern Hardware
	FLEXI: Scale-resolving simulations of compressible turbulence on modern HPC systems
	Exploring Flow Fields at Scale: GPU-Accelerated Scientific Visualization for Exascale CFD
GPU Acceleration	AlphaSparseTensor: Discovering Faster Sparse Matrix Multiplication Algorithms on GPUs for LLM Inference
GPU allocation	HAS-GPU: Efficient Hybrid Auto-scaling with Fine-grained GPU Allocation for SLO-aware Serverless Inferences
GPU architectures	Advanced Techniques in Polyhedral Model-Based Compilers for Efficient and Cross-Platform Code Generation on Multicore Processors
GPU cache management	CacheC: LLM-based GPU Cache Management to Enhance Kernel Concurrency
GPU code generation	Scalable Code Generation for RTL Simulation of Deep Learning Accelerators with MLIR
GPU Computing	Optimized Parallel Metaheuristics for Big Data Processing on GPUs with Apache Spark
	A Portable Branch-and-Bound Algorithm for Cross-Architecture Multi-GPU Systems
GPU parallel	Accelerating Gate Sizing using GPU
GPU power modeling	Quantifying the Energy Consumption and Carbon Emissions of LLM Inference via Simulations
GPU programming	SYCL for Energy-Efficient Computational Astrophysics: the case of DPEcho
GPU scheduling	HAS-GPU: Efficient Hybrid Auto-scaling with Fine-grained GPU Allocation for SLO-aware Serverless Inferences
GPUs	Polymorphic Higher-Order GPU Kernels
	Design and Operation of Elastic GPU-pooling on Campus
	Scalable OpenMP Remote Offloading via Asynchronous MPI and Coroutine-Driven Communication
Grace Hopper	Breaking the I/O Barrier: 1.2 Tb/s Ethernet Packet Processing on a GPU
Gradient Compression	NetSenseML: Network-Adaptive Compression for Efficient Distributed Machine Learning
	SQ-DeAR: Sparsified and Quantized Gradient Compression for Distributed Training
Graph Neural Network (GNN)	Cocache: An Accurate And Low-overhead Dynamic Caching Method for GNNs
Graph Neural Networks	DCI: An Efficient Workload-Aware Dual-Cache Allocation GNN Inference Acceleration System
graph partitioning	SimPart: A Simple Yet Effective Replication-aided Partitioning Algorithm for Logic Simulation on GPU
Graph Processing	CGP-Graphless: Towards Efficient Serverless Graph Processing via CPU-GPU Pipelined Collaboration
	Wedge-Parallel Triangle Counting for GPUs
	Millibenchmarking: Using Graph Sampling for Ranking GPU PageRank Implementations
Graph Sampling	Millibenchmarking: Using Graph Sampling for Ranking GPU PageRank Implementations
GraphBLAS	Efficient handling of sparse vectors for parallel nonblocking execution in GraphBLAS
Graphics Processing Unit (GPU)	Millibenchmarking: Using Graph Sampling for Ranking GPU PageRank Implementations
Green Computing	Comparative Analysis of Energy Efficiency in Actor-Based Applications in Distributed Environments
	Green Energy Aware Scheduling of Scientific Workflows with Flexible Deadlines
	Time-related effects in the measurement of energy consumption in evolutionary algorithms
Green's Function	DiffNO: Neural Operator Learning using Physically Structured Constrained Diffusion Model
Green500	Analysis of the carbon footprint of HPC
Grid’5000	Auction-based Placement of Functions in the Fog at Scale
GROMACS	Performance optimization of GROMACS on modern Hardware
H
hardware acceleration	SkipNZ: Non-Zero Value Skipping for Efficient CNN Acceleration
	Open, cross-architecture acceleration of data analytics with SYCL and RISC-V
Hardware Accelerator	Portable and Scalable FPGA Emulation of a Massive-Parallel Vector Processor
Hardware overprovisioning	KarmaPM: Reward-Driven Power Manager Power Scheduling on Multicore Multiprocessor Systems for Maximizing Throughput and Fairness
Hardware-Efficient Inference	Light-DiT: An Importance-Aware Dynamic Compression Framework for Diffusion Transformers
Heterogeneous	SIMON: A Simple Monitoring Framework for Heterogeneous Application Observability
Heterogeneous architecture	Scalable Code Generation for RTL Simulation of Deep Learning Accelerators with MLIR
	FDHA: Fusion-Driven Heterogeneous Accelerator for Efficient Diffusion Model Inference
Heterogeneous computing	CGP-Graphless: Towards Efficient Serverless Graph Processing via CPU-GPU Pipelined Collaboration
	Simplifying distributed workflows: A portable approach for Cloud and HPC
	Exploiting highly heterogenous systems with stencil applications
	OpenDwarfs 2025: Modernizing the OpenDwarfs Benchmark Suite for Heterogeneous Computing
	Heterogeneous computing, storage and network infrastructures for medical applications
Heterogeneous Density Problem	Efficient Pyramidal Analysis of Gigapixel Images on a Decentralized Modest Computer Cluster
High Performance Computing	Analysis of the carbon footprint of HPC
High performance training	H2O: Holistic Hyper-Parameter Optimization for Large-Scale Deep Neural Network Training
High-Performance Computing	ARM SVE Unleashed: Performance and Insights Across HPC Applications on Nvidia Grace
	Design Principles of Dynamic Resource Management for High-Performance Parallel Programming Models
	DCG-DDQ: A Directed Cyclic Graph Based Task Computing System
High-Performance Computing (HPC)	TSim4CXL: Trace-driven Simulation Framework for CXL-based High-Performance Computing Systems
High-performance computing (HPC) systems	A Unified Ontology for Scalable Knowledge Graph–Driven Operational Data Analytics in High-Performance Computing Systems
High-performance numerical computing	Efficient Task Graph Scheduling for Parallel QR Factorization in SLSQP
HPC	THAPI: Tracing Heterogeneous APIs
	Priority-BF: a Task Manager for Priority-Based Scheduling
	An Autonomy Loop for Dynamic HPC Job Time Limit Adjustment
	Scalable Compression of Massive Data Collections on HPC Systems
	Scalable OpenMP Remote Offloading via Asynchronous MPI and Coroutine-Driven Communication
	SWBWA: A Highly Efficient NGS Aligner on the New Sunway Architecture
	HPC Software as a Service: A Flexible Approach to Data Logistics
	Dynamic reconfiguration for malleable applications using RMA
	Dynamic Data Redistribution for Malleable MPI Frameworks through Virtual Topologies
	HPC Benchmark Game: Comparing Programming Languages Regarding Energy-Efficiency for Applications from the HPC Field
	A framework for flooding early warning leveraging AI, HPC, and computing continuum
	Heterogeneous computing, storage and network infrastructures for medical applications
	FLEXI: Scale-resolving simulations of compressible turbulence on modern HPC systems
	Alumet: a modular framework to standardize the measurement of energy consumption
HPC applications	SIMON: A Simple Monitoring Framework for Heterogeneous Application Observability
HPC Cluster	SProBench: Stream Processing Benchmark for High Performance Computing Infrastructure
HPC Edge-To-Cloud	A Holistic Approach to Complexity Management and Multidimensional Analysis in Computing Continuum
HPC workloads	ARC-V: Vertical Resource Adaptivity for HPC Workloads in Containerized Environments
Hybrid DMA-Cache	A Hybrid DMA-Cache Mechanism to Leverage Memory Bandwidth in Massive-Parallel Processors
Hyper-parameter optimization	H2O: Holistic Hyper-Parameter Optimization for Large-Scale Deep Neural Network Training
I
I/O malleability	Comparative Analysis of Algorithms for Malleability Decision-Making in Applications and File Systems
imperfect verification	Partial Detectors Versus Replication To Cope With Silent Errors
Importance Sampling	IAUG: Accelerating Augmentation with Importance Sampling in Deep Neural Network Training
in situ	Priority-BF: a Task Manager for Priority-Based Scheduling
Independent Learning	Accelerating Independent Multi-Agent Reinforcement Learning on Multi-GPU Platforms
Index structures	GECKO: A Write-optimized Hybrid Index based on Disaggregated Memory
Inference	DCI: An Efficient Workload-Aware Dual-Cache Allocation GNN Inference Acceleration System
Inference Acceleration	CoSF: A Co-Optimization Framework for Operator Splitting and Fusion
Inference Optimization	Leveraging Expert Usage to Speed up LLM Inference with Expert Parallelism
Intermediate language	Accelerating SWIRL Workflows: A High-Performance Rust Backend for Distributed Execution
IoT Sensors	On-Device Federated Learning for Remote Alpine Livestock Monitoring
IR	AskLLVM: LLVM Code Generation for GPUs for Graph Algorithms
iterative algorithm	Partial Detectors Versus Replication To Cope With Silent Errors
J
Job Scheduling	WAPA: A Workload-Agnostic CPI-Based Thread-to-Core Allocation Policy
	An Autonomy Loop for Dynamic HPC Job Time Limit Adjustment
K
Kernel pairing	CacheC: LLM-based GPU Cache Management to Enhance Kernel Concurrency
Knowledge Graph (KG)	A Unified Ontology for Scalable Knowledge Graph–Driven Operational Data Analytics in High-Performance Computing Systems
Kubernetes	Bifröst: Peer-to-peer Load-balancing for Function Execution in Agentic AI Systems
	Light Weight Scalable DevOps for Cloud Robotics
	Alumet: a modular framework to standardize the measurement of energy consumption
KV cache	EFIM: Efficient Serving of LLMs for Infilling Tasks with Improved KV Cache Reuse
L
Large deep neural network training	H2O: Holistic Hyper-Parameter Optimization for Large-Scale Deep Neural Network Training
Large Graph	DCI: An Efficient Workload-Aware Dual-Cache Allocation GNN Inference Acceleration System
Large Language Models	Leveraging Expert Usage to Speed up LLM Inference with Expert Parallelism; DynoInfer: Adaptive Resource Orchestration for LLM Inference on Resource-Constrained PCs; CacheC: LLM-based GPU Cache Management to Enhance Kernel Concurrency; Cache Management for Mixture-of-Experts LLMs; ScheInfer: Efficient Inference of Large Language Models with Task Scheduling on Moderate GPUs.
Large-scale graphs	External GPU Biconnected Components
latency detection	Partial Detectors Versus Replication To Cope With Silent Errors
lazy evaluation	Efficient handling of sparse vectors for parallel nonblocking execution in GraphBLAS
LBM	Disaggregated Design for GPU-Based Volumetric Data Structures
Livestock Monitoring	On-Device Federated Learning for Remote Alpine Livestock Monitoring
LLM	Tutoring LLM into a Better CUDA Optimizer
LLM Inference	AlphaSparseTensor: Discovering Faster Sparse Matrix Multiplication Algorithms on GPUs for LLM Inference; Quantifying the Energy Consumption and Carbon Emissions of LLM Inference via Simulations
LLM serving	EFIM: Efficient Serving of LLMs for Infilling Tasks with Improved KV Cache Reuse
LLMs	A Sparsity Predicting Approach for General Large Language Models via Activation Pattern Clustering; Advanced Techniques in Polyhedral Model-Based Compilers for Efficient and Cross-Platform Code Generation on Multicore Processors
LLVM	Noise injection for performance bottleneck analysis; AskLLVM: LLVM Code Generation for GPUs for Graph Algorithms
Load Balancing	Efficient Pyramidal Analysis of Gigapixel Images on a Decentralized Modest Computer Cluster
load-balancing	Bifröst: Peer-to-peer Load-balancing for Function Execution in Agentic AI Systems
log-structured file system	CSGC: Collaborative File System Garbage Collection with Computational Storage
loop fusion	Efficient handling of sparse vectors for parallel nonblocking execution in GraphBLAS
loop tiling	Efficient handling of sparse vectors for parallel nonblocking execution in GraphBLAS
Low-Power GPU Rendering	EAGER: Energy-Aware 3D Gaussian Splatting on Embedded Parallel Heterogeneous Systems
M
Machine Learning	Robustness of deep learning classification to adversarial input on GPUs: asynchronous parallel accumulation is a source of vulnerability; Building Parallel Machine Learning Workflows in PyCOMPSs: The Case Study of Tsunami Forecasting; Boosting Performance of Counting Queries in Machine Learning Applications with a ccNUMA-aware Implementation
Machine Learning Workflows	Enabling Elasticity in Scientific Workflows for High Performance Computing Systems
Malleability	Dynamic Data Redistribution for Malleable MPI Frameworks through Virtual Topologies; Malleability in LAIK with MPI Dynamic Processes and PSets
Medical applications	Heterogeneous computing, storage and network infrastructures for medical applications
Memory Hierarchy	A Hybrid DMA-Cache Mechanism to Leverage Memory Bandwidth in Massive-Parallel Processors
Memory Mapping	Leveraging Expert Usage to Speed up LLM Inference with Expert Parallelism
Memory Resource Provisioning	ARC-V: Vertical Resource Adaptivity for HPC Workloads in Containerized Environments
Memory Saving	Saving Memory via Residual Reduction for DNN Training with Compressed Communication
Metaheuristics	Optimized Parallel Metaheuristics for Big Data Processing on GPUs with Apache Spark
Meteorological model	Mixed precision over GPU applied to a Microphysics model
Mixture-of-Experts	Leveraging Expert Usage to Speed up LLM Inference with Expert Parallelism; CoQMoE: Co-Designed Quantization and Computation Orchestration for Mixture-of-Experts Vision Transformer on FPGA
MLIR	Scalable Code Generation for RTL Simulation of Deep Learning Accelerators with MLIR
Model Compression	Light-DiT: An Importance-Aware Dynamic Compression Framework for Diffusion Transformers
Model Partitioning	ScheInfer: Efficient Inference of Large Language Models with Task Scheduling on Moderate GPUs.
Modelling	Towards Digital Twins of HPC Data Centres Modelling Infrastructure and HPC Systems for IT-Zauber
molecular dynamics	Cyclic Data Streaming on GPUs for Short Range Stencils Applied to Molecular Dynamics
molecular dynamics simulation	Performance optimization of GROMACS on modern Hardware
Monitoring	SIMON: A Simple Monitoring Framework for Heterogeneous Application Observability; Heterogeneous computing, storage and network infrastructures for medical applications
MPI	Making MPI Collective Operations Visible: Understanding Their Utility and Algorithmic Insights; Scalable OpenMP Remote Offloading via Asynchronous MPI and Coroutine-Driven Communication; Dynamic reconfiguration for malleable applications using RMA; Malleability in LAIK with MPI Dynamic Processes and PSets
MPI Collective I/O	ScaleRunner: A Fast MPI-based Random Walk Engine for Multi-CPU Systems
MT-3000	TH-Pulse: A Study on Hardware-Software Co-Designed Framework for LLM Training and Inference on the Tianhe new-generation supercomputer
Multi-Agent Reinforcement Learning	Accelerating Independent Multi-Agent Reinforcement Learning on Multi-GPU Platforms
Multi-DNN accelerators	BATCH-DNN: Adaptive and Dynamic Batching for Multi-DNN Accelerators
Multi-DNN Inference Serving	TopServe: Task-Operator Co-Scheduling for Efficient Multi-DNN Inference Serving on GPUs
Multi-GPU Training	Accelerating Independent Multi-Agent Reinforcement Learning on Multi-GPU Platforms
multi-rail communication	Cyclic Data Streaming on GPUs for Short Range Stencils Applied to Molecular Dynamics
multi-site workflows	Simplifying distributed workflows: A portable approach for Cloud and HPC
Multi-threaded	ParTEE: A Framework for Secure Parallel Computing of RISC-V TEEs
multilinear algebra	Near-optimal contraction strategies for the scalar product in the tensor-train format
N
Neural networks	SONet: Towards Practical Online Neural Network for Enhancing Hard-To-Predict Branches
Neural operators	DiffNO: Neural Operator Learning using Physically Structured Constrained Diffusion Model
Neuromorphic Computing	ReSpike: A Co-Design Framework for Evaluating SNNs on ReRAM-based Neuromorphic Processors
noise injection	Noise injection for performance bottleneck analysis
nonblocking execution	Efficient handling of sparse vectors for parallel nonblocking execution in GraphBLAS
Nonlinear constrained optimization	Efficient Task Graph Scheduling for Parallel QR Factorization in SLSQP
Novel Architectures	ReSpike: A Co-Design Framework for Evaluating SNNs on ReRAM-based Neuromorphic Processors
Nowcasti	A framework for flooding early warning leveraging AI, HPC, and computing continuum
numerical linear algebra	Near-optimal contraction strategies for the scalar product in the tensor-train format
O
Observability	An Autonomy Loop for Dynamic HPC Job Time Limit Adjustment; SIMON: A Simple Monitoring Framework for Heterogeneous Application Observability
Offloading	On-the-fly Performance Analysis of Asynchronous Parallel Execution
Omnitrees	Efficient Anisotropic Mesh Refinement with Omnitrees ...or How to Get Cat GIFs Into Your Paper
On-chip memory	BATCH-DNN: Adaptive and Dynamic Batching for Multi-DNN Accelerators
Online training	SONet: Towards Practical Online Neural Network for Enhancing Hard-To-Predict Branches
OpenCL benchmarks	OpenDwarfs 2025: Modernizing the OpenDwarfs Benchmark Suite for Heterogeneous Computing
OpenMP	Scalable OpenMP Remote Offloading via Asynchronous MPI and Coroutine-Driven Communication; Tracking the Critical Path of Execution for GPU Offloading Applications; On-the-fly Performance Analysis of Asynchronous Parallel Execution
Operational data analytics (ODA)	A Unified Ontology for Scalable Knowledge Graph–Driven Operational Data Analytics in High-Performance Computing Systems
Operation Fusion	CoSF: A Co-Optimization Framework for Operator Splitting and Fusion
Operation Split	CoSF: A Co-Optimization Framework for Operator Splitting and Fusion
Optimal Transport	WAPA: A Workload-Agnostic CPI-Based Thread-to-Core Allocation Policy
Optimizations	Tutoring LLM into a Better CUDA Optimizer
Osteosarcoma	A Computer-aided Framework for Detecting Osteosarcoma in Computed Tomography Scans
Out-of-core processing	External GPU Biconnected Components
oversubscription	CoreWaterfall: a Virtual-Core-Focused Scheduling and Allocation Algorithm for Oversubscribed Virtual Machines
P
PageRank	Millibenchmarking: Using Graph Sampling for Ranking GPU PageRank Implementations
Parallel	Disaggregated Design for GPU-Based Volumetric Data Structures
Parallel algorithms	Interval-Asynchrony: Delimited Intervals of Localised Asynchrony for Fast Parallel SGD; Partitioning In-Place on Massively Parallel Systems
Parallel Branch-and-Bound	A Portable Branch-and-Bound Algorithm for Cross-Architecture Multi-GPU Systems
Parallel Computing	ParTEE: A Framework for Secure Parallel Computing of RISC-V TEEs; Scalable Compression of Massive Data Collections on HPC Systems; Building Parallel Machine Learning Workflows in PyCOMPSs: The Case Study of Tsunami Forecasting
Parallel Computing on GPUs	Wedge-Parallel Triangle Counting for GPUs
parallel computing performance	OpenDwarfs 2025: Modernizing the OpenDwarfs Benchmark Suite for Heterogeneous Computing
Parallel Graph Computations	ScaleRunner: A Fast MPI-based Random Walk Engine for Multi-CPU Systems
Parallel Processing	Quantum Delta Encoding: Optimizing Data Storage on Quantum Computers with Resource Efficiency
Parallel Programming	Polymorphic Higher-Order GPU Kernels; Making MPI Collective Operations Visible: Understanding Their Utility and Algorithmic Insights; Robustness of deep learning classification to adversarial input on GPUs: asynchronous parallel accumulation is a source of vulnerability; DCG-DDQ: A Directed Cyclic Graph Based Task Computing System
Parallel Programming Automation	ParSolGen (Parallel Solvers Generator) - an automated numerical parallel programs generator for distributed memory parallel computers
Parallel Programming Models	Design Principles of Dynamic Resource Management for High-Performance Parallel Programming Models
Parallel SGD	Interval-Asynchrony: Delimited Intervals of Localised Asynchrony for Fast Parallel SGD
Parallel skeletons	Exploiting highly heterogenous systems with stencil applications
parallel-in-time	Cyclic Data Streaming on GPUs for Short Range Stencils Applied to Molecular Dynamics
Parallelism	A Case Study for Resolving Composability Issues Using a Shared CPU Resource Coordinator; Evaluating Energy Efficiency of Genomics Algorithms on Processing-in-Memory Architectures; H2O: Holistic Hyper-Parameter Optimization for Large-Scale Deep Neural Network Training
Parsl	Enabling Elasticity in Scientific Workflows for High Performance Computing Systems
Particle Swarm Optimization	Optimized Parallel Metaheuristics for Big Data Processing on GPUs with Apache Spark
Partitioning	Partitioning In-Place on Massively Parallel Systems
Peer-to-peer Networks	Supervised Distributed Computing
Performance	Portable High-Performance Kernel Generation for a Computational Fluid Dynamics Code with DaCe; Tracking the Critical Path of Execution for GPU Offloading Applications
performance analysis	Making MPI Collective Operations Visible: Understanding Their Utility and Algorithmic Insights; Noise injection for performance bottleneck analysis
Performance Analysis Tools	On-the-fly Performance Analysis of Asynchronous Parallel Execution
Performance evaluation	A Holistic Approach to Complexity Management and Multidimensional Analysis in Computing Continuum; Alumet: a modular framework to standardize the measurement of energy consumption
Performance optimization	Uniform Dense Blocking for Efficient Sparse LU Factorization in First-principles Materials Simulation
Performance Prediction	Millibenchmarking: Using Graph Sampling for Ranking GPU PageRank Implementations
Performance Tuning	Thread Monitoring Tool: transparent characterization of threading patterns with eBPF
Permissioned blockchain framework	Modifying the HyperLedger Fabric Blockchain Architecture to increase throughput and decrease transaction rejections
Phase Analysis	SimPoint+: More Stable, Accurate and Efficient Program Analysis
PMIx	Enabling Elasticity in Scientific Workflows for High Performance Computing Systems
Polyhedral compilation	Advanced Techniques in Polyhedral Model-Based Compilers for Efficient and Cross-Platform Code Generation on Multicore Processors
Portability	Portable High-Performance Kernel Generation for a Computational Fluid Dynamics Code with DaCe; A Portable Branch-and-Bound Algorithm for Cross-Architecture Multi-GPU Systems
Power consumption	Evaluating Energy Efficiency of Genomics Algorithms on Processing-in-Memory Architectures
Privacy-Preserving Machine Learning	Federated Learning in the Edge-Cloud Continuum: A Task-Based Approach with Colony
Processing-in-Memory	Evaluating Energy Efficiency of Genomics Algorithms on Processing-in-Memory Architectures
Program Analysis	SimPoint+: More Stable, Accurate and Efficient Program Analysis
Programming	Tutoring LLM into a Better CUDA Optimizer
Programming Languages	HPC Benchmark Game: Comparing Programming Languages Regarding Energy-Efficiency for Applications from the HPC Field
Programming models	THAPI: Tracing Heterogeneous APIs
Pruning	2:4 Pruning on Edge Devices: Performance, Energy Efficiency and Accuracy
PTX	AskLLVM: LLVM Code Generation for GPUs for Graph Algorithms
Pyramidal Analysis	Efficient Pyramidal Analysis of Gigapixel Images on a Decentralized Modest Computer Cluster
Q
QoS	Auction-based Placement of Functions in the Fog at Scale
quality of service	Auction-based Placement of Functions in the Fog at Scale
Quantization	CoQMoE: Co-Designed Quantization and Computation Orchestration for Mixture-of-Experts Vision Transformer on FPGA
Quantum Algorithm	Quantum Delta Encoding: Optimizing Data Storage on Quantum Computers with Resource Efficiency
Quantum Data Storage	Quantum Delta Encoding: Optimizing Data Storage on Quantum Computers with Resource Efficiency
Quantum Image Processing	Quantum Delta Encoding: Optimizing Data Storage on Quantum Computers with Resource Efficiency
Quantum Signal Processing	Quantum Delta Encoding: Optimizing Data Storage on Quantum Computers with Resource Efficiency
R
Radio Astronomy	Breaking the I/O Barrier: 1.2 Tb/s Ethernet Packet Processing on a GPU
Random Walks	ScaleRunner: A Fast MPI-based Random Walk Engine for Multi-CPU Systems
Real-Time Rendering Performance	EAGER: Energy-Aware 3D Gaussian Splatting on Embedded Parallel Heterogeneous Systems
Recommender system	ServerlessRec: Fast Serverless Inference for Embedding-based Recommender Systems with Disaggregated Memory
Reduced precision	Mixed precision over GPU applied to a Microphysics model
refactoring	SYCL for Energy-Efficient Computational Astrophysics: the case of DPEcho
Rejection Sampling	ScaleRunner: A Fast MPI-based Random Walk Engine for Multi-CPU Systems
Remote Offloading	Scalable OpenMP Remote Offloading via Asynchronous MPI and Coroutine-Driven Communication
Reproducibility	Robustness of deep learning classification to adversarial input on GPUs: asynchronous parallel accumulation is a source of vulnerability
Residual	Saving Memory via Residual Reduction for DNN Training with Compressed Communication
resilience	Partial Detectors Versus Replication To Cope With Silent Errors
Resource Adaptivity	ARC-V: Vertical Resource Adaptivity for HPC Workloads in Containerized Environments
Resource Allocation	Federated Learning within Global Energy Budget over Heterogeneous Edge Accelerators
Resource Management	An Autonomy Loop for Dynamic HPC Job Time Limit Adjustment
Resource usage coordination	A Case Study for Resolving Composability Issues Using a Shared CPU Resource Coordinator
RISC-V	ParTEE: A Framework for Secure Parallel Computing of RISC-V TEEs; Open, cross-architecture acceleration of data analytics with SYCL and RISC-V
RMA	Dynamic reconfiguration for malleable applications using RMA
ROS2	Light Weight Scalable DevOps for Cloud Robotics
RTL simulation	Scalable Code Generation for RTL Simulation of Deep Learning Accelerators with MLIR; SimPart: A Simple Yet Effective Replication-aided Partitioning Algorithm for Logic Simulation on GPU
Run-off	A framework for flooding early warning leveraging AI, HPC, and computing continuum
Runtime Analysis Tools	Tracking the Critical Path of Execution for GPU Offloading Applications
Runtime systems	ParSolGen (Parallel Solvers Generator) - an automated numerical parallel programs generator for distributed memory parallel computers
Rust	Alumet: a modular framework to standardize the measurement of energy consumption; Accelerating SWIRL Workflows: A High-Performance Rust Backend for Distributed Execution
S
SaaS	HPC Software as a Service: A Flexible Approach to Data Logistics
Sampled Simulation	SimPoint+: More Stable, Accurate and Efficient Program Analysis
Scalable	A Holistic Approach to Complexity Management and Multidimensional Analysis in Computing Continuum
Scalable Vector Extension	ARM SVE Unleashed: Performance and Insights Across HPC Applications on Nvidia Grace
scalar product	Near-optimal contraction strategies for the scalar product in the tensor-train format
Scheduling	Scheduling Task and Data Parallelism in Array Languages with Work Assisting; Priority-BF: a Task Manager for Priority-Based Scheduling; Approximation Bounds for SLACK on Identical Parallel Machines; ScheInfer: Efficient Inference of Large Language Models with Task Scheduling on Moderate GPUs.; CoreWaterfall: a Virtual-Core-Focused Scheduling and Allocation Algorithm for Oversubscribed Virtual Machines
scheduling	Experimental Evaluation of Scheduling Strategies for Evolving Workflow-Based Applications
Scheduling and resource management	Design and Operation of Elastic GPU-pooling on Campus
Scientific Workflows	Enabling Elasticity in Scientific Workflows for High Performance Computing Systems; Green Energy Aware Scheduling of Scientific Workflows with Flexible Deadlines
Scientific workflows	Accelerating SWIRL Workflows: A High-Performance Rust Backend for Distributed Execution
Sequencing read alignment	SWBWA: A Highly Efficient NGS Aligner on the New Sunway Architecture
Sequential Least-Squares Quadratic Programming(SLSQP)	Efficient Task Graph Scheduling for Parallel QR Factorization in SLSQP
Serverless	ServerlessRec: Fast Serverless Inference for Embedding-based Recommender Systems with Disaggregated Memory; Auction-based Placement of Functions in the Fog at Scale
Serverless Computing	HAS-GPU: Efficient Hybrid Auto-scaling with Fine-grained GPU Allocation for SLO-aware Serverless Inferences; CGP-Graphless: Towards Efficient Serverless Graph Processing via CPU-GPU Pipelined Collaboration
service-level agreement	Auction-based Placement of Functions in the Fog at Scale
Shared Memory	ParTEE: A Framework for Secure Parallel Computing of RISC-V TEEs; Byzantine-Tolerant Consensus in GPU-Inspired Shared Memory
silent error	Partial Detectors Versus Replication To Cope With Silent Errors
Simulation	Disaggregated Design for GPU-Based Volumetric Data Structures; Comparative Analysis of Algorithms for Malleability Decision-Making in Applications and File Systems; In-Situ Techniques for the Efficient Coupling of Complex Plasma Turbulence Simulations: GENE and GENE-X
Simulation Point	SimPoint+: More Stable, Accurate and Efficient Program Analysis
simulations	FLEXI: Scale-resolving simulations of compressible turbulence on modern HPC systems
Skipping Non-Zero (SkipNZ)	SkipNZ: Non-Zero Value Skipping for Efficient CNN Acceleration
SLA	Auction-based Placement of Functions in the Fog at Scale
SLACK	Approximation Bounds for SLACK on Identical Parallel Machines
Slurm	SProBench: Stream Processing Benchmark for High Performance Computing Infrastructure
SMT Processors	WAPA: A Workload-Agnostic CPI-Based Thread-to-Core Allocation Policy
Software architecture	Alumet: a modular framework to standardize the measurement of energy consumption
Software Development	From Reactive Debugging to Proactive Detection: AI for Performance-Aware Software Development
software reengineering	OpenDwarfs 2025: Modernizing the OpenDwarfs Benchmark Suite for Heterogeneous Computing
Sparse Architectures	SkipNZ: Non-Zero Value Skipping for Efficient CNN Acceleration
Sparse LU factorization	Uniform Dense Blocking for Efficient Sparse LU Factorization in First-principles Materials Simulation
Sparse Matrix Multiplication	AlphaSparseTensor: Discovering Faster Sparse Matrix Multiplication Algorithms on GPUs for LLM Inference
Sparse Tensor Cores	2:4 Pruning on Edge Devices: Performance, Energy Efficiency and Accuracy
sparse vectors	Efficient handling of sparse vectors for parallel nonblocking execution in GraphBLAS
Spiking Neural Network	ReSpike: A Co-Design Framework for Evaluating SNNs on ReRAM-based Neuromorphic Processors
Staleness	Interval-Asynchrony: Delimited Intervals of Localised Asynchrony for Fast Parallel SGD
Statistical Analysis	Optimized Parallel Metaheuristics for Big Data Processing on GPUs with Apache Spark
Stencil	Exploiting highly heterogenous systems with stencil applications
stencil operations	Cyclic Data Streaming on GPUs for Short Range Stencils Applied to Molecular Dynamics
Stream Processing	SProBench: Stream Processing Benchmark for High Performance Computing Infrastructure
Streaming Graph Processing Systems (SGPSs)	A Comparative Study of Streaming Graph Processing Systems
Sub-batching and Sub-batch merging	BATCH-DNN: Adaptive and Dynamic Batching for Multi-DNN Accelerators
Subtoken	EFIM: Efficient Serving of LLMs for Infilling Tasks with Improved KV Cache Reuse
Sunway architecture	SWBWA: A Highly Efficient NGS Aligner on the New Sunway Architecture
supercomputers	Targeted data movement optimizations for emerging heterogeneous supercomputers
Sustainable AI	Federated Learning within Global Energy Budget over Heterogeneous Edge Accelerators
sustainable computing	Quantifying the Energy Consumption and Carbon Emissions of LLM Inference via Simulations
Swirl	Accelerating SWIRL Workflows: A High-Performance Rust Backend for Distributed Execution
SYCL	Open, cross-architecture acceleration of data analytics with SYCL and RISC-V
system throughput	KarmaPM: Reward-Driven Power Manager Power Scheduling on Multicore Multiprocessor Systems for Maximizing Throughput and Fairness
Systems	A Holistic Approach to Complexity Management and Multidimensional Analysis in Computing Continuum
Systems for Machine Learning	NetSenseML: Network-Adaptive Compression for Efficient Distributed Machine Learning
T
Task graph	ParSolGen (Parallel Solvers Generator) - an automated numerical parallel programs generator for distributed memory parallel computers
Task graph computing system	DCG-DDQ: A Directed Cyclic Graph Based Task Computing System
task graph parallelism	SimPart: A Simple Yet Effective Replication-aided Partitioning Algorithm for Logic Simulation on GPU
Task-Based Programming	Federated Learning in the Edge-Cloud Continuum: A Task-Based Approach with Colony
Task-Operator Co-Scheduling	TopServe: Task-Operator Co-Scheduling for Efficient Multi-DNN Inference Serving on GPUs
Task-parallel linear algebra computations	Efficient Task Graph Scheduling for Parallel QR Factorization in SLSQP
Tasking	On-the-fly Performance Analysis of Asynchronous Parallel Execution
TBA1	WebAssembly and Unikernels: A Comparative Study for Serverless at the Edge; Performance Analysis of Container-in-VM Architectures: A Study on Hypervisor Isolation and Lightweight OS Integration; Enabling RDMA and GPUs in Rootless Kubernetes for Accelerated HPC and AI Applications
TBA2	WebAssembly and Unikernels: A Comparative Study for Serverless at the Edge; Performance Analysis of Container-in-VM Architectures: A Study on Hypervisor Isolation and Lightweight OS Integration; Enabling RDMA and GPUs in Rootless Kubernetes for Accelerated HPC and AI Applications
TBA3	WebAssembly and Unikernels: A Comparative Study for Serverless at the Edge; Performance Analysis of Container-in-VM Architectures: A Study on Hypervisor Isolation and Lightweight OS Integration; Enabling RDMA and GPUs in Rootless Kubernetes for Accelerated HPC and AI Applications
tensor contraction ordering	Near-optimal contraction strategies for the scalar product in the tensor-train format
tensor decomposition	Near-optimal contraction strategies for the scalar product in the tensor-train format
tensor-train decomposition	Near-optimal contraction strategies for the scalar product in the tensor-train format
Termination	Fault-Tolerant Distributed Federated Learning with Adaptive Termination Detection
testbed	Auction-based Placement of Functions in the Fog at Scale
Thread-to-Core Allocation Policies	WAPA: A Workload-Agnostic CPI-Based Thread-to-Core Allocation Policy
Tianhe new-generation supercomputer	TH-Pulse: A Study on Hardware-Software Co-Designed Framework for LLM Training and Inference on the Tianhe new-generation supercomputer
Top500	Analysis of the carbon footprint of HPC
Trace-driven Simulation	TSim4CXL: Trace-driven Simulation Framework for CXL-based High-Performance Computing Systems
Tracing and monitoring	THAPI: Tracing Heterogeneous APIs
transfer learning	Container Workload Prediction Using Deep Domain Adaptation in Transfer Learning; A Computer-aided Framework for Detecting Osteosarcoma in Computed Tomography Scans
Transform code	Tutoring LLM into a Better CUDA Optimizer
Transformer	A framework for flooding early warning leveraging AI, HPC, and computing continuum; TH-Pulse: A Study on Hardware-Software Co-Designed Framework for LLM Training and Inference on the Tianhe new-generation supercomputer
Triangle Counting	Wedge-Parallel Triangle Counting for GPUs
Trusted Execution Environment	ParTEE: A Framework for Secure Parallel Computing of RISC-V TEEs
Tsunami Forecasting	Building Parallel Machine Learning Workflows in PyCOMPSs: The Case Study of Tsunami Forecasting
V
Vector Processor	A Hybrid DMA-Cache Mechanism to Leverage Memory Bandwidth in Massive-Parallel Processors; Portable and Scalable FPGA Emulation of a Massive-Parallel Vector Processor
Vector Unit	ARM SVE Unleashed: Performance and Insights Across HPC Applications on Nvidia Grace
Vertical scaling	ARC-V: Vertical Resource Adaptivity for HPC Workloads in Containerized Environments; HAS-GPU: Efficient Hybrid Auto-scaling with Fine-grained GPU Allocation for SLO-aware Serverless Inferences; CGP-Graphless: Towards Efficient Serverless Graph Processing via CPU-GPU Pipelined Collaboration
virtual machines	CoreWaterfall: a Virtual-Core-Focused Scheduling and Allocation Algorithm for Oversubscribed Virtual Machines
virtual topologies	Dynamic Data Redistribution for Malleable MPI Frameworks through Virtual Topologies
Virtualization	Design and Operation of Elastic GPU-pooling on Campus
Vision Transformer	CoQMoE: Co-Designed Quantization and Computation Orchestration for Mixture-of-Experts Vision Transformer on FPGA
Visualization	Exploring Flow Fields at Scale: GPU-Accelerated Scientific Visualization for Exascale CFD
W
Weather Radar	A framework for flooding early warning leveraging AI, HPC, and computing continuum
Wedge-Parallel Approaches	Wedge-Parallel Triangle Counting for GPUs
WHPC	Targeted data movement optimizations for emerging heterogeneous supercomputers
Workflows	Building Parallel Machine Learning Workflows in PyCOMPSs: The Case Study of Tsunami Forecasting
workload prediction	Container Workload Prediction Using Deep Domain Adaptation in Transfer Learning
Write optimization	GECKO: A Write-optimized Hybrid Index based on Disaggregated Memory