Talk Keyword Index

This page indexes talks by author-provided keyword.

A
Accelerator	Watt: A Write-optimized RRAM-based Accelerator for Attention
Accelerators	MEPAD: A Memory-efficient Parallelized Direct Convolution Algorithm for Deep Neural Networks
Ad-Hoc file system	Fault tolerant in the Expand Ad-Hoc parallel file system (Artifact)
adaptive-precision	Reduced-Precision and Reduced-Exponent Formats for Adaptive-Precision Sparse Matrix-Vector Product
Address translation	EKRM: Efficient Key-Value Retrieval Method to Reduce Data Lookup Overhead for Redis
Adversarial Attack	Disttack: Graph Adversarial Attacks Toward Distributed GNN Training
AGCM	Pipe-AGCM: A Fine-grain Pipelining Scheme for Optimizing the Parallel Atmospheric General Circulation Model
AI Accelerators	WActiGrad: Structured Pruning for Efficient Finetuning and Inference of Large Language Models on AI Accelerators
AIOps	LogRCA: Log-based Root Cause Analysis for Distributed Services
Alternative Basis Method	Communication Minimizing Toom-Cook Algorithms
Application workflows	Making easier the life-cycle management of complex application workflows
approximate spanning tree	Boolean Matrix Multiplication for Highly Clustered Data on the Congested Clique
Approximation	Solving the Restricted Assignment Problem to Schedule Multi-Get Requests in Key-Value Stores (Artifact)
Approximation algorithm	A 1.25(1+ε)-Approximation Algorithm for Scheduling with Rejection Costs Proportional to Processing Times (Artifact)
Approximation algorithms	Makespan Minimization for Scheduling on Heterogeneous Platforms with Precedence Constraints
	QClique: Optimizing Performance and Accuracy in Maximum Weighted Clique
ARMv8-A (NEON)	Inference with Transformer Encoders on ARM and RISC-V Multicore Processors
Assembly generation	Exploring processor micro-architectures optimised for BLAS3 micro-kernels
Asynchronous Federated Learning	A Joint Approach to Local Updating and Gradient Compression for Efficient Asynchronous Federated Learning
Attention	Watt: A Write-optimized RRAM-based Accelerator for Attention
Attention Importance	ADE-HGNN: Accelerating HGNNs through Attention Disparity Exploitation
Auto-tuning	Bringing auto-tuning to HIP: Analysis of tuning impact and difficulty on AMD and Nvidia GPUs (Artifact)
Automated tool generation	A Mechanism to Generate Interception Based Tools for HPC Libraries
Automatic Dimension Reduction	Efficient Code Region Characterization through Automatic Performance Counters Reduction using Machine Learning Techniques
B
Backdoor watermark	VeriChroma: Ownership Verification for Federated Models via RGB Filters
backtracking	Investigating Portability in Chapel for Tree-based Optimization on GPU-powered Clusters
Batch scheduling resource allocation	Evaluation of CPU constraining mechanisms in the LHC ALICE experiment Grid
Benchmarking	Deconstructing HPL-MxP benchmark: a numerical perspective
Bilinear Algorithms	Communication Minimizing Toom-Cook Algorithms
Bit Flipping Key Encapsulation	A Folded Computation-in-Memory Accelerator for Fast Polynomial Multiplication in BIKE
Blockchains	Towards High-Performance Transactions via Hierarchical Blockchain Sharding
Boolean matrix multiplication	Boolean Matrix Multiplication for Highly Clustered Data on the Congested Clique
Breadth-First Search	GPU-Accelerated BFS for Dynamic Networks
Byzantine robustness	Lightweight Byzantine-Robust and Privacy-Preserving Federated Learning
C
C++ Coroutine	TaroRTL: Accelerating RTL Simulation using Coroutine-based Heterogeneous Task Graph Scheduling
Cache	GPU Cache System for COMPSs: A Task-Based Distributed Computing Framework
Cache Efficiency	Vectorizing Sparse Blocks of Graph Matrices for SpMV
Cache Side-channel Attack	Efficient RNIC Cache Side-channel Attack Detection through DPU-driven Architecture
CGRA Mapping	ImageMap: Enabling Efficient Mapping from Image Processing DSL to CGRA
Chained memory access	EKRM: Efficient Key-Value Retrieval Method to Reduce Data Lookup Overhead for Redis
chapel	Investigating Portability in Chapel for Tree-based Optimization on GPU-powered Clusters
Checkpoint	AdapCK: Optimizing I/O for Checkpointing on Large-scale High Performance Computing Systems
Checkpointing	Combining Compression and Prefetching to Improve Checkpointing for Inverse Seismic Problems in GPUs
Classifier Retraining	Improving Generalization and Personalization in Long-Tailed Federated Learning via Classifier Retraining
Cloud applications	sAirflow: Adopting Serverless in a Legacy Workflow Scheduler
Cloud bursting	Automated Data Management and Learning-based Scheduling for Ray-based Hybrid HPC-Cloud Systems
Cloud Computing	Context-aware Runtime Type Prediction for Heterogeneous Microservices
	Cloud-native GPU-enabled architecture for parallel video encoding
	DProbe: Profiling and Predicting Multi-Tenant Deep Learning Workloads for GPU Resource Scaling
	A Framework for Automated Parallel Execution of Scientific Multi-Workflow Applications in the Cloud with Work Stealing
Cloud migration	sAirflow: Adopting Serverless in a Legacy Workflow Scheduler
Coarse-grained Reconfigurable Array	ImageMap: Enabling Efficient Mapping from Image Processing DSL to CGRA
Code Generation	Code Generation for Octree-Based Multigrid Solvers with Fused Higher-Order Interpolation and Communication
Coded Distributed Computation	Asymmetric Coded Distributed Computation for Resilient Prediction Serving Systems
collective i/o	A High-Performance Collective I/O Framework Leveraging Node-Local Persistent Memory
Columnar data format	Parallel Writing of Nested Data in Columnar Formats (Artifact)
Combinatorial optimization	QClique: Optimizing Performance and Accuracy in Maximum Weighted Clique
communication-computation overlap	Pipe-AGCM: A Fine-grain Pipelining Scheme for Optimizing the Parallel Atmospheric General Circulation Model
Compiler	ImageMap: Enabling Efficient Mapping from Image Processing DSL to CGRA
Computation-in-Memory	A Folded Computation-in-Memory Accelerator for Fast Polynomial Multiplication in BIKE
Computer arithmetic	Deconstructing HPL-MxP benchmark: a numerical perspective
Computer Engineering	E4 at the forefront of European HPC
Concurrency	A Fast Wait-Free Solution to Read-Reclaim Races in Reference Counting (Artifact)
	FlexiGran: Flexible Granularity Locking in Hierarchies
concurrent data structures	How to Relax Instantly: Elastic Relaxation of Concurrent Data Structures (Artifact)
congested clique	Boolean Matrix Multiplication for Highly Clustered Data on the Congested Clique
Congestion control	Hybrid Congestion Control for BXI-based Interconnection Networks
Connected components	ALZI: An Improved Parallel Algorithm for Finding Connected Components in Large Graphs
Constraint Programming	Optimizing Service Replication and Placement for IoT Applications in Fog Computing Systems
Convolutional neural networks	MEPAD: A Memory-efficient Parallelized Direct Convolution Algorithm for Deep Neural Networks
Cost-effectiveness	PriCE: Privacy-Preserving and Cost-Effective Scheduling for Parallelizing the Large Medical Image Processing Workflow over Hybrid Clouds
CPU allocation techniques	Evaluation of CPU constraining mechanisms in the LHC ALICE experiment Grid
Cross-Search	CSIMD: Cross-Search Algorithm with Improved Multi-Dimensional Dichotomy for Micro-batch-based Pipeline Parallel Training in DNN
Cross-shard transaction	Towards High-Performance Transactions via Hierarchical Blockchain Sharding
Cuckoo hashing	Compact Parallel Hash Tables on the GPU (Artifact)
CUDA	Compact Parallel Hash Tables on the GPU (Artifact)
	Bringing auto-tuning to HIP: Analysis of tuning impact and difficulty on AMD and Nvidia GPUs (Artifact)
D
Data centers	Hybrid Congestion Control for BXI-based Interconnection Networks
Data compression	Combining Compression and Prefetching to Improve Checkpointing for Inverse Seismic Problems in GPUs
Data movement	Automated Data Management and Learning-based Scheduling for Ray-based Hybrid HPC-Cloud Systems
Data movement strategies	Harnessing Data Movement Strategies to Optimize Performance-Energy Efficiency of Oil & Gas Simulations in HPC
Data Stream Processing	Optimizing Service Replication and Placement for IoT Applications in Fog Computing Systems
Data-preprocessing	On the use of hybrid computing for accelerating EEG preprocessing
Deep learning	Inference with Transformer Encoders on ARM and RISC-V Multicore Processors
Deep neural network	VeriChroma: Ownership Verification for Federated Models via RGB Filters
DeepFake detection	FakeGuard: A Novel Accelerator Architecture for Deepfake Detection Networks
Dense matrix-matrix multiplication	Exploring processor micro-architectures optimised for BLAS3 micro-kernels
Design-space explorations	Exploring processor micro-architectures optimised for BLAS3 micro-kernels
Developing and deploying HPC and AI/ML applications	ParaTools Pro for E4S
Differentiated Services	Distributed Simulation for Digital Twins of Large-Scale Real-World DiffServ-Based Networks
dimension reduction	Boolean Matrix Multiplication for Highly Clustered Data on the Congested Clique
Distributed aggregation	Lightweight Byzantine-Robust and Privacy-Preserving Federated Learning
Distributed Computing	GPU Cache System for COMPSs: A Task-Based Distributed Computing Framework
Distributed machine Learning	Lightweight Byzantine-Robust and Privacy-Preserving Federated Learning
Distributed Systems	MPR: An MPI Framework for Distributed Self-Adaptive Stream Processing
Distributed Training	Quartet: A Holistic Hybrid Parallel Framework for Training Large Language Models
	Disttack: Graph Adversarial Attacks Toward Distributed GNN Training
DMTCP	AdapCK: Optimizing I/O for Checkpointing on Large-scale High Performance Computing Systems
DNN accelerator	FakeGuard: A Novel Accelerator Architecture for Deepfake Detection Networks
Domain-specific Language	Code Generation for Octree-Based Multigrid Solvers with Fused Higher-Order Interpolation and Communication
Dominators	FlexiGran: Flexible Granularity Locking in Hierarchies
DPU	Efficient RNIC Cache Side-channel Attack Detection through DPU-driven Architecture
Dynamic Frontier approach	DF* PageRank: Incrementally Expanding Approaches for Updating PageRank on Dynamic Graphs (Artifact)
Dynamic networks	GPU-Accelerated BFS for Dynamic Networks
E
Earliest Deadline First scheduling scheme	Deadline-driven Enhancements and Response Time Analysis of ROS2 Multi-threaded Executors
EDF	On the use of hybrid computing for accelerating EEG preprocessing
Edge Computing	Resource-Aware Heterogeneous Federated Learning with Specialized Local Models
Edge technologies	Supporting HPC Centers: challenges, horror stories and best practices
EEG	On the use of hybrid computing for accelerating EEG preprocessing
Energy	PCTC: Hardware and Software Co-Design for Pruned Capsule Networks on Tensor Cores
Energy minimization	A 1.25(1+ε)-Approximation Algorithm for Scheduling with Rejection Costs Proportional to Processing Times (Artifact)
ensemble simulation	Efficient Coupling Streaming AI and Ensemble Simulations on HPC Clusters
Epilepsy	On the use of hybrid computing for accelerating EEG preprocessing
European HPC	E4 at the forefront of European HPC
Expand Ad-Hoc	Fault tolerant in the Expand Ad-Hoc parallel file system (Artifact)
Explicit Sharing	ImSPU: Implicit Sharing of Computation Resources between Vector and Scalar Processing Units
Extreme-scale Scientific Software Stack	ParaTools Pro for E4S
F
FaaS	sAirflow: Adopting Serverless in a Legacy Workflow Scheduler
Fast Long Integer Multiplication	Communication Minimizing Toom-Cook Algorithms
Fault tolerance	AdapCK: Optimizing I/O for Checkpointing on Large-scale High Performance Computing Systems
	Fault tolerant in the Expand Ad-Hoc parallel file system (Artifact)
Federated Learning	Resource-Aware Heterogeneous Federated Learning with Specialized Local Models
	Optimizing Federated Learning using Remote Embeddings for Graph Neural Networks
	FLUK: Protecting Federated Learning against Malicious Clients for Internet of Vehicles
	Improving Generalization and Personalization in Long-Tailed Federated Learning via Classifier Retraining
	FedGG: Leveraging Generative Adversarial Networks and Gradient Smoothing for Privacy Protection in Federated Learning
	VeriChroma: Ownership Verification for Federated Models via RGB Filters
	Lightweight Byzantine-Robust and Privacy-Preserving Federated Learning
FedGNNs	Optimizing Federated Learning using Remote Embeddings for Graph Neural Networks
FFT	On the use of hybrid computing for accelerating EEG preprocessing
Fine-grained/hierarchical locking	FlexiGran: Flexible Granularity Locking in Hierarchies
floating-point arithmetic	Reduced-Precision and Reduced-Exponent Formats for Adaptive-Precision Sparse Matrix-Vector Product
Fog Computing	Optimizing Service Replication and Placement for IoT Applications in Fog Computing Systems
Folded Mapping Strategy	A Folded Computation-in-Memory Accelerator for Fast Polynomial Multiplication in BIKE
FPGA	Optimizing Communication for Latency Sensitive HPC Applications on up to 48 FPGAs Using ACCL
	Efficient RNIC Cache Side-channel Attack Detection through DPU-driven Architecture
	Pre-Scheduling of Affine Loops for HLS Pipelining
Fully Homomorphic Encryption	Accelerating Stencil Computation with Fully Homomorphic Encryption Using GPU
Function-as-a-Service	sAirflow: Adopting Serverless in a Legacy Workflow Scheduler
G
gem5 simulations	Exploring processor micro-architectures optimised for BLAS3 micro-kernels
Generative adversarial networks	FedGG: Leveraging Generative Adversarial Networks and Gradient Smoothing for Privacy Protection in Federated Learning
genome analysis	(re)Assessing PiM Effectiveness for Sequence Alignment
GNN inference	GDL-GNN: Applying GPU Dataloading of Large Datasets for Graph Neural Network Inference
GPU	Harnessing Data Movement Strategies to Optimize Performance-Energy Efficiency of Oil & Gas Simulations in HPC
	Cloud-native GPU-enabled architecture for parallel video encoding
	GPU-Accelerated BFS for Dynamic Networks
	On the use of hybrid computing for accelerating EEG preprocessing
	GPU Cache System for COMPSs: A Task-Based Distributed Computing Framework
	Accelerating Stencil Computation with Fully Homomorphic Encryption Using GPU
	Mixed precision randomized low-rank approximation with GPU tensor cores
GPU architecture	Predicting GPU kernel's performance on upcoming architectures
gpu computing	Investigating Portability in Chapel for Tree-based Optimization on GPU-powered Clusters
GPU Programming	Bringing auto-tuning to HIP: Analysis of tuning impact and difficulty on AMD and Nvidia GPUs (Artifact)
Gradient Compression	A Joint Approach to Local Updating and Gradient Compression for Efficient Asynchronous Federated Learning
Graph algorithms	ALZI: An Improved Parallel Algorithm for Finding Connected Components in Large Graphs
	QClique: Optimizing Performance and Accuracy in Maximum Weighted Clique
Graph Learning	Context-aware Runtime Type Prediction for Heterogeneous Microservices
Graph Neural Network	Disttack: Graph Adversarial Attacks Toward Distributed GNN Training
Graph Neural Networks	Optimizing Federated Learning using Remote Embeddings for Graph Neural Networks
Graph partition	GDL-GNN: Applying GPU Dataloading of Large Datasets for Graph Neural Network Inference
Grid computing resource management	Evaluation of CPU constraining mechanisms in the LHC ALICE experiment Grid
H
Hamming space	Boolean Matrix Multiplication for Highly Clustered Data on the Congested Clique
Hash function	EKRM: Efficient Key-Value Retrieval Method to Reduce Data Lookup Overhead for Redis
Heterogeneity	Context-aware Runtime Type Prediction for Heterogeneous Microservices
Heterogeneous Graph Neural Network	ADE-HGNN: Accelerating HGNNs through Attention Disparity Exploitation
heterogeneous platforms	Makespan Minimization for Scheduling on Heterogeneous Platforms with Precedence Constraints
HGNN Accelerator	ADE-HGNN: Accelerating HGNNs through Attention Disparity Exploitation
Hierarchical data structures	FlexiGran: Flexible Granularity Locking in Hierarchies
Hierarchical sharding	Towards High-Performance Transactions via Hierarchical Blockchain Sharding
High Energy Physics	Parallel Writing of Nested Data in Columnar Formats (Artifact)
High performance	Inference with Transformer Encoders on ARM and RISC-V Multicore Processors
high performance computing	Deconstructing HPL-MxP benchmark: a numerical perspective
	Reduced-Precision and Reduced-Exponent Formats for Adaptive-Precision Sparse Matrix-Vector Product
High-Level Synthesis	Pre-Scheduling of Affine Loops for HLS Pipelining
High-Performance Computing	PEANUTS: A Persistent Memory-Based Network Unilateral Transfer System for Enhanced MPI-IO Data Transfer (Artifact)
	Hybrid Congestion Control for BXI-based Interconnection Networks
	QClique: Optimizing Performance and Accuracy in Maximum Weighted Clique
	AdapCK: Optimizing I/O for Checkpointing on Large-scale High Performance Computing Systems
	Combining Compression and Prefetching to Improve Checkpointing for Inverse Seismic Problems in GPUs
high-productivity	Investigating Portability in Chapel for Tree-based Optimization on GPU-powered Clusters
HIP	Bringing auto-tuning to HIP: Analysis of tuning impact and difficulty on AMD and Nvidia GPUs (Artifact)
HLS	Optimizing Communication for Latency Sensitive HPC Applications on up to 48 FPGAs Using ACCL
HoL Blocking	Hybrid Congestion Control for BXI-based Interconnection Networks
HPC	Optimizing Communication for Latency Sensitive HPC Applications on up to 48 FPGAs Using ACCL
	GPU Cache System for COMPSs: A Task-Based Distributed Computing Framework
	Automated Data Management and Learning-based Scheduling for Ray-based Hybrid HPC-Cloud Systems
	Light-weight prediction for improving energy consumption in HPC platforms (Artifact)
	Scheduling distributed I/O resources in HPC systems
	Supporting HPC Centers: challenges, horror stories and best practices
	Making easier the life-cycle management of complex application workflows
hpc io	A High-Performance Collective I/O Framework Leveraging Node-Local Persistent Memory
HPC-AI workflow	Efficient Coupling Streaming AI and Ensemble Simulations on HPC Clusters
HTTP Adaptive Streaming	Cloud-native GPU-enabled architecture for parallel video encoding
Huge page	EKRM: Efficient Key-Value Retrieval Method to Reduce Data Lookup Overhead for Redis
Hybrid Clouds	PriCE: Privacy-Preserving and Cost-Effective Scheduling for Parallelizing the Large Medical Image Processing Workflow over Hybrid Clouds
Hybrid Parallelism	Quartet: A Holistic Hybrid Parallel Framework for Training Large Language Models
I
I/O Complexity	Communication Minimizing Toom-Cook Algorithms
I/O forwarding	Scheduling distributed I/O resources in HPC systems
Iceberg hashing	Compact Parallel Hash Tables on the GPU (Artifact)
Image Processing	ImageMap: Enabling Efficient Mapping from Image Processing DSL to CGRA
Implicit Sharing	ImSPU: Implicit Sharing of Computation Resources between Vector and Scalar Processing Units
Importance	Watt: A Write-optimized RRAM-based Accelerator for Attention
Improved Multi-Dimensional Dichotomy	CSIMD: Cross-Search Algorithm with Improved Multi-Dimensional Dichotomy for Micro-batch-based Pipeline Parallel Training in DNN
Industrial Control	Node Bundle Scheduling: An Ultra-Low Latency Traffic Scheduling Algorithm for TAS-based Time-Sensitive Networks
Inference	Asymmetric Coded Distributed Computation for Resilient Prediction Serving Systems
	Inference with Transformer Encoders on ARM and RISC-V Multicore Processors
Injection throttling	Hybrid Congestion Control for BXI-based Interconnection Networks
Instruction-Set Architecture	ImSPU: Implicit Sharing of Computation Resources between Vector and Scalar Processing Units
Intel Data Center GPU	ESIMD GPU implementations of Deep Learning Sparse Matrix Kernels
inter-FPGA communication	Optimizing Communication for Latency Sensitive HPC Applications on up to 48 FPGAs Using ACCL
Interconnection networks	Hybrid Congestion Control for BXI-based Interconnection Networks
Internet of Vehicles	FLUK: Protecting Federated Learning against Malicious Clients for Internet of Vehicles
Intervals	Solving the Restricted Assignment Problem to Schedule Multi-Get Requests in Key-Value Stores (Artifact)
IoT applications	Optimizing Service Replication and Placement for IoT Applications in Fog Computing Systems
Iterative solver	Deconstructing HPL-MxP benchmark: a numerical perspective
J
Joint Optimization	A Joint Approach to Local Updating and Gradient Compression for Efficient Asynchronous Federated Learning
K
Key-Value store	EKRM: Efficient Key-Value Retrieval Method to Reduce Data Lookup Overhead for Redis
Key-Value Stores	Solving the Restricted Assignment Problem to Schedule Multi-Get Requests in Key-Value Stores (Artifact)
Kubernetes	Cloud-native GPU-enabled architecture for parallel video encoding
L
Large DNN	CSIMD: Cross-Search Algorithm with Improved Multi-Dimensional Dichotomy for Micro-batch-based Pipeline Parallel Training in DNN
Large Language Models	Quartet: A Holistic Hybrid Parallel Framework for Training Large Language Models
	WActiGrad: Structured Pruning for Efficient Finetuning and Inference of Large Language Models on AI Accelerators
Linear algebra	Deconstructing HPL-MxP benchmark: a numerical perspective
LLM	OMPGPT: A Generative Pre-trained Transformer Model for OpenMP
Load Balance	Vectorizing Sparse Blocks of Graph Matrices for SpMV
Local Update	A Joint Approach to Local Updating and Gradient Compression for Efficient Asynchronous Federated Learning
Locality-awareness	Towards High-Performance Transactions via Hierarchical Blockchain Sharding
lock-free	How to Relax Instantly: Elastic Relaxation of Concurrent Data Structures (Artifact)
Log Analysis	LogRCA: Log-based Root Cause Analysis for Distributed Services
Long-tailed and Non-IID Data	Improving Generalization and Personalization in Long-Tailed Federated Learning via Classifier Retraining
loop pipelining	Pre-Scheduling of Affine Loops for HLS Pipelining
low-rank approximations	Mixed precision randomized low-rank approximation with GPU tensor cores
LU factorization	Accelerating Large-Scale Sparse LU Factorization for RF Circuit Simulation
M
Machine Learning	Asymmetric Coded Distributed Computation for Resilient Prediction Serving Systems
	ESIMD GPU implementations of Deep Learning Sparse Matrix Kernels
	Light-weight prediction for improving energy consumption in HPC platforms (Artifact)
Manage and launch multi-node multi-user clusters	ParaTools Pro for E4S
Matrix multiplication	Inference with Transformer Encoders on ARM and RISC-V Multicore Processors
Matrix-vector multiplication	PCTC: Hardware and Software Co-Design for Pruned Capsule Networks on Tensor Cores
Maximum weighted clique	QClique: Optimizing Performance and Accuracy in Maximum Weighted Clique
Medical Image Splitting	PriCE: Privacy-Preserving and Cost-Effective Scheduling for Parallelizing the Large Medical Image Processing Workflow over Hybrid Clouds
Memory optimization	MEPAD: A Memory-efficient Parallelized Direct Convolution Algorithm for Deep Neural Networks
memory systems	(re)Assessing PiM Effectiveness for Sequence Alignment
Micro-batch-based Data Parallelism	CSIMD: Cross-Search Algorithm with Improved Multi-Dimensional Dichotomy for Micro-batch-based Pipeline Parallel Training in DNN
minimum spanning tree	Boolean Matrix Multiplication for Highly Clustered Data on the Congested Clique
Mixed precision	Deconstructing HPL-MxP benchmark: a numerical perspective
mixed precision algorithms	Mixed precision randomized low-rank approximation with GPU tensor cores
mixed-precision	Reduced-Precision and Reduced-Exponent Formats for Adaptive-Precision Sparse Matrix-Vector Product
ML Ensembles	Efficient Code Region Characterization through Automatic Performance Counters Reduction using Machine Learning Techniques
MLIR	Pre-Scheduling of Affine Loops for HLS Pipelining
Model watermarking	VeriChroma: Ownership Verification for Federated Models via RGB Filters
MPI	MPR: An MPI Framework for Distributed Self-Adaptive Stream Processing
MPI-IO	PEANUTS: A Persistent Memory-Based Network Unilateral Transfer System for Enhanced MPI-IO Data Transfer (Artifact)
Multi-Get	Solving the Restricted Assignment Problem to Schedule Multi-Get Requests in Key-Value Stores (Artifact)
Multi-GPU	GDL-GNN: Applying GPU Dataloading of Large Datasets for Graph Neural Network Inference
Multi-Objective Optimization	PriCE: Privacy-Preserving and Cost-Effective Scheduling for Parallelizing the Large Medical Image Processing Workflow over Hybrid Clouds
Multicore processors	Inference with Transformer Encoders on ARM and RISC-V Multicore Processors
Multigrid	Code Generation for Octree-Based Multigrid Solvers with Fused Higher-Order Interpolation and Communication
Multithreading	Parallel Writing of Nested Data in Columnar Formats (Artifact)
N
near-data processing	(re)Assessing PiM Effectiveness for Sequence Alignment
Network Digital Twin	Distributed Simulation for Digital Twins of Large-Scale Real-World DiffServ-Based Networks
Neural Architecture Search	Resource-Aware Heterogeneous Federated Learning with Specialized Local Models
Neural Network	Athena: Add More Intelligence to RMT-based Network Data Plane with Low-bit Quantization
number of rounds	Boolean Matrix Multiplication for Highly Clustered Data on the Congested Clique
O
object storage targets	Scheduling distributed I/O resources in HPC systems
Octree	Code Generation for Octree-Based Multigrid Solvers with Fused Higher-Order Interpolation and Communication
Oil & Gas Exploration	Harnessing Data Movement Strategies to Optimize Performance-Energy Efficiency of Oil & Gas Simulations in HPC
One-Sided Communication	PEANUTS: A Persistent Memory-Based Network Unilateral Transfer System for Enhanced MPI-IO Data Transfer (Artifact)
online learning	Efficient Coupling Streaming AI and Ensemble Simulations on HPC Clusters
OpenMP	OMPGPT: A Generative Pre-trained Transformer Model for OpenMP
Operation Fusion	ADE-HGNN: Accelerating HGNNs through Attention Disparity Exploitation
Optimistic Synchronisation	Distributed Simulation for Digital Twins of Large-Scale Real-World DiffServ-Based Networks
Optimization	ESIMD GPU implementations of Deep Learning Sparse Matrix Kernels
Ownership Verification	VeriChroma: Ownership Verification for Federated Models via RGB Filters
P
PageRank algorithm	DF* PageRank: Incrementally Expanding Approaches for Updating PageRank on Dynamic Graphs (Artifact)
Parallel	OMPGPT: A Generative Pre-trained Transformer Model for OpenMP
parallel algorithm	Pipe-AGCM: A Fine-grain Pipelining Scheme for Optimizing the Parallel Atmospheric General Circulation Model
Parallel algorithms	ALZI: An Improved Parallel Algorithm for Finding Connected Components in Large Graphs
	DF* PageRank: Incrementally Expanding Approaches for Updating PageRank on Dynamic Graphs (Artifact)
Parallel computing	FedGG: Leveraging Generative Adversarial Networks and Gradient Smoothing for Privacy Protection in Federated Learning
Parallel Discrete Event Simulation	Distributed Simulation for Digital Twins of Large-Scale Real-World DiffServ-Based Networks
Parallel file system	AdapCK: Optimizing I/O for Checkpointing on Large-scale High Performance Computing Systems
	Fault tolerant in the Expand Ad-Hoc parallel file system (Artifact)
	Scheduling distributed I/O resources in HPC systems
parallel I/O	Scheduling distributed I/O resources in HPC systems
Parallel Programming	MPR: An MPI Framework for Distributed Self-Adaptive Stream Processing
Parallel Region Classification	Efficient Code Region Characterization through Automatic Performance Counters Reduction using Machine Learning Techniques
Parallel writing	Parallel Writing of Nested Data in Columnar Formats (Artifact)
Performance and energy efficiency	Harnessing Data Movement Strategies to Optimize Performance-Energy Efficiency of Oil & Gas Simulations in HPC
Performance Counters	Efficient Code Region Characterization through Automatic Performance Counters Reduction using Machine Learning Techniques
Performance optimization	Accelerating Large-Scale Sparse LU Factorization for RF Circuit Simulation
Performance projection	Predicting GPU kernel's performance on upcoming architectures
Performance tools	A Mechanism to Generate Interception Based Tools for HPC Libraries
Persistent Memory	PEANUTS: A Persistent Memory-Based Network Unilateral Transfer System for Enhanced MPI-IO Data Transfer (Artifact)
	A High-Performance Collective I/O Framework Leveraging Node-Local Persistent Memory
Pipeline Parallelism	CSIMD: Cross-Search Algorithm with Improved Multi-Dimensional Dichotomy for Micro-batch-based Pipeline Parallel Training in DNN
pipelining scheme	Pipe-AGCM: A Fine-grain Pipelining Scheme for Optimizing the Parallel Atmospheric General Circulation Model
Platform development	E4 at the forefront of European HPC
Poisoning Attacks	FLUK: Protecting Federated Learning against Malicious Clients for Internet of Vehicles
Polynomial Multiplication	A Folded Computation-in-Memory Accelerator for Fast Polynomial Multiplication in BIKE
Post-Quantum Cryptography	A Folded Computation-in-Memory Accelerator for Fast Polynomial Multiplication in BIKE
Power capping	Light-weight prediction for improving energy consumption in HPC platforms (Artifact)
Power-Law Graph	Vectorizing Sparse Blocks of Graph Matrices for SpMV
precedence constraint	Makespan Minimization for Scheduling on Heterogeneous Platforms with Precedence Constraints
Prefetching	Combining Compression and Prefetching to Improve Checkpointing for Inverse Seismic Problems in GPUs
Privacy Preservation	PriCE: Privacy-Preserving and Cost-Effective Scheduling for Parallelizing the Large Medical Image Processing Workflow over Hybrid Clouds
Privacy Protection	FedGG: Leveraging Generative Adversarial Networks and Gradient Smoothing for Privacy Protection in Federated Learning
	Accelerating Stencil Computation with Fully Homomorphic Encryption Using GPU
	Lightweight Byzantine-Robust and Privacy-Preserving Federated Learning
Processing in Memory	(re)Assessing PiM Effectiveness for Sequence Alignment
Processor micro-architectures	Exploring processor micro-architectures optimised for BLAS3 micro-kernels
Programming Abstractions	MPR: An MPI Framework for Distributed Self-Adaptive Stream Processing
Pruning	WActiGrad: Structured Pruning for Efficient Finetuning and Inference of Large Language Models on AI Accelerators
	Athena: Add More Intelligence to RMT-based Network Data Plane with Low-bit Quantization
PyCOMPSs	GPU Cache System for COMPSs: A Task-Based Distributed Computing Framework
	Making easier the life-cycle management of complex application workflows
Q
Quantization	Athena: Add More Intelligence to RMT-based Network Data Plane with Low-bit Quantization
Quotienting	Compact Parallel Hash Tables on the GPU (Artifact)
R
randomized algorithms	Mixed precision randomized low-rank approximation with GPU tensor cores
RDMA	Efficient RNIC Cache Side-channel Attack Detection through DPU-driven Architecture
Redis	EKRM: Efficient Key-Value Retrieval Method to Reduce Data Lookup Overhead for Redis
Reference counting	A Fast Wait-Free Solution to Read-Reclaim Races in Reference Counting (Artifact)
relaxed semantics	How to Relax Instantly: Elastic Relaxation of Concurrent Data Structures (Artifact)
Reliability Engineering	LogRCA: Log-based Root Cause Analysis for Distributed Services
Resistive random access memory	Watt: A Write-optimized RRAM-based Accelerator for Attention
resource allocation	Scheduling distributed I/O resources in HPC systems
Resource Management	DProbe: Profiling and Predicting Multi-Tenant Deep Learning Workloads for GPU Resource Scaling
	Light-weight prediction for improving energy consumption in HPC platforms (Artifact)
response time analysis	Deadline-driven Enhancements and Response Time Analysis of ROS2 Multi-threaded Executors
Restricted Assignment	Solving the Restricted Assignment Problem to Schedule Multi-Get Requests in Key-Value Stores (Artifact)
Reverse Time Migration	Combining Compression and Prefetching to Improve Checkpointing for Inverse Seismic Problems in GPUs
RF circuit simulation	Accelerating Large-Scale Sparse LU Factorization for RF Circuit Simulation
RISC-V (RVV)	Inference with Transformer Encoders on ARM and RISC-V Multicore Processors
RMT Pipeline	Athena: Add More Intelligence to RMT-based Network Data Plane with Low-bit Quantization
Roofline model	Predicting GPU kernel's performance on upcoming architectures
ROOT	Parallel Writing of Nested Data in Columnar Formats (Artifact)
Root Cause Analysis	LogRCA: Log-based Root Cause Analysis for Distributed Services
ROS2 Multi-threaded Executor	Deadline-driven Enhancements and Response Time Analysis of ROS2 Multi-threaded Executors
ROSS	Distributed Simulation for Digital Twins of Large-Scale Real-World DiffServ-Based Networks
RISC-V	ImSPU: Implicit Sharing of Computation Resources between Vector and Scalar Processing Units
RTL simulation	TaroRTL: Accelerating RTL Simulation using Coroutine-based Heterogeneous Task Graph Scheduling
S
satellite constellation	Hurry: Dynamic Collaborative Framework For Low-orbit Mega-Constellation Data Downloading
satellite downloading	Hurry: Dynamic Collaborative Framework For Low-orbit Mega-Constellation Data Downloading
satellite network	Hurry: Dynamic Collaborative Framework For Low-orbit Mega-Constellation Data Downloading
Scheduling	Makespan Minimization for Scheduling on Heterogeneous Platforms with Precedence Constraints
	Efficient Coupling Streaming AI and Ensemble Simulations on HPC Clusters
	PriCE: Privacy-Preserving and Cost-Effective Scheduling for Parallelizing the Large Medical Image Processing Workflow over Hybrid Clouds
	Solving the Restricted Assignment Problem to Schedule Multi-Get Requests in Key-Value Stores (Artifact)
	Automated Data Management and Learning-based Scheduling for Ray-based Hybrid HPC-Cloud Systems
	Scheduling distributed I/O resources in HPC systems
	TaroRTL: Accelerating RTL Simulation using Coroutine-based Heterogeneous Task Graph Scheduling
Scheduling with rejection	A 1.25(1+ε)-Approximation Algorithm for Scheduling with Rejection Costs Proportional to Processing Times (Artifact)
Scientific workflows	A Framework for Automated Parallel Execution of Scientific Multi-Workflow Applications in the Cloud with Work Stealing
Self-adaptive	MPR: An MPI Framework for Distributed Self-Adaptive Stream Processing
sequence alignment	(re)Assessing PiM Effectiveness for Sequence Alignment
Serverful	Context-aware Runtime Type Prediction for Heterogeneous Microservices
Serverless	Context-aware Runtime Type Prediction for Heterogeneous Microservices
Service placement	Optimizing Service Replication and Placement for IoT Applications in Fog Computing Systems
Service replication	Optimizing Service Replication and Placement for IoT Applications in Fog Computing Systems
Shared-memory systems	ALZI: An Improved Parallel Algorithm for Finding Connected Components in Large Graphs
SIMD	VLASPH: Smoothed Particle Hydrodynamics on VLA SIMD Architectures
	Vectorizing Sparse Blocks of Graph Matrices for SpMV
	ImSPU: Implicit Sharing of Computation Resources between Vector and Scalar Processing Units
SIMD/Vector instructions	Exploring processor micro-architectures optimised for BLAS3 micro-kernels
Similarity	Watt: A Write-optimized RRAM-based Accelerator for Attention
Simulation	Light-weight prediction for improving energy consumption in HPC platforms (Artifact)
Single Shared File	PEANUTS: A Persistent Memory-Based Network Unilateral Transfer System for Enhanced MPI-IO Data Transfer (Artifact)
Software Coupling	Code Generation for Octree-Based Multigrid Solvers with Fused Higher-Order Interpolation and Communication
Sparse Matrix Operations	ESIMD GPU implementations of Deep Learning Sparse Matrix Kernels
Sparse matrix reordering	Accelerated Block-Sparsity-Aware Matrix Reordering for Leveraging Tensor Cores in Sparse Matrix-Multivector Multiplication (Artifact)
sparse matrix-vector product (SpMV)	Reduced-Precision and Reduced-Exponent Formats for Adaptive-Precision Sparse Matrix-Vector Product
SPH	VLASPH: Smoothed Particle Hydrodynamics on VLA SIMD Architectures
SpMM	Accelerated Block-Sparsity-Aware Matrix Reordering for Leveraging Tensor Cores in Sparse Matrix-Multivector Multiplication (Artifact)
SpMV	Vectorizing Sparse Blocks of Graph Matrices for SpMV
Staleness	A Joint Approach to Local Updating and Gradient Compression for Efficient Asynchronous Federated Learning
Stencil Computation	Accelerating Stencil Computation with Fully Homomorphic Encryption Using GPU
Storage System	PEANUTS: A Persistent Memory-Based Network Unilateral Transfer System for Enhanced MPI-IO Data Transfer (Artifact)
Stream Processing	MPR: An MPI Framework for Distributed Self-Adaptive Stream Processing
Synchronization	FlexiGran: Flexible Granularity Locking in Hierarchies
Systems performance	Supporting HPC Centers: challenges, horror stories and best practices
Systolic array	FakeGuard: A Novel Accelerator Architecture for Deepfake Detection Networks
T
Task graph parallelism	TaroRTL: Accelerating RTL Simulation using Coroutine-based Heterogeneous Task Graph Scheduling
Tensor Core	PCTC: Hardware and Software Co-Design for Pruned Capsule Networks on Tensor Cores
Tensor Cores	Mixed precision randomized low-rank approximation with GPU tensor cores
	Accelerated Block-Sparsity-Aware Matrix Reordering for Leveraging Tensor Cores in Sparse Matrix-Multivector Multiplication (Artifact)
Time Aware Shaper	Node Bundle Scheduling: An Ultra-Low Latency Traffic Scheduling Algorithm for TAS-based Time-Sensitive Networks
Time-Sensitive Networking	Node Bundle Scheduling: An Ultra-Low Latency Traffic Scheduling Algorithm for TAS-based Time-Sensitive Networks
Tools interface	A Mechanism to Generate Interception Based Tools for HPC Libraries
Toom-Cook	Communication Minimizing Toom-Cook Algorithms
Toom-Graph	Communication Minimizing Toom-Cook Algorithms
Traffic Scheduling	Node Bundle Scheduling: An Ultra-Low Latency Traffic Scheduling Algorithm for TAS-based Time-Sensitive Networks
Transformers	Inference with Transformer Encoders on ARM and RISC-V Multicore Processors
Translation lookaside buffer	EKRM: Efficient Key-Value Retrieval Method to Reduce Data Lookup Overhead for Redis
V
Vertex Pruning	ADE-HGNN: Accelerating HGNNs through Attention Disparity Exploitation
Video Encoding	Cloud-native GPU-enabled architecture for parallel video encoding
Vision Transformer	FakeGuard: A Novel Accelerator Architecture for Deepfake Detection Networks
VLA	VLASPH: Smoothed Particle Hydrodynamics on VLA SIMD Architectures
W
Wait-free	A Fast Wait-Free Solution to Read-Reclaim Races in Reference Counting (Artifact)
Work stealing	A Framework for Automated Parallel Execution of Scientific Multi-Workflow Applications in the Cloud with Work Stealing
Workload Characterization	DProbe: Profiling and Predicting Multi-Tenant Deep Learning Workloads for GPU Resource Scaling
Workload Prediction	DProbe: Profiling and Predicting Multi-Tenant Deep Learning Workloads for GPU Resource Scaling
workload-aware dynamic scheduler	Watt: A Write-optimized RRAM-based Accelerator for Attention
Wrapper based tools	A Mechanism to Generate Interception Based Tools for HPC Libraries