Title: Fault-Propagation Analysis of GPU Tensor Cores for Machine Learning in Space

Conference: SMC-IT/SCC 2025

Tags: COTS, Fault Injection, GPUs, Machine Learning, Onboard Processing, Reliability, Tensor Cores

Abstract: As the space industry looks to increase onboard processing in future missions, graphics processing units (GPUs) are becoming prominent. However, GPUs are relatively new to the space domain, and the resilience of their microarchitectures to radiation is therefore not fully understood. NVIDIA GPUs, in particular, often use specialized processing units called Tensor Cores to accelerate the data-intensive calculations embedded within many machine-learning models. Integrating Tensor Cores into spacecraft can significantly increase the performance of onboard training and inference tasks. Through software-based fault injection, the reliability of Tensor Cores within NVIDIA GPUs is analyzed to determine how radiation effects could degrade their performance. Using methods adapted from NVIDIA's fault-injection tool, transient faults are injected into Tensor Core kernels to simulate such radiation effects. A fault-injection campaign is run on kernels found within common space applications, including image classification and semantic segmentation. Results show that faults can spread during computation and that the resulting errors persist in subsequent operations. Additionally, the injection campaign exposes fault patterns in Tensor Cores and illustrates realized faults in semantic segmentation layers. The persistence and severity of faults justify the need for fault-mitigation strategies for commercial-off-the-shelf devices to enable reliable machine learning in space applications.
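The spreading behavior described in the abstract can be illustrated with a minimal CPU-side sketch. It is not the paper's tooling or NVIDIA's fault-injection tool: it assumes a transient fault can be modeled as a single bit flip in the half-precision (FP16) encoding of one matrix-multiply operand, and the helper names (`flip_bit_fp16`, `matmul`, `corrupted_positions`), the 4x4 matrices, and the chosen bit are all hypothetical. The matrix multiply is written in plain Python and stands in for a Tensor Core operation only conceptually.

```python
import struct


def flip_bit_fp16(value, bit):
    """Flip one bit (0 = LSB, 15 = sign) in the IEEE 754 half-precision encoding of value."""
    (raw,) = struct.unpack("<H", struct.pack("<e", value))
    (faulty,) = struct.unpack("<e", struct.pack("<H", raw ^ (1 << bit)))
    return faulty


def matmul(a, b):
    """Plain-Python matrix multiply, C = A x B, used here as a stand-in kernel."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for row in a]


def corrupted_positions(golden, faulty):
    """List the (row, column) indices where the faulty output differs from the golden one."""
    return [(i, j)
            for i, row in enumerate(golden)
            for j, expected in enumerate(row)
            if faulty[i][j] != expected]


# Hypothetical operands: every value is exactly representable in FP16.
a = [[1.0, 2.0, 3.0, 4.0] for _ in range(4)]
b = [[1.0] * 4 for _ in range(4)]

golden = matmul(a, b)  # fault-free reference run

# Inject a single transient fault: flip the top exponent bit of one element of A.
a_faulty = [row[:] for row in a]
a_faulty[1][2] = flip_bit_fp16(a_faulty[1][2], 14)

corrupted = corrupted_positions(golden, matmul(a_faulty, b))
print(corrupted)
```

Because element `A[1][2]` contributes to every output in row 1 of `C`, one flipped bit corrupts the whole row in this sketch. If `C` were then fed into a further layer, those corrupted values would carry forward, which is the kind of persistence the abstract reports.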