A GPU Architecture Aware Fine-Grain Pruning Technique for Deep Neural Networks
The model size and computation requirements of Deep Neural Networks (DNNs) have kept increasing as their applications to various real-life use cases, e.g., autonomous driving, become more pervasive and popular. While DNN workloads are in many cases executed on Graphics Processing Units (GPUs), it is not trivial to improve inference speed with the conventional DNN weight pruning technique, due to the parallel architecture of GPUs. On the other hand, coarse-grain pruning, also known as structured sparsity or structured pruning, can speed up inference, but causes significant accuracy loss. In this paper, we propose two fine-grain DNN pruning techniques that are aware of the underlying GPU architecture. The hierarchical architecture of the GPU's parallel processing elements and memory is analyzed to enable the finest possible pruning in which the removed weights can be safely skipped during inference. The effectiveness of the proposed techniques has been evaluated with VGG16. Compared to existing pruning techniques, the proposed methods achieve significantly higher inference speed with a smaller accuracy drop.