
Deep Learning Workload Performance Auto-Optimizer

EasyChair Preprint no. 2817

4 pages • Date: February 29, 2020


The industry has seen a wave of new domain-specific accelerators purpose-built for deep learning workloads. To bring real-world performance close to an accelerator's theoretical peak, the tensor layout and workload distribution need to be optimized together with the accelerator's instruction set, communication fabric, and memory architecture. In this paper, we introduce a general methodology for automating hardware architecture and software co-optimization for domain-specific accelerators. Applied to the Intel® Nervana™ Neural Network Processor for Training (Intel® Nervana™ NNP-T), this methodology achieves state-of-the-art (SOTA) deep learning microbenchmark performance on convolution benchmarks. We also discuss a generic convolution context distribution algorithm developed from the auto-optimizer results for ResNet50.

Keyphrases: deep learning, Domain-Specific Accelerator, hardware-software co-optimization, locality, parallelism

BibTeX entry
BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:
@booklet{EasyChair:2817,
  author = {Connie Y. Miao and Andrew Yang and Michael J. Anderson},
  title = {Deep Learning Workload Performance Auto-Optimizer},
  howpublished = {EasyChair Preprint no. 2817},
  year = {EasyChair, 2020}}