Offloading collective communication operations to hardware platforms is becoming increasingly common in the research space. This paper presents the offloading of the All-Reduce Sum collective communication operation to programmable logic using the PyTorch Distributed library. Because PyTorch is widely accessible, this allows existing PyTorch code bases to be adapted with little to no code changes. The programmable logic handles the computation and intermediate communication, reducing both the load on software and the number of communications it must handle. Furthermore, the hardware design is self-contained: even without the PyTorch-based user interface, once the relevant data arrives at the hardware it can carry out the full reduction and output the results. This also enables the hardware design to be used as an intermediary accelerator for edge data. We use NetFPGA as the hardware platform given its wide availability, enabling users to benefit from the acceleration with little effort. The current design focuses on the All-Reduce Sum operation, as it is required in most distributed learning applications and accounts for a large fraction of the network latency. Our overall design goes head-to-head with GPU-based NCCL at 296.078 us, and as a standalone accelerator, computation and communication take only 56.012 us.
Offloading PyTorch Collective Operations to Independent Programmable Logic at the Edge
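As an illustrative sketch only (not taken from the paper), the abstract's "little to no code changes" claim refers to the fact that user code issues the same torch.distributed calls regardless of where the collective is executed; the backend string passed to init_process_group is the hypothetical point where an offloaded process-group backend would be substituted. The single-process "gloo" rendezvous below is just so the snippet runs standalone.

```python
import os
import torch
import torch.distributed as dist


def main() -> None:
    # Single-process rendezvous so the sketch runs standalone; a real job
    # would obtain RANK/WORLD_SIZE from its launcher (e.g. torchrun).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    # "gloo" is used here only as a stand-in; an offloaded design would
    # register and select its own backend at this single call site.
    dist.init_process_group(backend="gloo", rank=0, world_size=1)

    grads = torch.ones(4)
    # The only collective the user code issues: an in-place All-Reduce Sum.
    dist.all_reduce(grads, op=dist.ReduceOp.SUM)
    print(grads)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```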