Tags: Depthwise separable convolution, hardware accelerator, memory access and PE utilization
Abstract:
In this paper, we present a hardware accelerator for depthwise separable convolution (DSC) that achieves 100% utilization of the processing element (PE) array for depthwise convolution (DWC) and up to 98% utilization for pointwise convolution (PWC), while also reducing latency. By partitioning the input feature map (ifmap) SRAM of the DWC into three banks, we minimize memory access and maximize data reuse. The input activations and weights need to be loaded from SRAM to the PEs only once for both DWC and PWC. Additionally, to support efficient operation across different layers, we present a layerwise matching method. The proposed DSC accelerator is implemented in 22 nm FDSOI technology and validated using MobileNetV1 on the CIFAR-10 dataset. Post-layout results show that the proposed accelerator operates at 1 GHz and achieves an energy efficiency of 5.07 (3.96) TOPS/W and an area efficiency of 519.2 (461.52) GOPS/mm² for DWC (PWC) at 0.8 V. When the supply voltage is scaled down to 0.5 V, the energy efficiency increases to 13.64 TOPS/W for DWC and 10.64 TOPS/W for PWC.
An Energy-Efficient and Area-Efficient Depthwise Separable Convolution Accelerator with Minimal on-Chip Memory Access
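
As background for the DWC/PWC split named in the abstract, the following is a minimal sketch of the standard multiply-accumulate (MAC) counts for one DSC layer compared with a regular convolution. The function name `dsc_macs` and the layer shape used below are hypothetical and chosen only for illustration; they are not taken from the paper or from its MobileNetV1 configuration.

```python
def dsc_macs(h, w, c_in, c_out, k=3):
    """Return (dwc, pwc, standard) MAC counts for an h x w output feature map.

    dwc:      one k x k filter per input channel (depthwise)
    pwc:      1 x 1 convolution mixing c_in channels into c_out (pointwise)
    standard: regular k x k convolution with c_in inputs and c_out outputs
    """
    dwc = h * w * c_in * k * k
    pwc = h * w * c_in * c_out
    standard = h * w * c_in * c_out * k * k
    return dwc, pwc, standard


# Hypothetical layer shape (an assumption, not a figure from the paper).
dwc, pwc, standard = dsc_macs(h=32, w=32, c_in=32, c_out=64, k=3)

# The DSC-to-standard ratio simplifies to 1/c_out + 1/k^2.
ratio = (dwc + pwc) / standard
print(f"DWC MACs: {dwc}, PWC MACs: {pwc}, DSC/standard ratio: {ratio:.4f}")
```

Under these assumed dimensions the PWC stage carries most of the MACs, which is one reason DWC and PWC are commonly reported separately.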