Overcoming Generalization Gap in Large Mini-Batch Training of Deep Neural Networks
SPEAKER: unknown
ABSTRACT. Large mini-batch training is essential for parallelizing the training of deep neural networks. However, it has been empirically observed that neural networks trained with large mini-batches generalize poorly. Recent observations suggest that the minima found by large mini-batch training tend to lie in sharper regions of the loss landscape, which are known to have poor generalization ability. In this research, we propose a method to close this generalization gap by adding Gaussian noise to the gradient during the parameter update of Stochastic Gradient Descent (SGD).
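The proposed update can be sketched as follows. This is a minimal illustration of adding Gaussian noise to the gradient in an SGD step; the noise scale `noise_std` and its schedule are hypothetical, since the abstract does not specify how the noise is scaled.

```python
import numpy as np

def sgd_step_with_noise(params, grads, lr=0.1, noise_std=0.01, rng=None):
    """One SGD update with Gaussian noise added to each gradient.

    `noise_std` is an illustrative hyperparameter, not taken from the
    abstract, which does not specify the noise magnitude.
    """
    rng = rng or np.random.default_rng(0)
    updated = []
    for p, g in zip(params, grads):
        noisy_g = g + rng.normal(0.0, noise_std, size=g.shape)
        updated.append(p - lr * noisy_g)
    return updated
```

With `noise_std=0` this reduces to plain SGD, so the noise term is the only change to the optimizer.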
Evaluation of low-latency ring communication technique for reliability of SDN-MPI_Bcast
SPEAKER: unknown
ABSTRACT. Broadcast is an essential collective communication in parallel computing on distributed-memory HPC systems. Our previous work demonstrated acceleration of MPI_Bcast by dynamically configuring a delivery-tree path from a source process to the others on a Software-Defined Network (SDN), in order to evaluate the feasibility of dynamically controlling packet flows in MPI communication. The prototype implementation in that work had a technical problem with data delivery in practical use, partly because it relied on unreliable one-to-many communication. In this research, we apply a ring communication method to make our SDN-MPI_Bcast implementation reliable. In this method, after receiving data through the delivery tree, each process sends the data over reliable TCP communication to its next neighbouring process on a virtual ring topology. The key technical feature of the method is the generation of a low-latency, collision-avoiding virtual ring topology by combining network topology information with the physical placement of processes. The evaluation conducted in this research shows that the prototype SDN-MPI_Bcast performs faster with this ring communication than with the process-rank-based ring topology of the related work.
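The placement-aware ring construction might look like the following sketch. It is an illustrative heuristic only, assuming that keeping ranks on the same node adjacent in the ring reduces inter-node hops; the authors' actual algorithm also uses switch-level network topology information, which is omitted here.

```python
def build_ring(process_placement):
    """Order ranks into a virtual ring that keeps ranks on the same node
    adjacent, so most ring hops stay intra-node.

    process_placement: dict mapping rank -> node name.
    Returns the ring order and a rank -> next-rank table.
    This is an illustrative heuristic, not the authors' algorithm.
    """
    by_node = {}
    for rank, node in sorted(process_placement.items()):
        by_node.setdefault(node, []).append(rank)
    # concatenate per-node rank lists: same-node ranks become neighbours
    ring = [r for node in sorted(by_node) for r in by_node[node]]
    # each rank forwards received data to the next rank, wrapping around
    next_of = {ring[i]: ring[(i + 1) % len(ring)] for i in range(len(ring))}
    return ring, next_of
```

For example, four ranks placed alternately on two nodes are reordered so that the ring crosses between nodes only twice instead of four times.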
Inside ABCI: Container-based Software Management for Scalable Distributed Deep Learning
SPEAKER: unknown
ABSTRACT. Deploying distributed deep learning software requires managing complicated software dependencies, e.g., CUDA, cuDNN, NCCL, and OpenMPI. However, manually installing these software collections on shared large-scale computing resources, such as traditional supercomputers, demands significant effort and makes it difficult to keep up with rapidly evolving, state-of-the-art deep learning software. Here, we demonstrate container-based software management using the Singularity HPC container runtime and Docker with Univa Grid Engine for ABCI (AI Bridging Cloud Infrastructure), a system with 4,352 NVIDIA Tesla V100 GPUs that is planned to start production operation in 2018. Our early prototype supports various distributed deep learning frameworks, including ChainerMN, CNTK, Caffe2, MXNet, and Horovod with TensorFlow.
vGASNet: Scalable RMA-based Communication Library for Out-of-core Data Processing
SPEAKER: unknown
ABSTRACT. Remote Memory Access (RMA) is known as a methodology to ease distributed programming.
Some interfaces and libraries like MPI-3 and GASNet accommodate RMA functionalities.
Unfortunately, few of the libraries that accommodate RMA functionalities support out-of-core data processing.
Therefore, we developed vGASNet, a novel RMA-based communication library supporting out-of-core data processing.
vGASNet treats node-local SSDs as main memory and uses part of DRAM as a cache.
For performance improvement, vGASNet adopts a cache mechanism called cooperative-caching.
Cooperative-caching enables each node to access the caches stored on other nodes, not only its own.
In this poster, we introduce the cooperative-caching mechanism and its effectiveness.
Additionally, we integrated vGASNet with an existing framework, namely UPC++.
We also report the performance of our UPC++ integration.
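The cooperative-caching idea can be sketched as below: on a local cache miss, a node first checks the DRAM caches of peer nodes before falling back to the slow SSD-backed store. All class and method names are illustrative, not vGASNet's actual API, and real remote accesses would of course use RMA rather than Python object references.

```python
class CooperativeCache:
    """Minimal sketch of cooperative caching (names are illustrative)."""

    def __init__(self, node_id, peers, backing_store):
        self.node_id = node_id
        self.peers = peers            # other CooperativeCache instances
        self.local = {}               # this node's DRAM cache
        self.backing = backing_store  # stands in for the node-local SSD

    def get(self, key):
        if key in self.local:         # local DRAM hit
            return self.local[key]
        for peer in self.peers:       # cooperative: remote DRAM hit
            if key in peer.local:
                value = peer.local[key]
                self.local[key] = value
                return value
        value = self.backing[key]     # miss: fetch from SSD-backed store
        self.local[key] = value
        return value
```

The benefit is that a datum cached anywhere in the cluster's DRAM can be served without touching an SSD.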
An Out-of-core CPU-GPU Cooperative B&B Solver for the Large Knapsack Problem
SPEAKER: unknown
ABSTRACT. We propose an out-of-core CPU-GPU cooperative branch and bound (B&B) solver for the binary knapsack problem. To solve large problems that produce more subproblems than fit in GPU memory, the proposed solver dynamically swaps subproblems out to CPU memory. We adopt two strategies to eliminate the data transfer overhead: (1) a GPU-based stream compaction strategy that reduces the sparseness of arrays, minimizing the amount of CPU-GPU data transfer, and (2) a double buffering strategy that completely hides the data transfer overhead by overlapping data transfer with GPU-based B&B operations. Furthermore, to exploit CPU cores, we propose a CPU-GPU cooperative scheme in which CPU cores process subproblems in parallel, simultaneously with the GPU.
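Stream compaction, the first strategy, can be illustrated in plain Python as follows. The sketch computes an exclusive prefix sum over the "active" flags to assign each surviving subproblem a dense output index, so only the compact prefix needs to be transferred; on a GPU this scan would run in parallel.

```python
import numpy as np

def compact(subproblems, active):
    """Stream-compaction sketch: pack active elements contiguously.

    An exclusive prefix sum over the active flags gives each surviving
    element its output position, exactly as a parallel GPU scan would.
    """
    flags = np.asarray(active, dtype=np.int64)
    pos = np.cumsum(flags) - flags          # exclusive prefix sum
    out = np.empty(int(flags.sum()), dtype=np.asarray(subproblems).dtype)
    for i, keep in enumerate(active):
        if keep:
            out[pos[i]] = subproblems[i]
    return out
```

After compaction, the sparse array of pruned and live subproblems becomes a dense array, shrinking the CPU-GPU transfer volume to just the live entries.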
Runtime GPU Memory Optimization for Supporting Large Neural Networks on Chainer
SPEAKER: unknown
ABSTRACT. Neural networks (NNs), computational models composed of multiple layers, have achieved high accuracy in many fields. To accelerate NN computations, GPUs are widely used in machine learning frameworks such as Chainer. However, the sizes of the NNs that can be computed are limited by GPU memory capacity. A general approach for processing data that exceeds GPU memory capacity is to swap data out to CPU memory, but the overhead of this data movement cannot be ignored, so reducing it is an important issue. This poster describes the design and implementation of an extension of Chainer that supports computing NNs exceeding GPU memory capacity by using CPU memory. As our basic approach, the data of each layer are swapped between CPU memory and GPU memory. In addition, to reduce communication overhead, which data to swap and the timing of each swap are optimized based on runtime profiling. We successfully computed an NN requiring more than 56 GB of memory on a single GPU with 16 GB of memory. Compared with the original Chainer, performance degradation was less than 14%.
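The swap-selection step might be sketched as below. This is only an illustrative greedy heuristic, assuming profiled per-layer memory sizes are available: it swaps out the largest layers first until the remaining working set fits in GPU memory. The actual Chainer extension also optimizes the timing of each swap from runtime profiles, which this sketch does not model.

```python
def plan_swaps(layer_mem, gpu_capacity):
    """Choose layers to swap out to CPU memory (illustrative heuristic).

    layer_mem: dict mapping layer name -> memory size (e.g., in GB),
    as might be gathered by runtime profiling.
    Greedily swaps out the largest layers until the rest fits in
    gpu_capacity. Returns (layers_to_swap, resident_memory).
    """
    total = sum(layer_mem.values())
    swap_out = []
    for name, size in sorted(layer_mem.items(), key=lambda kv: -kv[1]):
        if total <= gpu_capacity:
            break
        swap_out.append(name)   # this layer's data lives in CPU memory
        total -= size
    return swap_out, total
```

Swapping the largest layers first minimizes the number of swapped layers, and hence the number of CPU-GPU transfers that must be overlapped with computation.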