CFP
ScaDL 2019: Scalable Deep Learning over Parallel and Distributed Infrastructures
Website: https://sites.google.com/site/scadlworkshop/
Submission link: https://easychair.org/conferences/?conf=scadl2019
Abstract registration deadline: January 25, 2019
Submission deadline: January 25, 2019
Scope of the Workshop
Deep Learning (DL) has recently received tremendous attention in the research community because of the impressive results it has obtained on a large number of machine learning problems. The success of state-of-the-art deep learning systems relies on training deep neural networks over massive amounts of training data, which typically requires a large-scale distributed computing infrastructure. Running these jobs in a scalable and efficient manner, on cloud infrastructure or dedicated HPC systems, raises several research topics that are specific to DL. The sheer size and complexity of deep learning models trained over large amounts of data make them hard to converge in a reasonable amount of time, and demand advances along multiple research directions such as model/data parallelism, model/data compression, distributed optimization algorithms for DL convergence, synchronization strategies, efficient communication, and specialized hardware acceleration.
To give a few concrete examples, we seek to advance the following research directions:
- Asynchronous and Communication-Efficient SGD: Stochastic gradient descent (SGD) is at the core of large-scale machine learning. Parallelizing SGD gradient computation across multiple nodes increases the data processed per iteration, but exposes training to communication and synchronization delays and to unpredictable node failures. There is therefore a critical need for robust and scalable distributed SGD methods that achieve fast error convergence despite such system variability.
- High-Performance Computing Aspects: Deep learning is highly compute intensive. Algorithms for kernel computations on commonly used accelerators (e.g., GPUs), efficient techniques for communicating gradients, and fast loading of data from storage are critical for training performance.
- Model and Gradient Compression Techniques: Techniques such as pruning weights and reducing the size of weight tensors help lower compute complexity, while lower-bit representations make more efficient use of memory and communication bandwidth (see the sketch after this list).
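To make the averaging and compression ideas above concrete, below is a minimal, self-contained sketch in NumPy of one synchronous data-parallel SGD step with top-k gradient sparsification. The function names, the compression ratio, and the toy dimensions are illustrative assumptions for this CFP only, not a reference implementation from the organizers.

    # Illustrative sketch only: top-k gradient sparsification combined with
    # data-parallel gradient averaging, in plain NumPy. All names and
    # hyper-parameters here are hypothetical placeholders.
    import numpy as np

    def top_k_sparsify(grad, ratio=0.01):
        """Keep only the largest-magnitude `ratio` fraction of gradient entries."""
        flat = grad.ravel()
        k = max(1, int(ratio * flat.size))
        idx = np.argpartition(np.abs(flat), -k)[-k:]   # indices of the top-k entries
        return idx, flat[idx], grad.shape              # sparse payload a worker would send

    def desparsify(idx, values, shape):
        """Rebuild a dense gradient from the sparse (index, value) payload."""
        dense = np.zeros(int(np.prod(shape)))
        dense[idx] = values
        return dense.reshape(shape)

    def data_parallel_step(w, worker_grads, lr=0.1, ratio=0.01):
        """One synchronous SGD step: each worker compresses its gradient,
        the aggregator averages the decompressed gradients and updates w."""
        compressed = [top_k_sparsify(g, ratio) for g in worker_grads]
        avg_grad = np.mean([desparsify(*c) for c in compressed], axis=0)
        return w - lr * avg_grad

    # Toy usage: 4 workers, each holding a gradient for a 1000-dim weight vector.
    rng = np.random.default_rng(0)
    w = rng.normal(size=1000)
    grads = [rng.normal(size=1000) for _ in range(4)]
    w = data_parallel_step(w, grads)

In a real system the (index, value) pairs are what each worker communicates, so the payload shrinks roughly by the chosen ratio; mechanisms such as error feedback, omitted here for brevity, are usually needed to preserve convergence under aggressive sparsification.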
Topics of Interest
In this workshop we solicit research papers on distributed deep learning that aim to achieve efficiency and scalability for deep learning jobs over distributed and parallel systems. Papers focusing on algorithms as well as on systems are welcome. We invite authors to submit papers of up to 10 pages in length in IEEE conference format. Relevant topics include, but are not limited to:
- Deep learning on HPC systems
- Deep learning for edge devices
- Model-parallel and data-parallel techniques
- Asynchronous SGD for training DNNs
- Communication-efficient training of DNNs
- Model/data/gradient compression
- Learning in resource-constrained environments
- Coding techniques for straggler mitigation
- Elasticity for deep learning jobs / spot market enablement
- Hyper-parameter tuning for deep learning jobs
- Hardware acceleration for deep learning
- Scalability of deep learning jobs on a large number of nodes
- Deep learning on heterogeneous infrastructure
- Efficient and scalable inference
- Data storage/access in shared networks for deep learning jobs
Committees
General Chairs
- Gauri Joshi, Carnegie Mellon University (gaurijATandrew.cmu.edu)
- Ashish Verma, IBM Research AI (ashish.verma1ATus.ibm.com)
Program Chairs
- Yogish Sabharwal, IBM Research AI
- Parijat Dube, IBM Research AI
Local Chair
- Eduardo Rodrigues, IBM Research
Steering Committee
- Vijay K. Garg, University of Texas at Austin
- Vinod Muthuswamy, IBM Research AI
Program Committee
- Alvaro Coutinho, Federal University of Rio de Janeiro
- Dimitris Papailiopoulos, University of Wisconsin-Madison
- Esteban Meneses, Costa Rica Institute of Technology
- Kangwook Lee, KAIST
- Li Zhang, IBM Research
- Lydia Chen, TU Delft
- Philippe Navaux, Federal University of Rio Grande do Sul
- Rahul Garg, Indian Institute of Technology Delhi
- Vikas Sindhwani, Google Brain
- Wei Zhang, IBM Research
- Xiangru Lian, University of Rochester
Contact
All questions about submissions should be emailed to Parijat Dube (pdubeATus.ibm.com).