Tags: Heterogeneous computing, Scheduling and optimization, Task graph parallelism and task graph programming
Abstract:
CUDA recently introduced a new task graph programming model, CUDA Graph, to enable efficient launch and execution of GPU work. Users describe a GPU workload as a task graph rather than a sequence of aggregated GPU operations, allowing the CUDA runtime to perform whole-graph optimization and significantly reduce kernel launch overhead. However, programming CUDA graphs is extremely challenging. Users must either explicitly construct a graph with verbose parameter settings or implicitly capture one, which requires complex dependency and concurrency management using streams and events. To overcome this challenge, we introduce a lightweight task graph programming framework that enables efficient GPU computation using CUDA Graph. Users can focus on the high-level development of dependent GPU operations while leaving the intricate management of stream concurrency and event dependency to our optimization algorithm. We have evaluated our framework on both micro-benchmarks and a large-scale machine learning workload and demonstrated promising performance. The results also show that our optimization algorithm achieves performance comparable to that of an optimally constructed graph while consuming far fewer GPU resources.
Efficient GPU Computation Using Task Graph Parallelism
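As background for the implicit-capture approach the abstract mentions, the following is a minimal sketch of building a CUDA graph via stream capture and relaunching it. The kernels, sizes, and iteration count are hypothetical; only the CUDA Graph API calls (`cudaStreamBeginCapture`, `cudaStreamEndCapture`, `cudaGraphInstantiate`, `cudaGraphLaunch`) are taken as given, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernels standing in for two dependent GPU operations.
__global__ void kernelA(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = 1.0f;
}
__global__ void kernelB(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 2.0f;
}

int main() {
    const int n = 1 << 20;
    float* d;
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Implicit graph construction: work submitted to the stream between
    // BeginCapture and EndCapture is recorded into a graph, not executed.
    // kernelB depends on kernelA through stream ordering; expressing
    // cross-stream concurrency here would require manual event management.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    kernelA<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    kernelB<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    cudaStreamEndCapture(stream, &graph);

    // Instantiate once, then relaunch cheaply many times -- this single
    // launch per iteration replaces one launch per kernel.
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
    for (int iter = 0; iter < 100; ++iter)
        cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d);
    return 0;
}
```

With only two kernels in stream order this is manageable; the difficulty the abstract describes arises when independent operations must run concurrently, since capture then requires explicit multi-stream coordination with events.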