The graphics processing unit (GPU) has become central to today's scientific computing applications, such as machine learning and simulation. As application complexity continues to grow, the need to quickly execute thousands of dependent GPU tasks has become a major bottleneck in the development flow. To overcome this challenge, modern CUDA has introduced CUDA Graph, which lets users offload an entire task graph onto the GPU to minimize scheduling overheads. However, programming CUDA Graph directly is extremely tedious and involves many low-level details that are difficult to get right. Consequently, this talk introduces cudaFlow, a modern C++ programming model that streamlines the building of large GPU workloads atop CUDA Graph. cudaFlow enables efficient implementations of GPU decomposition strategies, together with incremental update methods, to express complex GPU algorithms that are hard to execute efficiently with mainstream stream-based models. We have demonstrated the simplicity and efficiency of cudaFlow on large-scale GPU applications composed of thousands of tasks and dependencies.
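To illustrate the scheduling overhead CUDA Graph addresses, the sketch below (untested, using the standard CUDA Runtime graph API rather than cudaFlow itself; `my_kernel` and the loop count are placeholders) captures a sequence of dependent kernel launches into a graph once, then replays the whole graph with a single launch call:

```cuda
// Sketch: stream capture into a CUDA Graph, so thousands of dependent
// kernel launches are scheduled with one cudaGraphLaunch call instead of
// paying per-kernel launch overhead on every iteration.
#include <cuda_runtime.h>

__global__ void my_kernel(float* data, int n) {  // hypothetical workload
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1.0f;
}

int main() {
  const int N = 1 << 20;
  float* d_data;
  cudaMalloc(&d_data, N * sizeof(float));

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Record the launches into a graph instead of submitting them eagerly;
  // the driver captures the dependency structure for us.
  cudaGraph_t graph;
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  for (int step = 0; step < 1000; ++step) {
    my_kernel<<<(N + 255) / 256, 256, 0, stream>>>(d_data, N);
  }
  cudaStreamEndCapture(stream, &graph);

  // Instantiate once, then launch the entire 1000-kernel graph at once.
  cudaGraphExec_t exec;
  cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
  cudaGraphLaunch(exec, stream);
  cudaStreamSynchronize(stream);

  cudaGraphExecDestroy(exec);
  cudaGraphDestroy(graph);
  cudaStreamDestroy(stream);
  cudaFree(d_data);
}
```

Even this minimal example shows the low-level bookkeeping (capture modes, instantiation, explicit destruction) that a higher-level C++ model like cudaFlow is designed to hide.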
The talk will cover five major components:
1. What is the new CUDA Graph programming model?
2. Why do we need a C++ programming model for GPU task graph parallelism?
3. Designs, implementations, and deployments of the proposed cudaFlow programming model.
4. Real use cases of cudaFlow and its performance advantages in large GPU workloads.
5. Remarks and roadmap suggestions for the GPU programming community.
By the end of the presentation, the audience will know how to leverage the new GPU task graph parallelism to boost the performance of large-scale GPU applications, such as machine learning and scientific simulations.
PUBLICATION PERMISSIONS: CppCon Organizer provided Coding Tech with the permission to republish CppCon Tech Talks.
CREDITS: CppCon YouTube channel: https://www.youtube.com/channel/UCMlGfpWw-RUdWX_JbLCukXg