class
#include <taskflow/cuda/cuda_optimizer.hpp>
cudaFlowRoundRobinOptimizer class to capture a CUDA graph using a round-robin algorithm
A round-robin capturing algorithm levelizes the user-described graph and assigns streams to nodes in a round-robin order, level by level. The algorithm is based on the following paper published in Euro-Par 2021:
- Dian-Lun Lin and Tsung-Wei Huang, "Efficient GPU Computation using Task Graph Parallelism," European Conference on Parallel and Distributed Computing (Euro-Par), 2021
The round-robin optimization algorithm is best suited for large cudaFlow graphs that comprise hundreds or thousands of GPU operations (e.g., kernels and memory copies), many of which can run in parallel. You can configure the number of streams used by the optimizer to adjust the maximum kernel concurrency in the captured CUDA graph.
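The levelize-then-assign idea described above can be illustrated without any CUDA dependency. The following is a minimal sketch, not Taskflow's implementation: the function name `assign_streams_round_robin` and the edge-list graph representation are hypothetical, the input is assumed to be acyclic, and restarting the round-robin counter at the start of each level is an assumption of this sketch. It only computes which stream index each node would be given; a real capturer would additionally insert events to enforce dependencies across streams.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Taskflow): returns the stream index
// assigned to each node of a DAG with `num_nodes` nodes and directed
// `edges` given as (from, to) pairs. The graph is assumed to be acyclic.
std::vector<std::size_t> assign_streams_round_robin(
    std::size_t num_nodes,
    const std::vector<std::pair<std::size_t, std::size_t>>& edges,
    std::size_t num_streams) {
  assert(num_streams > 0);

  // Build the adjacency list and count incoming edges per node.
  std::vector<std::vector<std::size_t>> successors(num_nodes);
  std::vector<std::size_t> in_degree(num_nodes, 0);
  for (const auto& edge : edges) {
    successors[edge.first].push_back(edge.second);
    ++in_degree[edge.second];
  }

  // Levelize with Kahn's algorithm: a node's level is the length of the
  // longest path from any source node to it.
  std::vector<std::size_t> level(num_nodes, 0);
  std::queue<std::size_t> ready;
  for (std::size_t n = 0; n < num_nodes; ++n) {
    if (in_degree[n] == 0) ready.push(n);
  }
  std::size_t num_levels = 0;
  while (!ready.empty()) {
    const std::size_t u = ready.front();
    ready.pop();
    num_levels = std::max(num_levels, level[u] + 1);
    for (std::size_t v : successors[u]) {
      level[v] = std::max(level[v], level[u] + 1);
      if (--in_degree[v] == 0) ready.push(v);
    }
  }

  // Group nodes by level, keeping node-index order within each level.
  std::vector<std::vector<std::size_t>> levels(num_levels);
  for (std::size_t n = 0; n < num_nodes; ++n) {
    levels[level[n]].push_back(n);
  }

  // Assign streams in round-robin order, level by level.
  std::vector<std::size_t> stream(num_nodes, 0);
  for (const auto& nodes : levels) {
    for (std::size_t i = 0; i < nodes.size(); ++i) {
      stream[nodes[i]] = i % num_streams;
    }
  }
  return stream;
}
```

In this sketch, no more than `num_streams` nodes of one level can land on distinct streams, which is why the stream count bounds the achievable concurrency. For example, with edges 0→{1,2,3} and {1,2,3}→4 and two streams, the three middle nodes receive streams 0, 1, and 0; with a single stream, every node is placed on stream 0 and the graph is fully serialized.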
Constructors, destructors, conversion operators
- cudaFlowRoundRobinOptimizer() defaulted
- constructs a round-robin optimizer with 4 streams by default
- cudaFlowRoundRobinOptimizer(size_t num_streams) explicit
- constructs a round-robin optimizer with the given number of streams
Public functions
- auto num_streams() const -> size_t
- queries the number of streams used by the optimizer
- void num_streams(size_t n)
- sets the number of streams used by the optimizer