tf::cudaFlowRoundRobinOptimizer class

class to capture a CUDA graph using a round-robin algorithm

A round-robin capturing algorithm levelizes the user-described graph and assigns streams to nodes in round-robin order, level by level. The algorithm is based on the following paper published in Euro-Par 2021:

  • Dian-Lun Lin and Tsung-Wei Huang, "Efficient GPU Computation using Task Graph Parallelism," European Conference on Parallel and Distributed Computing (Euro-Par), 2021

The round-robin optimization algorithm is best suited for large cudaFlow graphs that comprise hundreds or thousands of GPU operations (e.g., kernels and memory copies), many of which can run in parallel. You can configure the number of streams used by the optimizer to adjust the maximum kernel concurrency in the captured CUDA graph.
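The levelize-then-assign idea described above can be illustrated with a minimal, CUDA-free sketch. This is not the library's implementation: the function name round_robin_streams and the adjacency-list input are hypothetical, the graph is assumed to be acyclic, num_streams is assumed to be at least 1, and the round-robin counter is assumed to restart at each level.

```cpp
#include <algorithm>
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical helper (not part of Taskflow): for each node of a DAG, returns
// the index of the stream the node would be assigned to.
// succ[u] lists the successors of node u; num_streams must be >= 1.
std::vector<std::size_t> round_robin_streams(
    const std::vector<std::vector<std::size_t>>& succ,
    std::size_t num_streams) {
  const std::size_t n = succ.size();
  std::vector<std::size_t> indegree(n, 0), level(n, 0);
  for (const auto& out : succ) {
    for (std::size_t v : out) ++indegree[v];
  }

  // Levelize with Kahn's algorithm: a node's level is the length of the
  // longest path from any source node to it.
  std::queue<std::size_t> ready;
  for (std::size_t u = 0; u < n; ++u) {
    if (indegree[u] == 0) ready.push(u);
  }
  while (!ready.empty()) {
    const std::size_t u = ready.front();
    ready.pop();
    for (std::size_t v : succ[u]) {
      level[v] = std::max(level[v], level[u] + 1);
      if (--indegree[v] == 0) ready.push(v);
    }
  }

  // Assign streams in round-robin order within each level.
  // Assumption: the counter restarts from stream 0 at every level.
  std::vector<std::size_t> next_in_level(n, 0);  // a DAG has fewer than n levels
  std::vector<std::size_t> stream(n, 0);
  for (std::size_t u = 0; u < n; ++u) {
    stream[u] = next_in_level[level[u]]++ % num_streams;
  }
  return stream;
}
```

In this sketch, nodes on the same level never depend on each other, so spreading them across streams exposes their parallelism, while the number of streams caps how many of them can be in flight at once.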

Constructors, destructors, conversion operators

cudaFlowRoundRobinOptimizer() defaulted
constructs a round-robin optimizer with 4 streams by default
cudaFlowRoundRobinOptimizer(size_t num_streams) explicit
constructs a round-robin optimizer with the given number of streams

Public functions

auto num_streams() const -> size_t
queries the number of streams used by the optimizer
void num_streams(size_t n)
sets the number of streams used by the optimizer