tf::cudaFlow class

class to create a cudaFlow task dependency graph

A cudaFlow is a high-level interface over CUDA Graph to perform GPU operations using the task dependency graph model. The class provides a set of methods for creating and launching different tasks on one or multiple CUDA devices, such as kernel tasks, data transfer tasks, and memory operation tasks. The following example creates a cudaFlow of two kernel tasks, task1 and task2, where task1 runs before task2.

tf::cudaStream stream;
tf::cudaFlow cf;

// create two kernel tasks
tf::cudaTask task1 = cf.kernel(grid1, block1, shm_size1, kernel1, args1);
tf::cudaTask task2 = cf.kernel(grid2, block2, shm_size2, kernel2, args2);

// kernel1 runs before kernel2
task1.precede(task2);

// create an executable graph from the cudaflow
tf::cudaGraphExec exec = cf.instantiate();

// run the executable graph through the given stream
exec.run(stream);

Please refer to GPU Tasking (cudaFlow) for details.

Base classes

template<typename Creator, typename Deleter>
class cudaGraphBase<cudaGraphCreator, cudaGraphDeleter>
class to create a CUDA graph managed by a C++ smart pointer

Constructors, destructors, conversion operators

cudaFlow() defaulted
constructs a cudaFlow
~cudaFlow() defaulted
destroys the cudaFlow

Public functions

template<typename C>
auto single_task(C c) -> cudaTask
runs a callable with only a single kernel thread
template<typename C>
void single_task(cudaTask task, C c)
updates a single-threaded kernel task
template<typename I, typename C>
auto for_each(I first, I last, C callable) -> cudaTask
applies a callable to each dereferenced element of the data array
template<typename I, typename C>
void for_each(cudaTask task, I first, I last, C callable)
updates parameters of a kernel task created from tf::cudaFlow::for_each
template<typename I, typename C>
auto for_each_index(I first, I last, I step, C callable) -> cudaTask
applies a callable to each index in the range with the step size
template<typename I, typename C>
void for_each_index(cudaTask task, I first, I last, I step, C callable)
updates parameters of a kernel task created from tf::cudaFlow::for_each_index
template<typename I, typename O, typename C>
auto transform(I first, I last, O output, C op) -> cudaTask
applies a callable to a source range and stores the result in a target range
template<typename I, typename O, typename C>
void transform(cudaTask task, I first, I last, O output, C c)
updates parameters of a kernel task created from tf::cudaFlow::transform
template<typename I1, typename I2, typename O, typename C>
auto transform(I1 first1, I1 last1, I2 first2, O output, C op) -> cudaTask
creates a task to perform parallel transforms over two ranges of items
template<typename I1, typename I2, typename O, typename C>
void transform(cudaTask task, I1 first1, I1 last1, I2 first2, O output, C c)
updates parameters of a kernel task created from tf::cudaFlow::transform
template<typename C>
auto capture(C&& callable) -> cudaTask
constructs a subflow graph through tf::cudaFlowCapturer
template<typename C>
void capture(cudaTask task, C callable)
updates the captured child graph

Function documentation

tf::cudaFlow::cudaFlow() defaulted

constructs a cudaFlow

A cudaFlow is associated with a tf::cudaGraph that manages a native CUDA graph.

template<typename C>
cudaTask tf::cudaFlow::single_task(C c)

runs a callable with only a single kernel thread

Template parameters
C callable type
Parameters
c callable to run by a single kernel thread
Returns a tf::cudaTask handle
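
A minimal usage sketch, assuming compilation with nvcc and extended device lambdas enabled (device-side printf and the stream synchronization call are used only for illustration):

tf::cudaStream stream;
tf::cudaFlow cf;

// launch exactly one GPU thread that prints a message
tf::cudaTask hello = cf.single_task([] __device__ () {
  printf("hello from a single kernel thread\n");
});

// instantiate the cudaFlow and run it through the stream
tf::cudaGraphExec exec = cf.instantiate();
exec.run(stream);
stream.synchronize();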

template<typename C>
void tf::cudaFlow::single_task(cudaTask task, C c)

updates a single-threaded kernel task

This method is similar to tf::cudaFlow::single_task but operates on an existing task.
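
For example, reusing the hello task from the sketch above, one may replace its callable with another of the same signature:

// swap in a new callable for the existing single-threaded kernel task
cf.single_task(hello, [] __device__ () {
  printf("hello again from a single kernel thread\n");
});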

template<typename I, typename C>
cudaTask tf::cudaFlow::for_each(I first, I last, C callable)

applies a callable to each dereferenced element of the data array

Template parameters
I iterator type
C callable type
Parameters
first iterator to the beginning (inclusive)
last iterator to the end (exclusive)
callable a callable object to apply to the dereferenced iterator
Returns a tf::cudaTask handle

This method is equivalent to the parallel execution of the following loop on a GPU:

for(auto itr = first; itr != last; itr++) {
  callable(*itr);
}
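
For example, the sketch below (using a hypothetical device buffer gpu_data of N integers and a tf::cudaFlow object cf) sets every element to 1:

const size_t N = 1024;
int* gpu_data {nullptr};
cudaMalloc(&gpu_data, N * sizeof(int));

// set each of the N integers in gpu_data to 1
tf::cudaTask fill = cf.for_each(
  gpu_data, gpu_data + N,
  [] __device__ (int& item) { item = 1; }
);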

template<typename I, typename C>
void tf::cudaFlow::for_each(cudaTask task, I first, I last, C callable)

updates parameters of a kernel task created from tf::cudaFlow::for_each

The type of the iterators and the callable must be the same as the task created from tf::cudaFlow::for_each.

template<typename I, typename C>
cudaTask tf::cudaFlow::for_each_index(I first, I last, I step, C callable)

applies a callable to each index in the range with the step size

Template parameters
I index type
C callable type
Parameters
first beginning index (inclusive)
last ending index (exclusive)
step step size
callable the callable to apply to each index in the range
Returns a tf::cudaTask handle

This method is equivalent to the parallel execution of the following loop on a GPU:

// step is positive [first, last)
for(auto i=first; i<last; i+=step) {
  callable(i);
}

// step is negative [first, last)
for(auto i=first; i>last; i+=step) {
  callable(i);
}
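
For example, the sketch below (using a hypothetical device buffer gpu_data of n integers and a tf::cudaFlow object cf) zeroes every element at an even index:

const int n = 1024;

// visit indices 0, 2, 4, ..., n-2 and zero the corresponding elements
tf::cudaTask strided = cf.for_each_index(
  0, n, 2,
  [gpu_data] __device__ (int i) { gpu_data[i] = 0; }
);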

template<typename I, typename C>
void tf::cudaFlow::for_each_index(cudaTask task, I first, I last, I step, C callable)

updates parameters of a kernel task created from tf::cudaFlow::for_each_index

The type of the indices and the callable must be the same as the task created from tf::cudaFlow::for_each_index.

template<typename I, typename O, typename C>
cudaTask tf::cudaFlow::transform(I first, I last, O output, C op)

applies a callable to a source range and stores the result in a target range

Template parameters
I input iterator type
O output iterator type
C unary operator type
Parameters
first iterator to the beginning of the input range
last iterator to the end of the input range
output iterator to the beginning of the output range
op the operator to apply to transform each element in the range
Returns a tf::cudaTask handle

This method is equivalent to the parallel execution of the following loop on a GPU:

while (first != last) {
  *output++ = op(*first++);
}
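
For example, the sketch below (using hypothetical device buffers x and y of n floats and a tf::cudaFlow object cf) computes y[i] = 2*x[i] + 1 for every element:

// y[i] = 2 * x[i] + 1
tf::cudaTask scale = cf.transform(
  x, x + n, y,
  [] __device__ (float v) { return 2.0f * v + 1.0f; }
);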

template<typename I, typename O, typename C>
void tf::cudaFlow::transform(cudaTask task, I first, I last, O output, C c)

updates parameters of a kernel task created from tf::cudaFlow::transform

The type of the iterators and the callable must be the same as the task created from tf::cudaFlow::transform.

template<typename I1, typename I2, typename O, typename C>
cudaTask tf::cudaFlow::transform(I1 first1, I1 last1, I2 first2, O output, C op)

creates a task to perform parallel transforms over two ranges of items

Template parameters
I1 first input iterator type
I2 second input iterator type
O output iterator type
C binary operator type
Parameters
first1 iterator to the beginning of the first input range
last1 iterator to the end of the first input range
first2 iterator to the beginning of the second input range
output iterator to the beginning of the output range
op binary operator to apply to transform each pair of items in the two input ranges
Returns a tf::cudaTask handle

This method is equivalent to the parallel execution of the following loop on a GPU:

while (first1 != last1) {
  *output++ = op(*first1++, *first2++);
}
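
For example, the sketch below (using hypothetical device buffers x, y, and z of n floats and a tf::cudaFlow object cf) computes the element-wise sum z[i] = x[i] + y[i]:

// z[i] = x[i] + y[i]
tf::cudaTask add = cf.transform(
  x, x + n, y, z,
  [] __device__ (float a, float b) { return a + b; }
);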

template<typename I1, typename I2, typename O, typename C>
void tf::cudaFlow::transform(cudaTask task, I1 first1, I1 last1, I2 first2, O output, C c)

updates parameters of a kernel task created from tf::cudaFlow::transform

The type of the iterators and the callable must be the same as the task created from tf::cudaFlow::transform.

template<typename C>
cudaTask tf::cudaFlow::capture(C&& callable)

constructs a subflow graph through tf::cudaFlowCapturer

Template parameters
C callable type constructible from std::function<void(tf::cudaFlowCapturer&)>
Parameters
callable the callable to construct a capture flow
Returns a tf::cudaTask handle

A captured subflow forms a sub-graph to the cudaFlow and can be used to capture custom (or third-party) kernels that cannot be directly constructed from the cudaFlow.

Example usage:

taskflow.emplace([&](tf::cudaFlow& cf){

  tf::cudaTask my_kernel = cf.kernel(my_arguments);

  // create a flow capturer to capture custom kernels
  tf::cudaTask my_subflow = cf.capture([&](tf::cudaFlowCapturer& capturer){
    capturer.on([&](cudaStream_t stream){
      invoke_custom_kernel_with_stream(stream, custom_arguments);
    });
  });

  my_kernel.precede(my_subflow);
});

template<typename C>
void tf::cudaFlow::capture(cudaTask task, C callable)

updates the captured child graph

The method is similar to tf::cudaFlow::capture but operates on a task of type tf::cudaTaskType::SUBFLOW. The new captured graph must be topologically identical to the original captured graph.
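
For example, reusing my_subflow from the capture example above, one may re-capture a topologically identical subflow that invokes the same custom kernel with different (hypothetical) arguments:

// re-capture the subflow with new kernel arguments but the same topology
cf.capture(my_subflow, [&](tf::cudaFlowCapturer& capturer){
  capturer.on([&](cudaStream_t stream){
    invoke_custom_kernel_with_stream(stream, other_arguments);
  });
});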