tf::cudaFlow class
#include <taskflow/cuda/cudaflow.hpp>
class to create a cudaFlow task dependency graph
A cudaFlow is a high-level interface over CUDA Graph to perform GPU operations using the task dependency graph model. The class provides a set of methods for creating and launching different tasks on one or multiple CUDA devices, for instance, kernel tasks, data transfer tasks, and memory operation tasks. The following example creates a cudaFlow of two kernel tasks, task1 and task2, where task1 runs before task2.
tf::cudaStream stream;
tf::cudaFlow cf;

// create two kernel tasks
tf::cudaTask task1 = cf.kernel(grid1, block1, shm_size1, kernel1, args1);
tf::cudaTask task2 = cf.kernel(grid2, block2, shm_size2, kernel2, args2);

// kernel1 runs before kernel2
task1.precede(task2);

// create an executable graph from the cudaflow
tf::cudaGraphExec exec = cf.instantiate();

// run the executable graph through the given stream
exec.run(stream);
Please refer to GPU Tasking (cudaFlow) for details.
Base classes
-
template<typename Creator, typename Deleter>class cudaGraphBase<cudaGraphCreator, cudaGraphDeleter>
- class to create a CUDA graph managed by C++ smart pointer
Constructors, destructors, conversion operators
-
cudaFlow() defaulted
- constructs a cudaFlow
Public functions
-
template<typename C>auto single_task(C c) -> cudaTask
- runs a callable with only a single kernel thread
-
template<typename C>void single_task(cudaTask task, C c)
- updates a single-threaded kernel task
-
template<typename I, typename C>auto for_each(I first, I last, C callable) -> cudaTask
- applies a callable to each dereferenced element of the data array
-
template<typename I, typename C>void for_each(cudaTask task, I first, I last, C callable)
- updates parameters of a kernel task created from tf::cudaFlow::for_each
-
template<typename I, typename C>auto for_each_index(I first, I last, I step, C callable) -> cudaTask
- applies a callable to each index in the range with the step size
-
template<typename I, typename C>void for_each_index(cudaTask task, I first, I last, I step, C callable)
- updates parameters of a kernel task created from tf::cudaFlow::for_each_index
-
template<typename I, typename O, typename C>auto transform(I first, I last, O output, C op) -> cudaTask
- applies a callable to a source range and stores the result in a target range
-
template<typename I, typename O, typename C>void transform(cudaTask task, I first, I last, O output, C c)
- updates parameters of a kernel task created from tf::cudaFlow::transform
-
template<typename I1, typename I2, typename O, typename C>auto transform(I1 first1, I1 last1, I2 first2, O output, C op) -> cudaTask
- creates a task to perform parallel transforms over two ranges of items
-
template<typename I1, typename I2, typename O, typename C>void transform(cudaTask task, I1 first1, I1 last1, I2 first2, O output, C c)
- updates parameters of a kernel task created from tf::cudaFlow::transform
-
template<typename C>auto capture(C&& callable) -> cudaTask
- constructs a subflow graph through tf::cudaFlowCapturer
-
template<typename C>void capture(cudaTask task, C callable)
- updates the captured child graph
Function documentation
tf::cudaFlow::cudaFlow() defaulted
constructs a cudaFlow
A cudaFlow is associated with a tf::cudaGraph that manages the underlying native CUDA graph.
template<typename C>
cudaTask tf::cudaFlow::single_task(C c)
runs a callable with only a single kernel thread
Template parameters | |
---|---|
C | callable type |
Parameters | |
c | callable to run by a single kernel thread |
Returns | a tf::cudaTask handle |
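For illustration, a minimal sketch of creating a single-threaded kernel task; cf denotes a tf::cudaFlow, gpu_flag is a hypothetical int* allocated with cudaMalloc, and the device lambda assumes compilation with nvcc --extended-lambda:

// set a single device-side integer to zero using one kernel thread
tf::cudaTask init = cf.single_task(
  [gpu_flag] __device__ () { *gpu_flag = 0; }
);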
template<typename C>
void tf::cudaFlow::single_task(cudaTask task, C c)
updates a single-threaded kernel task
This method is similar to cudaFlow::single_task but operates on an existing task created from tf::cudaFlow::single_task, replacing its callable.
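Continuing the sketch above (init and gpu_flag are hypothetical names), updating the task might look like:

// replace the callable of the existing single-threaded kernel task
cf.single_task(init, [gpu_flag] __device__ () { *gpu_flag = 1; });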
template<typename I, typename C>
cudaTask tf::cudaFlow::for_each(I first, I last, C callable)
applies a callable to each dereferenced element of the data array
Template parameters | |
---|---|
I | iterator type |
C | callable type |
Parameters | |
first | iterator to the beginning (inclusive) |
last | iterator to the end (exclusive) |
callable | a callable object to apply to the dereferenced iterator |
Returns | a tf::cudaTask handle |
This method is equivalent to the parallel execution of the following loop on a GPU:
for(auto itr = first; itr != last; itr++) {
  callable(*itr);
}
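As a minimal sketch, assuming cf is a tf::cudaFlow, gpu_data points to n floats allocated with cudaMalloc, and device lambdas are enabled via nvcc --extended-lambda:

// set every element of the device array to 1.0f in parallel
tf::cudaTask fill = cf.for_each(
  gpu_data, gpu_data + n, [] __device__ (float& x) { x = 1.0f; }
);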
template<typename I, typename C>
void tf::cudaFlow::for_each(cudaTask task, I first, I last, C callable)
updates parameters of a kernel task created from tf::cudaFlow::for_each
The type of the iterators and the callable must be the same as the task created from tf::cudaFlow::for_each.
template<typename I, typename C>
cudaTask tf::cudaFlow::for_each_index(I first, I last, I step, C callable)
applies a callable to each index in the range with the step size
Template parameters | |
---|---|
I | index type |
C | callable type |
Parameters | |
first | beginning index |
last | last index |
step | step size |
callable | the callable to apply to each element in the data array |
Returns | a tf::cudaTask handle |
This method is equivalent to the parallel execution of the following loop on a GPU:
// step is positive [first, last)
for(auto i=first; i<last; i+=step) {
  callable(i);
}

// step is negative [first, last)
for(auto i=first; i>last; i+=step) {
  callable(i);
}
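A minimal sketch, reusing the hypothetical gpu_data array and integer count n from the earlier example:

// double every even-indexed element of the device array
tf::cudaTask scale = cf.for_each_index(
  0, n, 2, [gpu_data] __device__ (int i) { gpu_data[i] *= 2.0f; }
);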
template<typename I, typename C>
void tf::cudaFlow::for_each_index(cudaTask task, I first, I last, I step, C callable)
updates parameters of a kernel task created from tf::cudaFlow::for_each_index
The type of the iterators and the callable must be the same as the task created from tf::cudaFlow::for_each_index.
template<typename I, typename O, typename C>
cudaTask tf::cudaFlow::transform(I first, I last, O output, C op)
applies a callable to a source range and stores the result in a target range
Template parameters | |
---|---|
I | input iterator type |
O | output iterator type |
C | unary operator type |
Parameters | |
first | iterator to the beginning of the input range |
last | iterator to the end of the input range |
output | iterator to the beginning of the output range |
op | the operator to apply to transform each element in the range |
Returns | a tf::cudaTask handle |
This method is equivalent to the parallel execution of the following loop on a GPU:
while (first != last) {
  *output++ = op(*first++);
}
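For illustration, a minimal sketch that squares each element; gpu_x and gpu_y are hypothetical float* device arrays of n elements and cf is a tf::cudaFlow:

// gpu_y[i] = gpu_x[i] * gpu_x[i]
tf::cudaTask square = cf.transform(
  gpu_x, gpu_x + n, gpu_y, [] __device__ (float v) { return v * v; }
);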
template<typename I, typename O, typename C>
void tf::cudaFlow::transform(cudaTask task, I first, I last, O output, C c)
updates parameters of a kernel task created from tf::cudaFlow::transform
The type of the iterators and the callable must be the same as the task created from tf::cudaFlow::transform.
template<typename I1, typename I2, typename O, typename C>
cudaTask tf::cudaFlow::transform(I1 first1, I1 last1, I2 first2, O output, C op)
creates a task to perform parallel transforms over two ranges of items
Template parameters | |
---|---|
I1 | first input iterator type |
I2 | second input iterator type |
O | output iterator type |
C | binary operator type |
Parameters | |
first1 | iterator to the beginning of the first input range |
last1 | iterator to the end of the first input range |
first2 | iterator to the beginning of the second input range |
output | iterator to the beginning of the output range |
op | binary operator to apply to transform each pair of items in the two input ranges |
Returns | cudaTask handle |
This method is equivalent to the parallel execution of the following loop on a GPU:
while (first1 != last1) {
  *output++ = op(*first1++, *first2++);
}
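A minimal sketch of an element-wise addition; gpu_x, gpu_y, and gpu_z are hypothetical float* device arrays of n elements and cf is a tf::cudaFlow:

// gpu_z[i] = gpu_x[i] + gpu_y[i]
tf::cudaTask add = cf.transform(
  gpu_x, gpu_x + n, gpu_y, gpu_z,
  [] __device__ (float a, float b) { return a + b; }
);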
template<typename I1, typename I2, typename O, typename C>
void tf::cudaFlow::transform(cudaTask task, I1 first1, I1 last1, I2 first2, O output, C c)
updates parameters of a kernel task created from tf::cudaFlow::transform
The type of the iterators and the callable must be the same as the task created from tf::cudaFlow::transform.
template<typename C>
cudaTask tf::cudaFlow::capture(C&& callable)
constructs a subflow graph through tf::cudaFlowCapturer
Template parameters | |
---|---|
C | callable type constructible from std::function<void(tf::cudaFlowCapturer&)> |
Parameters | |
callable | the callable to construct a capture flow |
Returns | a tf::cudaTask handle |
A captured subflow forms a sub-graph to the cudaFlow and can be used to capture custom (or third-party) kernels that cannot be directly constructed from the cudaFlow.
Example usage:
taskflow.emplace([&](tf::cudaFlow& cf){

  tf::cudaTask my_kernel = cf.kernel(my_arguments);

  // create a flow capturer to capture custom kernels
  tf::cudaTask my_subflow = cf.capture([&](tf::cudaFlowCapturer& capturer){
    capturer.on([&](cudaStream_t stream){
      invoke_custom_kernel_with_stream(stream, custom_arguments);
    });
  });

  my_kernel.precede(my_subflow);
});
template<typename C>
void tf::cudaFlow::capture(cudaTask task, C callable)
updates the captured child graph
The method is similar to tf::cudaFlow::capture but operates on an existing task created from tf::cudaFlow::capture, updating its captured child graph.