GPU Tasking (cudaFlowCapturer)
You can create a cudaFlow through stream capture, which allows you to implicitly capture a CUDA graph using a stream-based interface. Compared to explicit CUDA Graph construction (tf::cudaFlow), implicit CUDA Graph capturing (tf::cudaFlowCapturer) is more flexible for building GPU task graphs when you do not have direct access to kernel calls.
Include the Header
You need to include the header file, taskflow/cuda/cudaflow.hpp, for capturing a GPU task graph using tf::cudaFlowCapturer:
#include <taskflow/cuda/cudaflow.hpp>
Capture a cudaFlow
When your program has no access to direct kernel calls but can only invoke them through a stream-based interface (e.g., cuBLAS and cuDNN library functions), you can use tf::cudaFlowCapturer to capture those GPU operations into a CUDA graph through stream capture. You use the method tf::cudaFlowCapturer::on to capture a sequence of asynchronous GPU operations through the stream it manages. The following example captures two kernel tasks, task_1 (my_kernel_1) and task_2 (my_kernel_2), where task_1 runs before task_2.
// create a cudaFlow capturer to run a CUDA graph using stream capturing
tf::cudaFlowCapturer capturer;

// capture my_kernel_1 through a stream managed by the capturer
tf::cudaTask task_1 = capturer.on([&](cudaStream_t stream){
  my_kernel_1<<<grid_1, block_1, shm_size_1, stream>>>(my_parameters_1);
}).name("my_kernel_1");

// capture my_kernel_2 through a stream managed by the capturer
tf::cudaTask task_2 = capturer.on([&](cudaStream_t stream){
  my_kernel_2<<<grid_2, block_2, shm_size_2, stream>>>(my_parameters_2);
}).name("my_kernel_2");

// my_kernel_1 runs before my_kernel_2
task_1.precede(task_2);

// offload the captured GPU tasks using the CUDA Graph execution model
tf::cudaStream stream;
capturer.run(stream);
stream.synchronize();

// dump the cudaFlow to a DOT format through std::cout
capturer.dump(std::cout);
Common Capture Methods
tf::cudaFlowCapturer defines a set of methods for capturing common GPU operations, such as tf::cudaFlowCapturer::kernel, tf::cudaFlowCapturer::memcpy, and tf::cudaFlowCapturer::memset. For example, the following code snippet uses these pre-defined methods to construct a GPU task graph of one host-to-device copy, one kernel, and one device-to-host copy, in this order of their dependencies.
tf::cudaFlowCapturer capturer;

// copy data from host_data to gpu_data
tf::cudaTask h2d = capturer.memcpy(gpu_data, host_data, bytes)
                           .name("h2d");

// capture my_kernel to do computation on gpu_data
tf::cudaTask kernel = capturer.kernel(grid, block, shm_size, my_kernel, kernel_args)
                              .name("my_kernel");

// copy data from gpu_data to host_data
tf::cudaTask d2h = capturer.memcpy(host_data, gpu_data, bytes)
                           .name("d2h");

// build task dependencies
h2d.precede(kernel);
kernel.precede(d2h);
Create a Capturer on a Specific GPU
You can run a cudaFlow capturer on a specific GPU by switching to the context of that GPU using tf::cudaScopedDevice, following the CUDA convention of multi-GPU programming. The example below creates a cudaFlow capturer and runs it on GPU 2:
{
  // create an RAII-styled switcher to the context of GPU 2
  tf::cudaScopedDevice context(2);

  // create a cudaFlow capturer under GPU 2
  tf::cudaFlowCapturer capturer;
  // ...

  // create a stream under GPU 2 and offload the capturer to that GPU
  tf::cudaStream stream;
  capturer.run(stream);
  stream.synchronize();
}
tf::cudaScopedDevice is an RAII-styled wrapper that performs a scoped switch to the given GPU context. When the scope is destroyed, it switches back to the original context.
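As a minimal sketch of this RAII behavior (assuming the machine has at least two GPUs and the current device is GPU 0), you can observe the scoped switch through the CUDA runtime call cudaGetDevice:

int device = -1;
cudaGetDevice(&device);            // device == 0, the original context
{
  tf::cudaScopedDevice context(1); // switch to the context of GPU 1
  cudaGetDevice(&device);          // device == 1 inside the scope
}
cudaGetDevice(&device);            // device == 0 again after the scope is destroyed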
Create a Capturer from a cudaFlow
Within a parent cudaFlow, you can capture a cudaFlow to form a subflow that eventually becomes a child node in the underlying CUDA task graph. The following example defines a captured flow task2 of two dependent tasks, task2_1 and task2_2, and task2 runs after task1.
tf::cudaFlow cudaflow;

tf::cudaTask task1 = cudaflow.kernel(grid, block, shm, my_kernel, args...)
                             .name("kernel");

// task2 forms a subflow as a child node in the underlying CUDA graph
tf::cudaTask task2 = cudaflow.capture([&](tf::cudaFlowCapturer& capturer){

  // capture kernel_1 using the given stream
  tf::cudaTask task2_1 = capturer.on([&](cudaStream_t stream){
    kernel_1<<<grid1, block1, shm_size1, stream>>>(args1...);
  }).name("kernel_1");

  // capture kernel_2 using the given stream
  tf::cudaTask task2_2 = capturer.on([&](cudaStream_t stream){
    kernel_2<<<grid2, block2, shm_size2, stream>>>(args2...);
  }).name("kernel_2");

  // kernel_1 runs before kernel_2
  task2_1.precede(task2_2);

}).name("capturer");

task1.precede(task2);
Offload a cudaFlow Capturer
When you offload a cudaFlow capturer using tf::cudaFlowCapturer::run, the runtime transforms the captured graph into a native CUDA graph and an executable instance, and submits it to the given stream for asynchronous execution. You can synchronize the stream to wait for the execution to complete.
tf::cudaStream stream;

// launch a cudaFlow capturer asynchronously through a stream
capturer.run(stream);

// wait for the cudaFlow to finish
stream.synchronize();
Update a cudaFlow Capturer
Between successive offloads (i.e., executions of a cudaFlow capturer), you can update the captured tasks with a different set of parameters. Every task-creation method in tf::cudaFlowCapturer has an overload that updates the parameters of a task created by that same method.
tf::cudaStream stream;
tf::cudaFlowCapturer cf;

// create a kernel task
tf::cudaTask task = cf.kernel(grid1, block1, shm1, kernel, kernel_args_1);
cf.run(stream);
stream.synchronize();

// update the created kernel task with different parameters
cf.kernel(task, grid2, block2, shm2, kernel, kernel_args_2);
cf.run(stream);
stream.synchronize();
When you run an updated cudaFlow capturer, Taskflow first tries to update the underlying executable graph with the newly captured graph. If that update is unsuccessful, Taskflow destroys the executable graph and re-instantiates a new one from the newly captured graph.
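The same update pattern applies to the other capture methods. The sketch below assumes the analogous memcpy update overload and uses placeholder buffers (host_data_1, host_data_2, gpu_data, bytes) to redirect a created copy task to a different destination between two offloads:

tf::cudaStream stream;
tf::cudaFlowCapturer cf;

// capture a device-to-host copy into host_data_1
tf::cudaTask d2h = cf.memcpy(host_data_1, gpu_data, bytes);
cf.run(stream);
stream.synchronize();

// update the copy task to write into host_data_2 instead
cf.memcpy(d2h, host_data_2, gpu_data, bytes);
cf.run(stream);
stream.synchronize();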
Integrate a cudaFlow Capturer into Taskflow
You can create a task to enclose a cudaFlow capturer and run it from a worker thread. The usage of the capturer remains the same except that the capturer is run by a worker thread from a taskflow task. The following example runs a cudaFlow capturer from a static task:
tf::Executor executor;
tf::Taskflow taskflow;

taskflow.emplace([](){
  // create a cudaFlow capturer inside a static task
  tf::cudaFlowCapturer capturer;

  // ... capture a GPU task graph
  capturer.kernel(...);

  // run the capturer through a stream
  tf::cudaStream stream;
  capturer.run(stream);
  stream.synchronize();
});

// run the taskflow through the executor and wait for it to finish
executor.run(taskflow).wait();