Execution Policy
Taskflow provides standalone template methods for expressing common parallel algorithms on a GPU. Each of these methods takes an execution policy object that configures the kernel execution parameters.
Include the Header
You need to include the header file, taskflow/cuda/cudaflow.hpp, for creating a CUDA execution policy object.
#include <taskflow/cuda/cudaflow.hpp>
Parameterize Performance
Taskflow parameterizes most CUDA algorithms in terms of the number of threads per block and units of work per thread, which can be specified in the execution policy template type, tf::
Define an Execution Policy
The following example defines an execution policy object, policy, which configures (1) each block to launch 512 threads and (2) each of these 512 threads to perform 11 units of work. The block size must be a power of two. It is always a good idea to specify an odd number for the second parameter to avoid bank conflicts.
tf::cudaExecutionPolicy<512, 11> policy;
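The advice about odd values can be checked with a little arithmetic. The sketch below is plain host-side C++, not part of Taskflow. It assumes that shared memory has 32 banks of 4-byte words, that a warp has 32 threads, and that thread t reads the word at index t * stride, where the stride plays the role of the units of work per thread. The helper names distinct_banks and conflict_degree are hypothetical.

```cpp
#include <cassert>
#include <set>

// Assumption: 32 shared-memory banks of 4-byte words and 32 threads per warp,
// as on most current NVIDIA GPUs.
constexpr unsigned kNumBanks = 32;
constexpr unsigned kWarpSize = 32;

// Hypothetical helper: number of distinct banks touched when thread t of a
// warp reads the 4-byte word at index t * stride.
inline unsigned distinct_banks(unsigned stride) {
  assert(stride > 0);
  std::set<unsigned> banks;
  for (unsigned t = 0; t < kWarpSize; ++t) {
    banks.insert((t * stride) % kNumBanks);
  }
  return static_cast<unsigned>(banks.size());
}

// Number of threads that map to the same bank (1 means conflict-free).
// The division is exact because multiples of the stride modulo 32 are
// spread evenly over the banks they reach.
inline unsigned conflict_degree(unsigned stride) {
  return kWarpSize / distinct_banks(stride);
}
```

Under these assumptions, a stride of 11 reaches all 32 banks, so no two threads collide, whereas a stride of 12 reaches only 8 banks and serializes the access four ways. Any odd stride is coprime with 32, which is why an odd second parameter is a safe default.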
By default, the execution policy object is associated with the CUDA default stream (i.e., 0). The default stream can incur significant overhead because of its global synchronization. You can associate an execution policy with another stream as shown below:
// create a RAII-styled stream object
tf::cudaStream stream1, stream2;
// assign a stream to a policy at construction time
tf::cudaExecutionPolicy<512, 11> policy(stream1);
// assign another stream to the policy
policy.stream(stream2);
All the CUDA standard algorithms in Taskflow are asynchronous with respect to the stream assigned to the execution policy. This enables high execution efficiency for large GPU workloads that call for many different algorithms. You can synchronize the stream to block until all tasks in the stream finish:
cudaStreamSynchronize(policy.stream());
The best-performing configurations for each algorithm, each GPU architecture, and each data type can vary significantly. You should experiment with different configurations to find the optimal tuning parameters for your applications. A default policy is given in tf::
tf::cudaDefaultExecutionPolicy default_policy;
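When comparing candidate configurations, it can help to see how the two template parameters change the launch geometry for a given input size. The following is a minimal sketch in plain C++, assuming that one block processes one tile of (threads per block) x (units of work per thread) elements. TileConfig and its members are hypothetical names used for illustration and are not Taskflow's class.

```cpp
#include <cstddef>

// Hypothetical stand-in for an execution-policy configuration. It only
// mirrors the two template parameters and is not Taskflow's implementation.
template <unsigned NT, unsigned VT>
struct TileConfig {
  static_assert(NT != 0 && (NT & (NT - 1)) == 0,
                "block size must be a power of two");

  static constexpr unsigned threads_per_block = NT;
  static constexpr unsigned work_per_thread = VT;
  static constexpr unsigned items_per_block = NT * VT;

  // Assumption: each block handles one tile of NT * VT elements, so the
  // number of blocks is the input size divided by the tile size, rounded up.
  static constexpr std::size_t num_blocks(std::size_t n) {
    return (n + items_per_block - 1) / items_per_block;
  }
};
```

Under this assumption, TileConfig<512, 11> covers 5632 elements per block and needs 178 blocks for one million elements, while TileConfig<256, 7> covers 1792 elements per block and needs 559. Such numbers only describe the launch shape; actual performance still has to be measured on the target GPU.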
Allocate Memory Buffer for Algorithms
A key difference between our CUDA standard algorithms and others (e.g., Thrust) is memory management. Unlike CPU-parallel algorithms, many GPU-parallel algorithms require an extra buffer to store temporary results during multi-phase computation, for instance, tf::