Parallel Iterations
Taskflow provides standard template methods for performing parallel iterations over a range of items on a CUDA GPU.
Include the Header
You need to include the header file taskflow/cuda/algorithm/for_each.hpp to use the parallel-iteration algorithms.
#include <taskflow/cuda/algorithm/for_each.hpp>
Index-based Parallel Iterations
Index-based parallel-for performs parallel iterations over a range [first, last) with the given step size. The task created by tf::cuda_for_each_index represents a kernel of the following loop:
// positive step: first, first+step, first+2*step, ...
for(auto i=first; i<last; i+=step) {
  callable(i);
}

// negative step (step < 0): first, first+step, first+2*step, ...
for(auto i=first; i>last; i+=step) {
  callable(i);
}
Each iteration i is independent of the others and is assigned one kernel thread to run the callable. The following example creates a kernel that assigns each entry of data to 1 over the range [0, 100) with step size 1.
tf::cudaDefaultExecutionPolicy policy;
auto data = tf::cuda_malloc_shared<int>(100);

// assigns each element in data to 1 over the range [0, 100) with step size 1
tf::cuda_for_each_index(
  policy, 0, 100, 1,
  [data] __device__ (int idx) { data[idx] = 1; }
);

// synchronize the execution
policy.synchronize();
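The step size can also be negative, in which case the iteration proceeds backwards from first toward last. The following sketch is an illustrative variation of the example above (not part of the original example) that assigns every other element to 2 in reverse order:

tf::cudaDefaultExecutionPolicy policy;
auto data = tf::cuda_malloc_shared<int>(100);

// visits indices 99, 97, ..., 1 over the range (0, 99] with step size -2
tf::cuda_for_each_index(
  policy, 99, 0, -2,
  [data] __device__ (int idx) { data[idx] = 2; }
);

// synchronize the execution
policy.synchronize();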
The parallel-iteration algorithm runs asynchronously through the stream specified in the execution policy. You need to synchronize the stream to obtain correct results.
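If you manage CUDA streams yourself, you can construct the execution policy over an existing stream so the algorithm is enqueued there. The sketch below assumes the policy constructor that accepts a cudaStream_t, as provided in recent Taskflow releases:

cudaStream_t stream;
cudaStreamCreate(&stream);

// the algorithm runs asynchronously on the given stream
tf::cudaDefaultExecutionPolicy policy(stream);
auto data = tf::cuda_malloc_shared<int>(100);

tf::cuda_for_each_index(
  policy, 0, 100, 1,
  [data] __device__ (int idx) { data[idx] = 1; }
);

// synchronizing the policy waits on the underlying stream
policy.synchronize();
cudaStreamDestroy(stream);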
Iterator-based Parallel Iterations
Iterator-based parallel-for performs parallel iterations over a range specified by two STL-styled iterators, first and last. The task created by tf::cuda_for_each represents a kernel of the following loop:
for(auto i=first; i<last; i++) {
  callable(*i);
}
The two iterators, first and last, are typically raw pointers to the first element and one past the last element of the range in GPU memory space. The following example creates a for_each kernel that assigns each element in data to 1 over the range [data, data + 1000).
tf::cudaDefaultExecutionPolicy policy;
auto data = tf::cuda_malloc_shared<int>(1000);

// assigns each element in data to 1 over the range [data, data + 1000)
tf::cuda_for_each(
  policy, data, data + 1000,
  [] __device__ (int& item) { item = 1; }
);

// synchronize the execution
policy.synchronize();
Each iteration is independent of the others and is assigned one kernel thread to run the callable. Since the callable runs on the GPU, it must be declared with a __device__ specifier. Note that passing a __device__ lambda from host code requires compiling with nvcc and its extended-lambda option enabled.
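After synchronization, memory allocated with tf::cuda_malloc_shared is directly accessible from the host, so you can verify the result and release the buffer. The host-side check below is an illustrative sketch, not part of the original example:

// verify the result on the host; valid because the memory is shared
// (requires <cassert>)
for(int i = 0; i < 1000; i++) {
  assert(data[i] == 1);
}

// release the shared memory block
tf::cuda_free(data);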