Parallel Iterations
tf::cudaFlow provides two template methods, tf::cudaFlow::for_each and tf::cudaFlow::for_each_index, for creating tasks that perform parallel iterations over a range of items.
Include the Header
You need to include the header file, taskflow/cuda/algorithm/for_each.hpp, for creating a parallel-iteration task.
#include <taskflow/cuda/algorithm/for_each.hpp>
Index-based Parallel Iterations
Index-based parallel-for performs parallel iterations over a range [first, last) with the given step size. The task created by tf::cudaFlow::for_each_index represents a kernel that runs the following loop:
// positive step: first, first+step, first+2*step, ...
for(auto i=first; i<last; i+=step) {
  callable(i);
}

// negative step: first, first+step, first+2*step, ... (the sequence decreases)
for(auto i=first; i>last; i+=step) {
  callable(i);
}
Each iteration i is independent of the others and is assigned one kernel thread to run the callable. Since the callable runs on the GPU, it must be declared with a __device__ specifier. The following example creates a kernel that assigns each entry of gpu_data to 1 over the range [0, 100) with step size 1.
// assigns each element in gpu_data to 1 over the range [0, 100) with step size 1
cudaflow.for_each_index(0, 100, 1, [gpu_data] __device__ (int idx) {
  gpu_data[idx] = 1;
});
Iterator-based Parallel Iterations
Iterator-based parallel-for performs parallel iterations over a range specified by two STL-styled iterators, first and last. The task created by tf::cudaFlow::for_each represents a kernel that runs the following loop:
for(auto i=first; i<last; i++) {
  callable(*i);
}
The two iterators, first and last, are typically raw pointers to the first element and to one past the last element of a range in GPU memory space. The following example creates a for_each kernel that assigns each element in gpu_data to 1 over the range [gpu_data, gpu_data + 1000).
// assigns each element to 1 over the range [gpu_data, gpu_data + 1000)
cudaflow.for_each(gpu_data, gpu_data + 1000, [] __device__ (int& item) {
  item = 1;
});
As with the index-based version, each iteration is independent and is assigned one kernel thread to run the callable, which must be declared with a __device__ specifier.
Miscellaneous Items
The parallel-iteration algorithms are also available in tf::cudaFlowCapturer.