Parallel Iterations

Taskflow provides template functions for constructing tasks to perform parallel iterations over ranges of items.

Include the Header

You need to include the header file, taskflow/algorithm/for_each.hpp, for using parallel-iteration algorithms.

#include <taskflow/algorithm/for_each.hpp>

Create an Index-based Parallel-Iteration Task

Index-based parallel-for performs parallel iterations over a range [first, last) with the given step size. The task created by tf::Taskflow::for_each_index(B first, E last, S step, C callable, P part) represents parallel execution of the following loop:

// positive step
for(auto i = first; i < last; i += step) {
  callable(i);
}

// negative step
for(auto i = first; i > last; i += step) {
  callable(i);
}

We support only integer-based ranges. The range can go in a positive or negative direction.

taskflow.for_each_index(0, 100, 2, [](int i) { }); // 50 iterations with a positive step
taskflow.for_each_index(100, 0, -2, [](int i) { }); // 50 iterations with a negative step

Notice that the direction is defined in terms of the half-open range [first, last), where last is excluded. In the positive case, the 50 items are 0, 2, 4, 6, ..., 96, 98. In the negative case, the 50 items are 100, 98, 96, 94, ..., 4, 2. An example of the Taskflow graph for the positive case under 5 workers is depicted below:

Capture Indices by Reference

You can pass indices by reference using std::ref to marshal parameter updates between dependent tasks. This is useful when the range is not known at the time the task graph is constructed but is initialized by an upstream task.

int first, last;
int* vec;
auto init = taskflow.emplace([&](){
  first = 0;
  last  = 1000;
  vec   = new int[1000];
});
auto pf = taskflow.for_each_index(
  std::ref(first), std::ref(last), 1,
  [&](int i) {
    std::cout << "parallel iteration on index " << i << '\n';
  }
);
// The code below is wrong! first and last are captured by copy at construction time
// auto pf = taskflow.for_each_index(first, last, 1, [&](int i) { });
init.precede(pf);

When init finishes, the parallel-for task pf will see first as 0 and last as 1000 and perform parallel iterations over the 1000 indices.

Create an IndexRange-based Parallel-Iteration Task

While tf::Taskflow::for_each_index gives each worker a single scalar index, tf::Taskflow::for_each_by_index gives each worker a contiguous sub-range (or sub-box in multiple dimensions) of the iteration space. This distinction is significant: instead of processing one element at a time, the callable operates on an entire block of consecutive indices in a single invocation, enabling block algorithms that exploit data locality, facilitate SIMD vectorisation, and reduce scheduling overhead.

What is an IndexRange?

A tf::IndexRange describes a typed, half-open index range [begin, end) with a step size. There are two variants:

  • tf::IndexRange<T> (or equivalently tf::IndexRange<T, 1>): A 1D range of integral indices. It is defined by three parameters: begin, end, and step_size. The elements are begin, begin+step, begin+2*step, ... up to (but not including) end.
// elements: 2, 5, 8, 11 (step = 3, size = 4)
tf::IndexRange<int> r(2, 14, 3);
r.begin(); // 2
r.end(); // 14
r.step_size(); // 3
r.size(); // 4
  • tf::IndexRange<T, N>: An N-dimensional Cartesian product of N independent 1D ranges, one per dimension. Dimension 0 is the outermost (slowest varying) and dimension N-1 is the innermost (fastest varying), matching the natural nesting of C-style for-loops (row-major order).
// 2D range: i in [0,3), j in [0,4) — a 3x4 grid of 12 index pairs
tf::IndexRange<int, 2> r(
  tf::IndexRange<int>(0, 3, 1), // dim 0: i = 0, 1, 2
  tf::IndexRange<int>(0, 4, 1)  // dim 1: j = 0, 1, 2, 3
);
r.size(); // 12 (3 x 4)
r.size(0); // 3 (elements along dim 0)
r.size(1); // 4 (elements along dim 1)
r.dim(0); // IndexRange<int>(0, 3, 1)
r.dim(1); // IndexRange<int>(0, 4, 1)

The key advantage of tf::IndexRange<T, N> over a raw scalar index is that the callable receives the full sub-range (or sub-box) at once, allowing it to implement cache-friendly block algorithms, apply SIMD over contiguous indices, or exploit any other block-level optimisation.

Create a Parallel-Iteration Task over a 1D IndexRange

Passing a 1D tf::IndexRange<T> to tf::Taskflow::for_each_by_index creates a parallel task whose callable receives one contiguous sub-range of the original range per invocation:

tf::IndexRange<int> range(0, 100, 1);
taskflow.for_each_by_index(range, [](tf::IndexRange<int> sub) {
// sub is a contiguous slice of [0, 100)
for(int i = sub.begin(); i < sub.end(); i += sub.step_size()) {
process(i);
}
});

Because the callable sees the full sub-range at once, it can implement cache-friendly block algorithms that would be impossible when receiving a single index at a time. For instance, a worker assigned [32, 64) can prefetch that entire cache line before processing, rather than incurring per-element dispatch overhead.

Create a Parallel-Iteration Task over a Multi-dimensional IndexRange

Passing an N-dimensional tf::IndexRange<T, N> to tf::Taskflow::for_each_by_index creates a parallel task whose callable receives one orthogonal sub-box of the full ND space per invocation. A sub-box is a valid hyperrectangle — a contiguous, axis-aligned tile that preserves the original step sizes of every dimension. The figure below shows a 2x3x4 range decomposed into two sub-boxes (one per outer-dimension value), each a 3x4 slice:

For a 2D range, the parallel task represents:

tf::IndexRange<int, 2> range(
  tf::IndexRange<int>(0, rows, 1), // dimension 0: rows
  tf::IndexRange<int>(0, cols, 1)  // dimension 1: columns
);
taskflow.for_each_by_index(range, [](const tf::IndexRange<int, 2>& tile) {
  // tile is an orthogonal sub-box of the full 2D range
  for(int i = tile.dim(0).begin(); i < tile.dim(0).end(); i += tile.dim(0).step_size()) {
    for(int j = tile.dim(1).begin(); j < tile.dim(1).end(); j += tile.dim(1).step_size()) {
      process(i, j);
    }
  }
});

Each sub-box preserves the original step sizes for every dimension, so the inner loops are identical to what you would write for sequential code. The same pattern extends naturally to three or more dimensions:

tf::IndexRange<int, 3> range(
  tf::IndexRange<int>(0, depth, 1),
  tf::IndexRange<int>(0, height, 1),
  tf::IndexRange<int>(0, width, 1)
);
taskflow.for_each_by_index(range, [](const tf::IndexRange<int, 3>& sub) {
  for(int i = sub.dim(0).begin(); i < sub.dim(0).end(); i += sub.dim(0).step_size()) {
    for(int j = sub.dim(1).begin(); j < sub.dim(1).end(); j += sub.dim(1).step_size()) {
      for(int k = sub.dim(2).begin(); k < sub.dim(2).end(); k += sub.dim(2).step_size()) {
        process(i, j, k);
      }
    }
  }
});

Zero-size Dimensions

When a dimension has zero size, it and all dimensions inner to it produce no iterations — but outer dimensions are unaffected. This matches the behaviour of sequential nested loops: a zero-size inner loop simply never executes, while the outer loops still run.

// The middle dimension has zero iterations (j_begin == j_end).
// The outer i-loop still has work; the j/k body never executes.
tf::IndexRange<int, 3> range(
  tf::IndexRange<int>(0, 100, 1), // i: 100 iterations
  tf::IndexRange<int>(0, 0, 1),   // j: 0 iterations — body never runs
  tf::IndexRange<int>(0, 100, 1)  // k: 100 iterations (never reached)
);
// range.size() == 100 (outer i-loop only)
taskflow.for_each_by_index(range, [](const tf::IndexRange<int, 3>& sub) {
  // Each sub-box covers a slice of the active i-dimension.
  // Only process(i) is ever called — the j-loop never executes,
  // so process(i, j) and process(i, j, k) are never reached.
  for(int i = sub.dim(0).begin(); i < sub.dim(0).end(); ++i) {
    process(i); // <-- called for every i
    for(int j = sub.dim(1).begin(); j < sub.dim(1).end(); ++j) { // never entered
      process(i, j); // <-- never called
      for(int k = sub.dim(2).begin(); k < sub.dim(2).end(); ++k) {
        process(i, j, k); // <-- never called
      }
    }
  }
});

This is equivalent to the sequential nested loop:

for(int i = 0; i < 100; ++i) {
  process(i); // <-- called for every i
  for(int j = 0; j < 0; ++j) { // zero iterations — body skipped
    process(i, j); // <-- never called
    for(int k = 0; k < 100; ++k) {
      process(i, j, k); // <-- never called
    }
  }
}
Note
This differs from OpenMP's collapse clause, which treats the entire iteration space as a Cartesian product and yields zero total iterations when any dimension is empty. Taskflow preserves the sequential semantics of nested loops: outer dimensions always execute independently of inner ones.

Understand the Scheduling Algorithm

Taskflow schedules parallel iterations over an index range by mapping the N-dimensional space to a 1D flat index space and distributing chunks of that flat space among workers. Specifically, the full ND range is first linearized in row-major order into a flat index space whose size equals range.size(). For a 3x4 2D range, this looks like:

Then, workers claim contiguous chunks of the flat index space, either pre-assigned (static partitioner) or on demand via an atomic cursor (dynamic partitioners). During the assignment, each flat chunk boundary coincides with a hyperplane boundary, i.e., a suffix product of the dimension sizes (1, dim[N-1], dim[N-1] x dim[N-2], ...). This is the critical constraint: only boundaries that align to a complete inner row (or slice, or hyper-slice) produce a valid rectangular sub-box. The figure below illustrates why alignment matters for a 3x4 grid:

A cut at flat index 6 (mid-row) would split row i=1 between two workers, producing an L-shaped non-rectangular region that cannot be expressed as a single IndexRange sub-box. A cut at flat index 8 (row boundary) produces two clean rectangular sub-boxes. Taskflow enforces this automatically. The chunk_size hint is always snapped up to the next hyperplane boundary via IndexRange::ceil():

// For a 3x4 range with 3 workers:
// raw N/W = 12/3 = 4 -> r.ceil(4) = 4 (already a row boundary)
// For a 3x5 range with 2 workers:
// raw N/W = 15/2 = 7 -> r.ceil(7) = 10 (next row boundary = 2 full rows)
Static vs dynamic partitioners

With a static partitioner, each worker's chunk is pre-assigned before execution and never changes. The figure below shows a 3x4 range evenly split among 3 workers, each receiving exactly one aligned row:

With a dynamic partitioner, workers pull chunks from a shared atomic cursor. The chunk may overshoot to the next hyperplane boundary when the requested size does not align exactly, but the cursor self-corrects so no element is visited twice.

Note
You do not need to think about alignment explicitly. Taskflow handles it automatically — the chunk_size you pass to the partitioner is a hint, and the scheduler snaps it up to the next valid boundary before distributing work.

Capture Range by Reference

When the range bounds are not known at task-graph construction time, pass the range by std::ref so that an upstream task can set the bounds before the parallel loop runs. This works for both 1D and multi-dimensional ranges.

tf::IndexRange<int, 2> range(
  tf::IndexRange<int>(0, 0, 1), // placeholder — filled by init
  tf::IndexRange<int>(0, 0, 1)  // placeholder — filled by init
);
auto init = taskflow.emplace([&](){
  range.dim(0).reset(0, rows, 1);
  range.dim(1).reset(0, cols, 1);
});
auto pf = taskflow.for_each_by_index(
  std::ref(range),
  [](const tf::IndexRange<int, 2>& tile) {
    for(int i = tile.dim(0).begin(); i < tile.dim(0).end(); i += tile.dim(0).step_size()) {
      for(int j = tile.dim(1).begin(); j < tile.dim(1).end(); j += tile.dim(1).step_size()) {
        process(i, j);
      }
    }
  }
);
// The code below is wrong! range is captured by copy at construction time
// auto pf = taskflow.for_each_by_index(range, callable);
init.precede(pf);

When init finishes, pf reads the updated range and partitions the rows x cols space among workers.

Create an Iterator-based Parallel-Iteration Task

Iterator-based parallel-for performs parallel iterations over a range specified by two STL-styled iterators, first and last. The task created by tf::Taskflow::for_each(B first, E last, C callable, P part) represents parallel execution of the following loop:

for(auto i = first; i != last; i++) {
  callable(*i);
}

tf::Taskflow::for_each simultaneously applies the callable to the object obtained by dereferencing every iterator in the range [first, last). It is the user's responsibility to ensure the range is valid within the execution of the parallel-for task. Iterators must have the post-increment operator ++ defined.

std::vector<int> vec = {1, 2, 3, 4, 5};
taskflow.for_each(vec.begin(), vec.end(), [](int i) {
  std::cout << "parallel for on item " << i << '\n';
});

Capture Iterators by Reference

Similar to index-based parallel-for, iterators can be passed by reference using std::ref so that one task can set up the range before another task performs the parallel-for.

std::vector<int> vec;
std::vector<int>::iterator first, last;
auto init = taskflow.emplace([&](){
  vec.resize(1000);
  first = vec.begin();
  last  = vec.end();
});
auto pf = taskflow.for_each(
  std::ref(first), std::ref(last),
  [](int i) {
    std::cout << "parallel iteration on item " << i << '\n';
  }
);
// The code below is wrong! first and last are captured by copy at construction time
// auto pf = taskflow.for_each(first, last, [](int i) { });
init.precede(pf);

When init finishes, pf will see first pointing to the beginning of vec and last pointing to the end of vec, and perform parallel iterations over the 1000 items.

Configure a Partitioner

A partitioner controls how the iteration space is divided among workers. Taskflow provides four partitioners, each suited to different workload characteristics:

  • tf::StaticPartitioner divides the range into equal-sized chunks ahead of execution and assigns them to workers in order. It has the lowest scheduling overhead and delivers the best performance when every iteration costs roughly the same amount of work.
  • tf::DynamicPartitioner distributes fixed-sized chunks to workers on demand as they become available. It adapts well to workloads where iteration cost varies, at the expense of slightly higher coordination overhead.
  • tf::GuidedPartitioner distributes chunks whose size decreases adaptively as work is consumed — large chunks early to reduce overhead, smaller chunks late to balance the tail. This is the default partitioner and delivers stable, near-optimal performance across a wide range of workloads.
  • tf::RandomPartitioner distributes chunks of randomly sampled sizes, which can help avoid systematic load imbalances caused by data-dependent cost patterns.

The following example creates two parallel-iteration tasks with different partitioners:

std::vector<int> vec(1024, 0);
tf::StaticPartitioner static_partitioner(10); // chunk size of 10
tf::GuidedPartitioner guided_partitioner(10); // minimum chunk size of 10
taskflow.for_each(
  vec.begin(), vec.end(),
  [](int i) { std::cout << "static: " << i << '\n'; },
  static_partitioner
);
taskflow.for_each(
  vec.begin(), vec.end(),
  [](int i) { std::cout << "guided: " << i << '\n'; },
  guided_partitioner
);

As a rule of thumb, prefer tf::StaticPartitioner for uniform workloads (e.g., element-wise arithmetic on arrays) and tf::GuidedPartitioner for irregular workloads (e.g., graph traversal, variable-length processing). tf::DynamicPartitioner is a good choice when chunks must be kept small and strictly equal in size.

Note
By default, parallel-iteration tasks use tf::DefaultPartitioner (currently tf::GuidedPartitioner) if no partitioner is specified.