Taskflow provides template functions for constructing tasks to perform parallel iterations over ranges of items.
You need to include the header file taskflow/algorithm/for_each.hpp to use parallel-iteration algorithms.
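A typical setup includes the core Taskflow header along with the algorithm header:

```cpp
#include <taskflow/taskflow.hpp>            // tf::Taskflow, tf::Executor
#include <taskflow/algorithm/for_each.hpp>  // parallel-iteration algorithms
```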
Index-based parallel-for performs parallel iterations over a range [first, last) with the given step size. The task created by tf::Taskflow::for_each_index(B first, E last, S step, C callable, P part) represents parallel execution of the following loop:
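In sketch form (illustrating the semantics, not the implementation), the task is equivalent to one of the following loops, depending on the sign of the step:

```cpp
// positive step: first < last
for(auto i = first; i < last; i += step) {
  callable(i);
}

// negative step: first > last
for(auto i = first; i > last; i += step) {
  callable(i);
}
```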
We support only integer-based ranges. The range can go in a positive or negative direction.
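For example, assuming a tf::Taskflow object named taskflow, the following two calls iterate 50 indices each, one in each direction:

```cpp
// 50 indices in the positive direction: 0, 2, 4, ..., 98
taskflow.for_each_index(0, 100, 2, [](int i){ /* work on index i */ });

// 50 indices in the negative direction: 100, 98, 96, ..., 2
taskflow.for_each_index(100, 0, -2, [](int i){ /* work on index i */ });
```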
Notice that the direction is defined in terms of the half-open range [first, last), where last is excluded. In the positive case, the 50 items are 0, 2, 4, 6, ..., 96, 98. In the negative case, the 50 items are 100, 98, 96, 94, ..., 4, 2. An example of the Taskflow graph for the positive case under 5 workers is depicted below:
You can pass indices by reference using std::ref to marshal parameter updates between dependent tasks. This is useful when the range is not known at the time the task graph is constructed but is initialized by an upstream task.
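A minimal sketch of this pattern, where an upstream init task allocates the data and sets the bounds before the parallel-for task pf runs:

```cpp
int* vec;
int first, last;

// init determines the range and allocates the data at run time
tf::Task init = taskflow.emplace([&](){
  first = 0;
  last  = 1000;
  vec   = new int[1000];
});

// pass first and last by reference so pf reads their final values;
// capturing them by value would snapshot them at graph-construction time
tf::Task pf = taskflow.for_each_index(std::ref(first), std::ref(last), 1,
  [&](int i){
    vec[i] = i;  // each parallel iteration touches one element
  }
);

init.precede(pf);
```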
When init finishes, the parallel-for task pf sees first as 0 and last as 1000 and performs parallel iterations over the 1000 indices.
While tf::Taskflow::for_each_index invokes its callable on a single scalar index per invocation, tf::Taskflow::for_each_by_index invokes it on a contiguous sub-range (or sub-box in multiple dimensions) of the iteration space. This distinction is significant: instead of processing one element at a time, the callable operates on an entire block of consecutive indices in a single invocation, enabling block algorithms that exploit data locality, facilitate SIMD vectorisation, and reduce scheduling overhead.
A tf::IndexRange describes a typed, half-open index range [begin, end) with a step size. There are two variants:
- tf::IndexRange<T> (or equivalently tf::IndexRange<T, 1>): a 1D range of integral indices. It is defined by three parameters: begin, end, and step_size. The elements are begin, begin+step, begin+2*step, ... up to (but not including) end.
- tf::IndexRange<T, N>: an N-dimensional Cartesian product of N independent 1D ranges, one per dimension. Dimension 0 is the outermost (slowest varying) and dimension N-1 is the innermost (fastest varying), matching the natural nesting of C-style for-loops (row-major order).

The key advantage of tf::IndexRange<T, N> over a raw scalar index is that the callable receives the full sub-range (or sub-box) at once, allowing it to implement cache-friendly block algorithms, apply SIMD over contiguous indices, or exploit any other block-level optimisation.
Passing a 1D tf::IndexRange<T> to tf::Taskflow::for_each_by_index creates a parallel task whose callable receives one contiguous sub-range of the original range per invocation:
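A minimal sketch, assuming a tf::Taskflow object taskflow and a tf::Executor executor are already set up:

```cpp
// the half-open range [0, 17) with a step size of 2
tf::IndexRange<int> range(0, 17, 2);

// each invocation of the callable receives one contiguous sub-range
taskflow.for_each_by_index(range, [](tf::IndexRange<int> subrange){
  // iterate the indices covered by this sub-range
  for(int i = subrange.begin(); i < subrange.end(); i += subrange.step_size()) {
    printf("iterate %d\n", i);
  }
});

executor.run(taskflow).wait();
```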
Because the callable sees the full sub-range at once, it can implement cache-friendly block algorithms that are impractical when receiving a single index at a time. For instance, a worker assigned [32, 64) can stream through that entire block in one tight loop, rather than incurring per-element dispatch overhead.
Passing an N-dimensional tf::IndexRange<T, N> to tf::Taskflow::for_each_by_index creates a parallel task whose callable receives one orthogonal sub-box of the full ND space per invocation. A sub-box is a valid hyperrectangle — a contiguous, axis-aligned tile that preserves the original step sizes of every dimension. The figure below shows a 2x3x4 range decomposed into two sub-boxes (one per outer-dimension value), each a 3x4 slice:
For a 2D range, the parallel task represents:
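In sketch form, writing rows and cols for the two 1D component ranges of the 2D index range (illustrative names, not part of the API), the parallel task covers the same iteration space as:

```cpp
for(int i = rows.begin(); i < rows.end(); i += rows.step_size()) {    // dimension 0 (outer)
  for(int j = cols.begin(); j < cols.end(); j += cols.step_size()) {  // dimension 1 (inner)
    // loop body applied to index pair (i, j)
  }
}
```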
Each sub-box preserves the original step sizes for every dimension, so the inner loops are identical to what you would write for sequential code. The same pattern extends naturally to three or more dimensions:
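For instance, a 3D range covers the same iteration space as a triple-nested loop (again with illustrative names for the per-dimension ranges):

```cpp
for(int i = dim0.begin(); i < dim0.end(); i += dim0.step_size()) {
  for(int j = dim1.begin(); j < dim1.end(); j += dim1.step_size()) {
    for(int k = dim2.begin(); k < dim2.end(); k += dim2.step_size()) {
      // loop body applied to index triple (i, j, k)
    }
  }
}
```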
When a dimension has zero size, it and all dimensions inner to it produce no iterations — but outer dimensions are unaffected. This matches the behaviour of sequential nested loops: a zero-size inner loop simply never executes, while the outer loops still run.
This is equivalent to the sequential nested loop:
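For instance, with an outer dimension of size 3 and an inner dimension of size 0:

```cpp
for(int i = 0; i < 3; ++i) {     // outer dimension: still runs three times
  for(int j = 0; j < 0; ++j) {   // inner dimension: never executes
    // loop body is never reached
  }
}
```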
This behaviour differs from OpenMP's collapse clause, which treats the entire iteration space as a Cartesian product and yields zero total iterations when any dimension is empty. Taskflow preserves the sequential semantics of nested loops: outer dimensions always execute independently of inner ones.

Taskflow schedules parallel iterations over an index range by mapping the N-dimensional space to a 1D flat index space and distributing chunks of that flat space among workers. Specifically, the full ND range is first linearized in row-major order into a flat index space whose total size is range.size(). For a 3x4 2D range, this looks like:
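For example, with unit steps, the row-major flat index of element (i, j) is i*4 + j:

```cpp
// flat indices of a 3x4 range in row-major order
//
//        j=0  j=1  j=2  j=3
//  i=0     0    1    2    3
//  i=1     4    5    6    7
//  i=2     8    9   10   11
```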
Then, workers claim contiguous chunks of the flat index space, either pre-assigned (static partitioner) or on demand via an atomic cursor (dynamic partitioners). During assignment, each flat chunk boundary is aligned to a hyperplane boundary, i.e., a multiple of a suffix product of the dimension sizes (1, dim[N-1], dim[N-1] x dim[N-2], ...). This is the critical constraint: only boundaries that align to a complete inner row (or slice, or hyper-slice) produce a valid rectangular sub-box. The figure below illustrates why alignment matters for a 3x4 grid:
A cut at flat index 6 (mid-row) would split row i=1 between two workers, producing an L-shaped non-rectangular region that cannot be expressed as a single IndexRange sub-box. A cut at flat index 8 (row boundary) produces two clean rectangular sub-boxes. Taskflow enforces this automatically. The chunk_size hint is always snapped to the nearest hyperplane boundary via IndexRange::ceil():
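As a hypothetical illustration of the snapping rule (the numbers below are made up, not library output):

```cpp
// 3x4 grid -> suffix products {4, 1} -> row-aligned cut points: 0, 4, 8, 12
// requested chunk_size = 6 -> snapped up to 8 (the next multiple of the row size 4)
```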
With a static partitioner, each worker's chunk is pre-assigned before execution and never changes. The figure below shows a 3x4 range evenly split among 3 workers, each receiving exactly one aligned row:
With a dynamic partitioner, workers pull chunks from a shared atomic cursor. The chunk may overshoot to the next hyperplane boundary when the requested size does not align exactly, but the cursor self-corrects so no element is visited twice.
The chunk_size you pass to the partitioner is a hint, and the scheduler snaps it to the nearest valid boundary before distributing work.

When the range bounds are not known at task-graph construction time, pass the range by std::ref so that an upstream task can set the bounds before the parallel loop runs. This works for both 1D and multi-dimensional ranges.
When init finishes, pf reads the updated range and partitions the rows x cols space among workers.
Iterator-based parallel-for performs parallel iterations over a range specified by two STL-styled iterators, first and last. The task created by tf::Taskflow::for_each(B first, E last, C callable, P part) represents parallel execution of the following loop:
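In sketch form, the task is equivalent to:

```cpp
for(auto itr = first; itr != last; itr++) {
  callable(*itr);
}
```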
tf::Taskflow::for_each simultaneously applies the callable to the object obtained by dereferencing every iterator in the range [first, last). It is the user's responsibility to ensure the range is valid within the execution of the parallel-for task. Iterators must have the post-increment operator ++ defined.
Similar to index-based parallel-for, iterators can be passed by reference using std::ref so that one task can set up the range before another task performs the parallel-for.
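A minimal sketch of this pattern:

```cpp
std::vector<int> vec;
std::vector<int>::iterator first, last;

// init sizes the vector and sets the iterators at run time
tf::Task init = taskflow.emplace([&](){
  vec.resize(1000);
  first = vec.begin();
  last  = vec.end();
});

// pass the iterators by reference so pf reads their final values
tf::Task pf = taskflow.for_each(std::ref(first), std::ref(last), [](int item){
  std::cout << "parallel iteration on item " << item << '\n';
});

init.precede(pf);
```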
When init finishes, pf sees first pointing to the beginning of vec and last pointing to its end, and performs parallel iterations over the 1000 items.
A partitioner controls how the iteration space is divided among workers. Taskflow provides four partitioners, each suited to different workload characteristics:
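- tf::StaticPartitioner: chunks are computed and pre-assigned to workers before execution starts.
- tf::DynamicPartitioner: workers claim fixed-size chunks on demand from a shared counter.
- tf::GuidedPartitioner: workers claim chunks whose size shrinks as the remaining work decreases, which helps balance irregular workloads.
- tf::RandomPartitioner: workers claim chunks of randomized sizes.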
The following example creates two parallel-iteration tasks with different partitioners:
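A minimal sketch, where the chunk size of 100 is an arbitrary value chosen for illustration:

```cpp
std::vector<int> vec(1024, 0);

// two partitioners, each with an (illustrative) chunk size of 100
tf::StaticPartitioner  static_partitioner(100);
tf::DynamicPartitioner dynamic_partitioner(100);

// the same loop body scheduled under two different partitioning strategies
taskflow.for_each(vec.begin(), vec.end(), [](int& item){ /* uniform work */ }, static_partitioner);
taskflow.for_each(vec.begin(), vec.end(), [](int& item){ /* irregular work */ }, dynamic_partitioner);
```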
As a rule of thumb, prefer tf::StaticPartitioner for uniform workloads (e.g., element-wise arithmetic on arrays) and tf::GuidedPartitioner for irregular workloads (e.g., graph traversal, variable-length processing). tf::DynamicPartitioner is a good choice when chunks must be kept small and strictly equal in size.