Taskflow provides template functions for constructing tasks to perform parallel reduction over a range of items.
You need to include the header file taskflow/algorithm/reduce.hpp to create a parallel-reduction task.
The task created by tf::Taskflow::reduce(B first, E last, T& result, O bop, P part) performs parallel reduction over the range [first, last) using the binary operator bop and stores the reduced result in result. It represents the parallel execution of the following loop:
At runtime, the reduction task partitions the range among workers, each computing a partial result, and then combines those partial results into result using bop. The initial value of result participates in the reduction — it is combined with the partial results as if it were an additional element. result is captured by reference inside the task; it is the user's responsibility to ensure it remains alive during execution.
The order in which bop is applied to pairs of elements is unspecified. Elements of the range may be grouped and rearranged in arbitrary order, as illustrated below for a sum-reduction over eight elements:
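For instance, with eight elements e1 through e8 and an initial value r, one possible (but not guaranteed) grouping is:

```
result = bop(bop(r, bop(bop(e1, e2), bop(e3, e4))),
             bop(bop(e5, e6), bop(e7, e8)))
```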
The result and argument types of bop must be consistent with the element type.
You can pass iterators by reference using std::ref to marshal parameter updates between dependent tasks. This is useful when the range is not known at task-graph construction time but is initialized by an upstream task.
When init finishes, first and last point to the initialized data range of vec, and the reduction task performs parallel reduction over the 10 elements.
It is common to transform each element into a new type and then reduce over the transformed values. The task created by tf::Taskflow::transform_reduce(B first, E last, T& result, BOP bop, UOP uop, P part) applies the unary operator uop to each element and then performs parallel reduction over result and the transformed values using bop. It represents the parallel execution of the following loop:
The example below transforms each digit character in a string to an integer and then sums them in parallel:
The order in which bop is applied to the transformed elements is unspecified. It is possible that bop will receive r-value arguments from both sides (e.g., bop(uop(*itr1), uop(*itr2))) due to transformed temporaries. When data passing is expensive, define the result type T to be move-constructible.
As with tf::Taskflow::reduce, iterators can be passed by reference using std::ref so that an upstream task can set up the range before the parallel-transform-reduction runs.
Unlike tf::Taskflow::reduce, which gives each worker a single element at a time, tf::Taskflow::reduce_by_index gives each worker a contiguous subrange of the index space. This allows the local reduction to be written as an explicit loop over the subrange, enabling optimizations such as SIMD vectorization, custom accumulator types, or data initialization interleaved with reduction. tf::Taskflow::reduce_by_index represents the parallel execution of the following two-phase loop:
The local operator lop is invoked once per subrange assigned to a worker. Its second argument is a std::optional<T> carrying the running total accumulated by that worker so far: it is std::nullopt on the first subrange processed by a worker, in which case the worker initializes its accumulator from scratch, and otherwise holds the partial result accumulated over the worker's previous subranges. The global operator gop combines the per-worker partial results and the initial value of result into the final answer.
The example below performs a sum-reduction over a large array, initializing each element inside the local reducer:
The global reducer combines all partial results with the initial value of res (here 1.0), so the final answer is 1.0 + 100000.0 = 100001.0.
You can pass the index range by reference using std::ref so that an upstream task can set the bounds before the parallel-reduce-by-index runs.
A partitioner controls how the iteration space is divided among workers. Taskflow provides four partitioners, each suited to different workload characteristics: tf::GuidedPartitioner, tf::DynamicPartitioner, tf::StaticPartitioner, and tf::RandomPartitioner.
The following example creates two parallel-reduction tasks using different partitioners:
As a rule of thumb, prefer tf::StaticPartitioner when every element costs the same to reduce (e.g., summation over a plain array) and tf::GuidedPartitioner for irregular workloads (e.g., reductions whose cost depends on the element value). tf::DynamicPartitioner is a good choice when chunks must be kept small and strictly equal in size.