Taskflow provides a function for constructing a task to merge two sorted ranges into a single sorted output range in parallel.
You need to include the header file, taskflow/algorithm/merge.hpp, for creating a parallel-merge task.
The standard library std::merge walks both input sequences simultaneously from left to right in a single thread, producing a sorted output in O(n+m) time. While optimal for a single core, this sequential walk cannot exploit multiple CPU cores — at any point only one comparison is in flight. For large inputs (e.g. merging two sorted arrays of tens of millions of elements), this leaves the majority of available hardware idle.
Taskflow's parallel merge divides the output into W independent chunks (one per worker thread) and uses the co-rank technique to identify each worker's exact input sub-ranges, allowing all workers to merge their chunks simultaneously with no synchronization.
The task created by tf::Taskflow::merge(B1 first1, E1 last1, B2 first2, E2 last2, O d_first) merges the two sorted ranges [first1, last1) and [first2, last2) into the output range beginning at d_first, using std::less as the comparator. It represents the parallel execution of std::merge:
The following example merges two sorted vectors of one million integers using four worker threads:
To merge with a custom comparator, pass it as the last argument. Both input ranges must be sorted with respect to that comparator:
You can pass iterators by reference using std::ref to marshal parameter updates between dependent tasks. This is useful when the ranges are not known at task-graph construction time but are initialized by an upstream task.
When init finishes, the merge task reads the updated iterators and merges the two runtime-defined sequences.
The merged output has N = n + m elements (where n = |seq1| and m = |seq2|).
The algorithm divides the output into W equal contiguous chunks of size K = N/W, one per worker thread. Worker w is responsible for writing output positions [w*K, (w+1)*K). The key question is: for each output chunk, which elements of seq1 and seq2 belong to it? This is what the co-rank function answers.
For a given output position rank, co_rank(rank) finds i and j such that:
- The first rank elements of the merged output consist of exactly i elements from seq1[0..i) and j elements from seq2[0..j), with i + j = rank.
- These i + j elements are interleaved in sorted order in the output.

Note that co_rank does not decide how the elements are arranged in the output; it only identifies how many elements come from each sequence. The actual placement of these elements is accomplished via std::merge.

Once a worker knows (i_start, j_start) at its chunk's beginning and (i_end, j_end) at its chunk's end, it can independently merge seq1[i_start..i_end) with seq2[j_start..j_end) and write the result directly to its output region, with no communication with other workers.
For a given rank, co_rank binary-searches for the unique i in the range [max(0, rank-m), min(n, rank)] such that the partition (seq1[0..i), seq2[0..rank-i)) is valid. A partition is valid when:
- No element taken from seq1 exceeds the first remaining element of seq2: seq1[i-1] <= seq2[j]
- No element taken from seq2 exceeds the first remaining element of seq1: seq2[j-1] <= seq1[i]

Because both sequences are sorted, as i increases:
- seq1[i-1] increases (moving right in seq1)
- seq2[j-1] decreases (moving left in seq2, since j = rank - i decreases)

This means the condition seq2[j-1] <= seq1[i] transitions from false to true exactly once, making binary search applicable. The algorithm needs to check only one condition per iteration:
- seq2[j-1] > seq1[i]: i is not large enough -> low = i + 1
- seq2[j-1] <= seq1[i]: i is valid or too large -> high = i

The figure below shows two iterations of the binary search for rank=5 on sequences seq1=[1,3,5,7,9,11] and seq2=[2,4,6,8,10,12]:
The search converges to i=3, j=2: the first 5 merged elements consist of seq1[0..3)=[1,3,5] and seq2[0..2)=[2,4], which interleave to [1,2,3,4,5].
Unlike tf::Taskflow::for_each, where users can configure a partitioner to adapt to unequal per-element costs, std::merge on a chunk of size K always costs O(K) regardless of the data values, so there is little load imbalance to mitigate. As a result, tf::Taskflow::merge always adopts static partitioning, i.e., W chunks of size N/W, one per worker, which is the optimal strategy for parallel merge.