We study how to express divide-and-conquer parallelism using Taskflow, using parallel merge sort as a concrete example. Divide-and-conquer is one of the most important parallel programming patterns and appears across sorting, searching, tree traversal, numerical computation, and graph algorithms.
A divide-and-conquer algorithm solves a problem by recursively splitting it into two or more independent subproblems, solving each subproblem separately, and combining the results. The recursive structure is:
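As a sketch of that structure, here is a divide-and-conquer recursion illustrated with a hypothetical recursive sum (the function and names are illustrative, not from Taskflow):

```cpp
#include <vector>
#include <cstddef>

// Shape of a divide-and-conquer solver, illustrated with a recursive sum
// over a vector: split the range, solve each half, combine the results.
long long dac_sum(const std::vector<int>& v, std::size_t lo, std::size_t hi) {
  if (hi - lo <= 1) return hi > lo ? v[lo] : 0;  // base case: solve directly
  std::size_t mid = lo + (hi - lo) / 2;          // divide into P1 and P2
  long long r1 = dac_sum(v, lo, mid);            // solve(P1)
  long long r2 = dac_sum(v, mid, hi);            // solve(P2)
  return r1 + r2;                                // combine(r1, r2)
}
```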
The key observation is that the two recursive calls, solve(P1) and solve(P2), are independent of each other and can therefore run in parallel. The combine step must wait for both, but no other dependency exists.
Taskflow's tf::TaskGroup provides a natural way to express this: spawn one subproblem asynchronously, solve the other inline (a tail-call-style optimisation), then call corun to wait cooperatively for the spawned task before combining. Note that corun blocks only the program control flow, not the calling worker: the worker keeps participating in the work-stealing loop, cooperatively helping to execute the spawned subtask or any other available work.
Merge sort is the canonical divide-and-conquer algorithm: divide the array into two halves, recursively sort each half, and merge the two sorted halves into a single sorted sequence.
The sequential implementation is:
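A sketch of the sequential version, using std::inplace_merge for the combine step:

```cpp
#include <algorithm>
#include <vector>
#include <cstddef>

// Sequential merge sort over the half-open range [beg, end).
void merge_sort(std::vector<int>& v, std::size_t beg, std::size_t end) {
  if (end - beg <= 1) return;          // base case: already sorted
  std::size_t mid = beg + (end - beg) / 2;
  merge_sort(v, beg, mid);             // sort left half
  merge_sort(v, mid, end);             // sort right half
  // combine: merge the two sorted halves in place
  std::inplace_merge(v.begin() + beg, v.begin() + mid, v.begin() + end);
}
```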
We parallelise the two recursive calls using tf::TaskGroup. A task group is created from the executor and provides the same silent_async and corun interface as tf::Runtime, but can be used from any execution context of a worker. There is no need for the caller to already be a runtime task.
We also apply a cutoff: when the subrange is smaller than a threshold, we fall back to std::sort rather than spawning more tasks. Without a cutoff, the recursion would create far more tasks than there are cores, and the scheduling overhead would dominate the actual computation:
Several design choices in this example apply broadly to parallel divide-and-conquer algorithms:
Recursively spawning tasks down to single elements would create O(N) tasks with trivial work each. The cutoff switches to std::sort for small subranges, keeping the task count proportional to the available parallelism rather than the input size. A good rule of thumb is to set the cutoff so that the sequential base case takes at least a few microseconds — typically a few thousand elements for integer sorting.
Only the left half is spawned as an async task; the right half is sorted inline in the current execution context. This halves the number of tasks at every level of the recursion tree and eliminates one level of task-creation overhead per recursive call.
Calling tf::TaskGroup::corun blocks only the program control flow (execution does not proceed past corun until the spawned tasks finish), not the calling thread. While waiting, the calling thread participates in the work-stealing loop, ensuring all threads remain productive. This is essential for recursive parallelism: if the calling thread blocked outright, it would hold a worker hostage while the spawned task waits for a free worker, potentially causing deadlock or severe under-utilisation.
Because of this cooperative execution model, a tf::TaskGroup can only be created by a worker of an executor.
The same tf::TaskGroup pattern applies to any divide-and-conquer algorithm. Replace the merge sort specifics with the divide/conquer/combine steps of your algorithm:
Examples of divide-and-conquer algorithms that fit this template directly include parallel quicksort, recursive reductions (sum, min/max), binary tree traversals, and recursive numerical algorithms such as the fast Fourier transform.
Runtime comparison for sorting 1M random integers on a 12-core machine:
| Algorithm | Time |
|---|---|
| std::sort (sequential) | 180 ms |
| Parallel merge sort (cutoff=1024) | 38 ms |
| Speed-up | ~4.7× |
The speed-up is sub-linear because std::inplace_merge at the top levels of the recursion tree is sequential and dominates as the recursive parallelism collapses. Using a parallel merge step or switching to parallel quicksort (where the combine step is trivial) would push the speed-up closer to the core count.