We study recursive task parallelism using the Fibonacci sequence as a concrete example, demonstrating how tf::Runtime enables dynamic task creation and cooperative synchronisation without blocking worker threads.
The Fibonacci sequence is defined such that each number is the sum of the two preceding ones, starting from 0 and 1:
0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, ...
The natural recursive implementation is:
The two recursive calls are mutually independent and can therefore run in parallel. The challenge is to express this recursive parallelism efficiently — spawning tasks that themselves spawn tasks — without blocking worker threads or creating unbounded overhead.
A runtime task accepts a tf::Runtime& parameter, giving it a live handle to the scheduling runtime. We use tf::Runtime::silent_async to spawn both recursive branches in parallel and tf::Runtime::corun to wait for them cooperatively. corun does not block the calling worker — it participates in the work-stealing loop while waiting, so other tasks can run on the same thread:
Each call to fibonacci recursively spawns two async tasks and then waits cooperatively for both to finish. The executor distributes tasks across all available workers using work-stealing, achieving parallelism at every level of the recursion tree. The execution diagram for fibonacci(4) is shown below. The suffixes _1 and _2 denote the left and right children spawned by their parent runtime:
Spawning both branches asynchronously creates twice as many async tasks as necessary: the calling context is free to compute one branch itself, so spawning that branch asynchronously adds scheduling overhead without adding any parallelism. The standard optimisation, often called tail optimisation, is to compute one branch inline (directly in the current context) and spawn only the other asynchronously. This halves the task count, reduces stack pressure, and cuts scheduling overhead at every level of recursion:
The optimised execution diagram for fibonacci(4) is shown below. The right branch at each level has been eliminated, replaced by inline computation:
The table below compares wall-clock runtime of the two implementations across different values of N on a multi-core machine:
| N | with tail optimisation | without tail optimisation |
|---|---|---|
| 20 | 0.23 ms | 0.31 ms |
| 25 | 2 ms | 4 ms |
| 30 | 23 ms | 42 ms |
| 35 | 269 ms | 483 ms |
| 40 | 3003 ms | 5124 ms |
The performance gap widens as N increases because the number of tasks created grows exponentially. At N = 40, tail optimisation cuts runtime by over 40% by eliminating half the task-creation and scheduling overhead at every recursive level.
tf::TaskGroup offers a lighter-weight alternative to tf::Runtime for recursive parallelism from any calling context — no need for the caller itself to be a runtime task. The interface is the same: spawn sub-tasks with silent_async, then call corun to wait cooperatively.