We study the classic problem, Fibonacci Number, to demonstrate the use of recursive task parallelism.
In mathematics, the Fibonacci numbers, commonly denoted as fibonacci(n), form a sequence such that each number is the sum of the two preceding ones, starting from 0 and 1.
0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, ...
A common solution for computing Fibonacci numbers is recursion:
We use Runtime Tasking and Asynchronous Tasking to recursively compute Fibonacci numbers in parallel. A runtime task takes a reference to tf::Runtime as its argument, allowing users to interact with the executor and spawn tasks dynamically. The example below demonstrates a parallel recursive implementation of Fibonacci numbers using tf::Runtime:
The fibonacci function recursively spawns two asynchronous tasks to compute fibonacci(N-1) and fibonacci(N-2) in parallel using tf::Runtime::silent_async. After spawning the two tasks, the function invokes tf::Runtime::corun() to wait until all tasks spawned by rt complete, without blocking the caller worker. In the main function, the executor creates an async task from the top Fibonacci number and waits for completion using tf::Executor::wait_for_all. Once finished, the result is printed. The figure below shows the execution diagram, where the suffixes *_1 and *_2 represent the left and right children spawned by their parent runtime:
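A sketch consistent with the description above is shown below. It assumes Taskflow's documented tf::Runtime and tf::Executor interfaces (tf::Runtime::silent_async, tf::Runtime::corun, tf::Executor::wait_for_all); exact signatures may differ across Taskflow versions, and the original listing may vary in detail:

```cpp
#include <iostream>
#include <taskflow/taskflow.hpp>

// Recursively spawn both branches as asynchronous tasks on the runtime.
size_t fibonacci(size_t N, tf::Runtime& rt) {
  if(N < 2) {
    return N;
  }
  size_t res1, res2;
  rt.silent_async([N, &res1](tf::Runtime& rt1){ res1 = fibonacci(N - 1, rt1); });
  rt.silent_async([N, &res2](tf::Runtime& rt2){ res2 = fibonacci(N - 2, rt2); });
  // Wait for all tasks spawned by rt to complete, without blocking the caller worker.
  rt.corun();
  return res1 + res2;
}

int main() {
  tf::Executor executor;
  size_t res;
  // Create an async task from the top Fibonacci number and wait for completion.
  executor.silent_async([&res](tf::Runtime& rt){ res = fibonacci(10, rt); });
  executor.wait_for_all();
  std::cout << "fibonacci(10) = " << res << '\n';
  return 0;
}
```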
In recursive parallelism, especially for problems like Fibonacci computation, spawning both recursive branches as asynchronous tasks can lead to excessive task creation and stack growth, which may degrade performance and overwhelm the runtime scheduler. Additionally, when both child tasks are launched asynchronously, the parent task must wait for both to finish, potentially blocking a worker thread and reducing parallel throughput. To address these issues, we apply tail recursion optimization to one branch of the Fibonacci call. This allows one of the recursive calls to proceed immediately in the current execution context, reducing both scheduling overhead and stack usage.
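With this optimization, only the left branch is spawned as an asynchronous task, while the right branch runs inline in the current execution context. A sketch under the same assumptions as before (Taskflow's tf::Runtime API; version details may differ):

```cpp
#include <taskflow/taskflow.hpp>

// Spawn only the left branch asynchronously; tail-optimize the right branch
// by computing it directly in the current execution context.
size_t fibonacci(size_t N, tf::Runtime& rt) {
  if(N < 2) {
    return N;
  }
  size_t res1;
  rt.silent_async([N, &res1](tf::Runtime& rt1){ res1 = fibonacci(N - 1, rt1); });
  // The right child proceeds immediately, avoiding one task spawn per call.
  size_t res2 = fibonacci(N - 2, rt);
  // Wait only for the spawned left branch to complete.
  rt.corun();
  return res1 + res2;
}
```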
The figure below shows the execution diagram, where the suffix *_1 represents the left child spawned by its parent runtime. As we can see, the right child is optimized away through tail recursion optimization.
Based on the discussion above, we compare the runtime of recursive Fibonacci parallelism (1) with tail recursion optimization and (2) without it, across different Fibonacci numbers.
| N | w/ tail recursion optimization | w/o tail recursion optimization |
|---|---|---|
| 20 | 0.23 ms | 0.31 ms |
| 25 | 2 ms | 4 ms |
| 30 | 23 ms | 42 ms |
| 35 | 269 ms | 483 ms |
| 40 | 3003 ms | 5124 ms |
As N increases, the performance gap between the two versions widens significantly. With tail recursion optimization, the program avoids spawning another async task, thereby reducing scheduling overhead and stack pressure. This leads to better CPU utilization and lower task management cost. For example, at N = 40, tail recursion optimization reduces the runtime by over 40%.
Similar to tf::Runtime, tf::TaskGroup provides a lightweight and structured mechanism for expressing recursive parallelism directly within a running task (see Task Group). The example below demonstrates how to parallelize Fibonacci using a task group: