Fibonacci Number
We study the classic problem, Fibonacci Number, to demonstrate the use of recursive task parallelism.
Problem Formulation
In mathematics, the Fibonacci numbers, commonly denoted F(n)
, form a sequence such that each number is the sum of the two preceding ones, starting from 0 and 1.
0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, ...
A common solution for computing fibonacci numbers is recursion.
int fib(int n) { if(n < 2) return n; return fib(n-1) + fib(n-2); }
Recursive Fibonacci Parallelism
We use Runtime Tasking and Asynchronous Tasking to recursively compute Fibonacci numbers in parallel. A runtime task tasks a reference to tf::
#include <taskflow/taskflow.hpp> size_t fibonacci(size_t N, tf::Runtime& rt) { if(N < 2) return N; size_t res1, res2; rt.silent_async([N, &res1](tf::Runtime& rt1){ res1 = fibonacci(N-1, rt1); }); rt.silent_async([N, &res2](tf::Runtime& rt2){ res2 = fibonacci(N-2, rt2); }); // use corun to avoid blocking the worker from waiting the two children tasks // to finish rt.corun(); return res1 + res2; } int main() { tf::Executor executor; size_t N = 5, res; executor.silent_async([N, &res](tf::Runtime& rt){ res = fibonacci(N, rt); }); executor.wait_for_all(); std::cout << N << "-th Fibonacci number is " << res << '\n'; return 0; }
The fibonacci
function recursively spawns two asynchronous tasks to compute fibonacci(N-1)
and fibonacci(N-2)
in parallel using tf::
. After spawning the two tasks, the function invokes tf::rt
complete, without blocking the caller worker. In the main function, the executor creates an async task from the top Fibonacci number and waits for completion using tf::
Tail Recursion Optimization
In recursive parallelism, especially for problems like Fibonacci computation, spawning both recursive branches as asynchronous tasks can lead to excessive task creation and stack growth, which may degrade performance and overwhelm the runtime scheduler. Additionally, when both child tasks are launched asynchronously, the parent task must wait for both to finish, potentially blocking a worker thread and reducing parallel throughput. To address these issues, we apply tail recursion optimization to one branch of the Fibonacci call. This allows one of the recursive calls to proceed immediately in the current execution context, reducing both scheduling overhead and stack usage.
size_t fibonacci(size_t N, tf::Runtime& rt) { if(N < 2) return N; size_t res1, res2; rt.silent_async([N, &res1](tf::Runtime& rt1){ res1 = fibonacci(N-1, rt1); }); // tail optimization for the right child res2 = fibonacci(N-2, rt); // use corun to avoid blocking the worker from waiting the two children tasks // to finish rt.corun(); return res1 + res2; }
The figure below shows the execution diagram, where the suffix *_1 represent the left child spawned by its parent runtime. As we can see, the right child is optimized out through tail recursion optimization.
Benchmarking
Based on the discussion above, we compare the runtime of recursive Fibonacci parallelism (1) with tail recursion optimization and (2) without it, across different Fibonacci numbers.
N | w/ tail recursion optimization | w/o tail recursion optimization |
---|---|---|
20 | 0.23 ms | 0.31 ms |
25 | 2 ms | 4 ms |
30 | 23 ms | 42 ms |
35 | 269 ms | 483 ms |
40 | 3003 ms | 5124 ms |
As N
increases, the performance gap between the two versions widens significantly. With tail recursion optimization, the program avoids spawning another async task, thereby reducing scheduling overhead and stack pressure. This leads to better CPU utilization and lower task management cost. For example, at N = 40
, tail recursion optimization reduces the runtime by over 40%.