Learning from Examples » Fibonacci Number

We study the classic problem, Fibonacci Number, to demonstrate the use of recursive task parallelism.

Problem Formulation

In mathematics, the Fibonacci numbers, commonly denoted F(n), form a sequence such that each number is the sum of the two preceding ones, starting from 0 and 1.

0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, ...
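
Equivalently, the sequence is defined by the recurrence F(0) = 0, F(1) = 1, and F(n) = F(n-1) + F(n-2) for n > 1.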

A common solution for computing Fibonacci numbers is recursion:

int fib(int n) {
  if(n < 2) return n;
  return fib(n-1) + fib(n-2);
}
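
For example, fib(5) expands to fib(4) + fib(3) = 3 + 2 = 5, with each subcall recursing further until it reaches the base cases fib(1) = 1 and fib(0) = 0.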

Recursive Fibonacci Parallelism

We use Runtime Tasking and Asynchronous Tasking to recursively compute Fibonacci numbers in parallel. A runtime task takes a reference to tf::Runtime as its argument, allowing users to interact with the executor and spawn tasks dynamically. The example below demonstrates a parallel recursive implementation of Fibonacci numbers using tf::Runtime:

#include <taskflow/taskflow.hpp>

size_t fibonacci(size_t N, tf::Runtime& rt) {

  if(N < 2) return N; 

  size_t res1, res2;
  rt.silent_async([N, &res1](tf::Runtime& rt1){ res1 = fibonacci(N-1, rt1); });
  rt.silent_async([N, &res2](tf::Runtime& rt2){ res2 = fibonacci(N-2, rt2); });

  // use corun to wait for the two child tasks to finish without blocking
  // the caller worker
  rt.corun();

  return res1 + res2;
}

int main() {

  tf::Executor executor;
  
  size_t N = 5, res;
  executor.silent_async([N, &res](tf::Runtime& rt){ res = fibonacci(N, rt); });
  executor.wait_for_all();

  std::cout << N << "-th Fibonacci number is " << res << '\n';

  return 0;
}

The fibonacci function recursively spawns two asynchronous tasks to compute fibonacci(N-1) and fibonacci(N-2) in parallel using tf::Runtime::silent_async. After spawning the two tasks, the function invokes tf::Runtime::corun() to wait until all tasks spawned by rt complete, without blocking the caller worker. In the main function, the executor creates an async task that computes the top Fibonacci number and waits for it to complete using tf::Executor::wait_for_all. Once finished, the result is printed. The figure below shows the execution diagram, where the suffixes *_1 and *_2 represent the left and right children spawned by their parent runtime:

fibonacci(4) [rt]
├── fibonacci(3) [rt1]
│   ├── fibonacci(2) [rt1_1]
│   │   ├── fibonacci(1) [rt1_1_1]
│   │   └── fibonacci(0) [rt1_1_2]
│   └── fibonacci(1) [rt1_2]
└── fibonacci(2) [rt2]
    ├── fibonacci(1) [rt2_1]
    └── fibonacci(0) [rt2_2]
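
By default, tf::Executor spawns as many worker threads as std::thread::hardware_concurrency(). To control the degree of parallelism explicitly, you can pass a worker count to the executor constructor. The sketch below is a minimal variation of the main function above; the worker count of 4 is an arbitrary choice for illustration, and fibonacci refers to the parallel version defined earlier:

int main() {

  // construct an executor with an explicit number of workers (4 is arbitrary)
  tf::Executor executor(4);

  std::cout << "running on " << executor.num_workers() << " workers\n";

  size_t N = 5, res;
  executor.silent_async([N, &res](tf::Runtime& rt){ res = fibonacci(N, rt); });
  executor.wait_for_all();

  std::cout << N << "-th Fibonacci number is " << res << '\n';

  return 0;
}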

Tail Recursion Optimization

In recursive parallelism, especially for problems like Fibonacci computation, spawning both recursive branches as asynchronous tasks can lead to excessive task creation and stack growth, which may degrade performance and overwhelm the runtime scheduler. Additionally, when both child tasks are launched asynchronously, the parent task must wait for both to finish, potentially blocking a worker thread and reducing parallel throughput. To address these issues, we apply tail recursion optimization to one branch of the Fibonacci call. This allows one of the recursive calls to proceed immediately in the current execution context, reducing both scheduling overhead and stack usage.

size_t fibonacci(size_t N, tf::Runtime& rt) {

  if(N < 2) return N; 

  size_t res1, res2;
  rt.silent_async([N, &res1](tf::Runtime& rt1){ res1 = fibonacci(N-1, rt1); });
  
  // tail optimization for the right child
  res2 = fibonacci(N-2, rt);

  // use corun to wait for the spawned child task to finish without blocking
  // the caller worker
  rt.corun();

  return res1 + res2;
}

The figure below shows the execution diagram, where the suffix *_1 represents a left child spawned by its parent runtime. As we can see, the right child is no longer spawned as a separate asynchronous task; through tail recursion optimization it reuses its parent's runtime and executes in the parent's context.

fibonacci(4) [rt]
├── fibonacci(3) [rt1]
│   ├── fibonacci(2) [rt1_1]
│   │   ├── fibonacci(1) [rt1_1_1]
│   │   └── fibonacci(0) [rt1_1]
│   └── fibonacci(1) [rt1]
└── fibonacci(2) [rt]
    ├── fibonacci(1) [rt1]
    └── fibonacci(0) [rt]

Benchmarking

Based on the discussion above, we compare the runtime of recursive Fibonacci parallelism (1) with tail recursion optimization and (2) without it, across different Fibonacci numbers.

N     w/ tail recursion optimization    w/o tail recursion optimization
20    0.23 ms                           0.31 ms
25    2 ms                              4 ms
30    23 ms                             42 ms
35    269 ms                            483 ms
40    3003 ms                           5124 ms

As N increases, the performance gap between the two versions widens significantly. With tail recursion optimization, the program avoids spawning another async task, thereby reducing scheduling overhead and stack pressure. This leads to better CPU utilization and lower task management cost. For example, at N = 40, tail recursion optimization reduces the runtime by over 40%.
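
Absolute runtimes depend on the machine and the number of workers. To reproduce a measurement like those in the table, you can wrap the top-level call in std::chrono timers. The snippet below is a minimal sketch that assumes the parallel fibonacci function from one of the listings above is defined in the same file; N = 40 mirrors the last row of the table:

#include <chrono>
#include <iostream>
#include <taskflow/taskflow.hpp>

// assumed to be the parallel fibonacci defined in one of the listings above
size_t fibonacci(size_t N, tf::Runtime& rt);

int main() {

  tf::Executor executor;

  size_t N = 40, res;

  // time the whole computation, including task scheduling
  auto beg = std::chrono::steady_clock::now();
  executor.silent_async([N, &res](tf::Runtime& rt){ res = fibonacci(N, rt); });
  executor.wait_for_all();
  auto end = std::chrono::steady_clock::now();

  std::cout << N << "-th Fibonacci number is " << res << " (computed in "
            << std::chrono::duration_cast<std::chrono::milliseconds>(end - beg).count()
            << " ms)\n";

  return 0;
}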