Release Notes » Release 3.6.0 (2023/05/07)

Taskflow 3.6.0 is the 7th release in the 3.x line! This release includes several changes, such as dynamic task graph parallelism, improved parallel algorithms, a modified GPU tasking interface, and updated documentation, examples, and unit tests.

Download

Taskflow 3.6.0 can be downloaded from here.

System Requirements

To use Taskflow v3.6.0, you need a compiler that supports C++17:

GNU C++ Compiler at least v8.4 with -std=c++17
Clang C++ Compiler at least v6.0 with -std=c++17
Microsoft Visual Studio at least v19.27 with /std:c++17
AppleClang Xcode Version at least v12.0 with -std=c++17
Nvidia CUDA Toolkit and Compiler (nvcc) at least v11.1 with -std=c++17
Intel C++ Compiler at least v19.0.1 with -std=c++17
Intel DPC++ Clang Compiler at least v13.0.0 with -std=c++17 and SYCL20

Taskflow works on Linux, Windows, and Mac OS X.
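
A project that depends on Taskflow may want to fail early when it is built without C++17. The following is a minimal sketch of such a guard, not something shipped with Taskflow; the macro name PROJECT_CPP_STANDARD is hypothetical. It relies on the fact that Microsoft's compiler reports the selected standard through _MSVC_LANG, while __cplusplus stays at 199711L unless /Zc:__cplusplus is passed.

```cpp
// Hypothetical project-side guard (not part of Taskflow).
// MSVC exposes the language level chosen by /std: through _MSVC_LANG;
// other compilers report it through __cplusplus.
#if defined(_MSVC_LANG)
  #define PROJECT_CPP_STANDARD _MSVC_LANG
#else
  #define PROJECT_CPP_STANDARD __cplusplus
#endif

// 201703L is the value that identifies C++17.
static_assert(PROJECT_CPP_STANDARD >= 201703L,
              "A C++17 compiler is required (-std=c++17 or /std:c++17)");
```

Placing the guard in a header that is included before any Taskflow header turns a long list of template errors into one readable message.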

Release Summary

This release contains several changes that substantially enhance the programmability of GPU tasking and standard parallel algorithms. More importantly, it introduces a new dependent asynchronous tasking model that offers great flexibility for expressing dynamic task graph parallelism.

New Features

Taskflow Core

Added new async methods to support dynamic task graph creation
Added new async and join methods to tf::Runtime
Added a new partitioner interface to optimize parallel algorithms
Added parallel-scan algorithms to Taskflow
Added parallel-find algorithms to Taskflow
- tf::Taskflow::find_if(B first, E last, T& result, UOP predicate, P&& part)
- tf::Taskflow::find_if_not(B first, E last, T& result, UOP predicate, P&& part)
- tf::Taskflow::min_element(B first, E last, T& result, C comp, P&& part)
- tf::Taskflow::max_element(B first, E last, T& result, C comp, P&& part)
Modified tf::Subflow as a derived class from tf::Runtime
Extended parallel algorithms to support different partitioning algorithms
- tf::Taskflow::for_each_index(B first, E last, S step, C callable, P&& part)
- tf::Taskflow::for_each(B first, E last, C callable, P&& part)
- tf::Taskflow::transform(B first1, E last1, O d_first, C c, P&& part)
- tf::Taskflow::transform(B1 first1, E1 last1, B2 first2, O d_first, C c, P&& part)
- tf::Taskflow::reduce(B first, E last, T& result, O bop, P&& part)
- tf::Taskflow::transform_reduce(B first, E last, T& result, BOP bop, UOP uop, P&& part)
Improved the performance of tf::Taskflow::sort for plain-old-data (POD) type
Extended task-parallel pipeline to handle token dependencies
- Task-parallel Pipeline with Token Dependencies

cudaFlow

Removed algorithms that require a buffer from tf::cudaFlow due to update limitations
Removed support for a dedicated cudaFlow task in Taskflow
- all uses of tf::cudaFlow and tf::cudaFlowCapturer are now standalone

Utilities

Added all_same templates to check if a parameter pack has the same type
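
To illustrate what such a check does, here is a minimal sketch of an all-same trait written with the C++17 standard library. It is placed in a hypothetical namespace sketch and is not Taskflow's implementation, which may be structured differently.

```cpp
#include <type_traits>

namespace sketch {  // hypothetical namespace, not part of Taskflow

// True when every type in Ts is exactly T; an empty Ts is trivially true.
template <typename T, typename... Ts>
struct all_same : std::conjunction<std::is_same<T, Ts>...> {};

template <typename T, typename... Ts>
inline constexpr bool all_same_v = all_same<T, Ts...>::value;

}  // namespace sketch

// Compile-time checks of the behavior described above.
static_assert(sketch::all_same_v<int, int, int>);
static_assert(!sketch::all_same_v<int, int, double>);
static_assert(sketch::all_same_v<int>);
```

A trait of this kind is typically used in a static_assert or an enable_if condition to constrain a variadic function to homogeneous arguments.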

Taskflow Profiler (TFProf)

Removed cudaFlow and syclFlow tasks

Bug Fixes

Fixed the compilation error caused by MAX_PRIORITY clashing with winspool.h (#459)
Fixed the compilation error caused by tf::TaskView::for_each_successor and tf::TaskView::for_each_dependent
Fixed the infinite-loop bug when corunning a module task from tf::Runtime

If you encounter any potential bugs, please submit an issue at the issue tracker.

Breaking Changes

Dropped support for cancelling asynchronous tasks

// previous - no longer supported
tf::Future<int> fu = executor.async([](){
  return 1;
});
fu.cancel();
std::optional<int> res = fu.get();  // res may be std::nullopt or 1

// now - use std::future instead
std::future<int> fu = executor.async([](){
  return 1;
});
int res = fu.get();
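
Code that relied on fu.cancel() can approximate the old behavior with cooperative cancellation. The sketch below is one possible approach, not a Taskflow facility: the helper make_cancellable and the namespace sketch are hypothetical. It wraps a callable so that it checks a user-owned flag when it starts and returns std::nullopt if the flag was already set.

```cpp
#include <atomic>
#include <cassert>
#include <optional>
#include <type_traits>
#include <utility>

namespace sketch {  // hypothetical namespace, not part of Taskflow

// Wraps `work` (which must return a non-void value) into a callable that
// skips the work and yields std::nullopt when `cancelled` is already set.
// The flag must outlive the returned callable.
template <typename F>
auto make_cancellable(const std::atomic<bool>& cancelled, F work) {
  using R = std::invoke_result_t<const F&>;
  return [&cancelled, work = std::move(work)]() -> std::optional<R> {
    if (cancelled.load()) {
      return std::nullopt;  // cancellation was requested before the task ran
    }
    return work();
  };
}

// Example use, calling the wrapped task directly for illustration.
inline void example() {
  std::atomic<bool> cancelled{false};
  auto task = make_cancellable(cancelled, [] { return 1; });
  assert(task().has_value());   // not cancelled: the work runs
  cancelled.store(true);
  assert(!task().has_value());  // cancelled: the work is skipped
}

}  // namespace sketch
```

Because the wrapped task takes no arguments, it can be handed to executor.async in place of the original lambda; fu.get() then yields a std::optional, as it did before this release. Unlike the removed fu.cancel(), the flag is only checked when the task starts.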

Dropped in-place support for running tf::cudaFlow from a dedicated task

// previous - no longer supported
taskflow.emplace([](tf::cudaFlow& cf){
  cf.offload();
});

// now - users fully control tf::cudaFlow for maximum flexibility
taskflow.emplace([](){
  tf::cudaFlow cf;

  // offload the cudaflow asynchronously through a stream
  tf::cudaStream stream;
  cf.run(stream);

  // wait for the cudaflow to complete
  stream.synchronize();
});

Dropped in-place support for running tf::cudaFlowCapturer from a dedicated task

// previous - no longer supported
taskflow.emplace([](tf::cudaFlowCapturer& cf){
  cf.offload();
});

// now - users fully control tf::cudaFlowCapturer for maximum flexibility
taskflow.emplace([](){
  tf::cudaFlowCapturer cf;

  // offload the cudaflow capturer asynchronously through a stream
  tf::cudaStream stream;
  cf.run(stream);

  // wait for the cudaflow capturer to complete
  stream.synchronize();
});

Dropped in-place support for running tf::syclFlow from a dedicated task
- SYCL can just be used out of box together with Taskflow
Moved all buffer query methods of CUDA standard algorithms into the execution policy
- tf::cudaExecutionPolicy<NT, VT>::reduce_bufsz
- tf::cudaExecutionPolicy<NT, VT>::scan_bufsz
- tf::cudaExecutionPolicy<NT, VT>::merge_bufsz
- tf::cudaExecutionPolicy<NT, VT>::min_element_bufsz
- tf::cudaExecutionPolicy<NT, VT>::max_element_bufsz

// previous - no longer supported
tf::cuda_reduce_buffer_size<tf::cudaDefaultExecutionPolicy, int>(N);

// now (and similarly for other parallel algorithms)
tf::cudaDefaultExecutionPolicy policy(stream);
policy.reduce_bufsz<int>(N);

Renamed tf::Executor::run_and_wait to tf::Executor::corun for expressiveness
Renamed tf::Executor::loop_until to tf::Executor::corun_until for expressiveness
Renamed tf::Runtime::run_and_wait to tf::Runtime::corun for expressiveness
Disabled argument support for all asynchronous tasking features
- users are responsible for wrapping any arguments into a callable that takes none

// previous - async allows passing arguments to the callable
executor.async([](int i){ std::cout << i << std::endl; }, 4);  

// now - users are responsible for wrapping the arguments into a callable
executor.async([i=4](){ std::cout << i << std::endl; });
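
When many call sites previously passed arguments to async, the wrapping can be factored into a small helper. The following is a minimal sketch under that assumption; bind_arguments and the namespace sketch are hypothetical names and not part of Taskflow. The helper captures the callable and its arguments by value and returns a callable that takes no arguments.

```cpp
#include <cassert>
#include <tuple>
#include <utility>

namespace sketch {  // hypothetical namespace, not part of Taskflow

// Stores copies of `f` and `args` and returns a zero-argument callable
// that invokes f(args...) each time it is called.
template <typename F, typename... Args>
auto bind_arguments(F&& f, Args&&... args) {
  return [f = std::forward<F>(f),
          bound = std::make_tuple(std::forward<Args>(args)...)]()
             -> decltype(auto) {
    return std::apply(f, bound);
  };
}

// Example use: the bound task takes no arguments.
inline void example() {
  auto task = bind_arguments([](int a, int b) { return a + b; }, 4, 5);
  assert(task() == 9);
}

}  // namespace sketch
```

The returned callable has the zero-argument shape that the new async interface expects. Arguments are copied or moved into the closure, so reference parameters need std::ref or an explicit capture instead.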

Replaced named_async with an async overload that takes the name string as the first argument

// previous - explicitly calling named_async to assign a name to an async task
executor.named_async("name", [](){});

// now - use the async overload that takes a name
executor.async("name", [](){});

Documentation

Revised Request Cancellation to remove support for cancelling async tasks
Revised Asynchronous Tasking to include asynchronous tasking from tf::Runtime
- Launch Asynchronous Tasks from a Runtime
Revised Taskflow algorithms to include execution policy
Added Task-parallel Pipeline with Token Dependencies
Added Parallel Scan
Added Asynchronous Tasking with Dependencies

Miscellaneous Items

We have published Taskflow in the following venues:

Dian-Lun Lin, Yanqing Zhang, Haoxing Ren, Shih-Hsin Wang, Brucek Khailany and Tsung-Wei Huang, "GenFuzz: GPU-accelerated Hardware Fuzzing using Genetic Algorithm with Multiple Inputs," ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, 2023
Tsung-Wei Huang, "qTask: Task-parallel Quantum Circuit Simulation with Incrementality," IEEE International Parallel and Distributed Processing Symposium (IPDPS), St. Petersburg, Florida, 2023
Elmir Dzaka, Dian-Lun Lin, and Tsung-Wei Huang, "Parallel And-Inverter Graph Simulation Using a Task-graph Computing System," IEEE International Parallel and Distributed Processing Symposium Workshop (IPDPSW), St. Petersburg, Florida, 2023

Please do not hesitate to contact Dr. Tsung-Wei Huang if you intend to collaborate with us on using Taskflow in your scientific computing projects.