tf::cudaExecutionPolicy class

class to define execution policy for CUDA standard algorithms

Template parameters
NT	number of threads per block
VT	number of work units per thread

An execution policy configures the kernel execution parameters of CUDA algorithms. The first template argument, NT, is the number of threads per block and should always be a power of two. The second template argument, VT, is the number of work units per thread; an odd number is recommended to avoid bank conflicts.

See Execution Policy for details.
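The two constraints on NT and VT can be checked at compile time. The following is a minimal host-only sketch, not Taskflow's implementation: PolicyParams, is_pow2, and is_odd are hypothetical names introduced for illustration. It assumes the power-of-two requirement on NT is treated as a hard error, while the odd-VT recommendation is only reported as a flag.

```cpp
#include <cassert>

// Hypothetical helpers (not part of Taskflow): compile-time parameter checks.
constexpr bool is_pow2(unsigned x) { return x != 0 && (x & (x - 1)) == 0; }
constexpr bool is_odd(unsigned x) { return (x & 1u) == 1u; }

// Hypothetical stand-in that mirrors the NT/VT template parameters.
template <unsigned NT, unsigned VT>
struct PolicyParams {
  // NT "should always be a power-of-two number": reject anything else.
  static_assert(is_pow2(NT), "NT must be a power of two");

  static constexpr unsigned nt = NT;
  static constexpr unsigned vt = VT;

  // VT is only *recommended* to be odd, so expose a flag instead of failing.
  static constexpr bool vt_is_odd = is_odd(VT);
};

// 512 threads per block, 9 work units per thread satisfies both rules.
static_assert(PolicyParams<512, 9>::vt_is_odd, "VT = 9 is odd");
```

Instantiating PolicyParams<384, 9> in this sketch fails to compile because 384 is not a power of two, whereas PolicyParams<512, 8> compiles but reports vt_is_odd as false.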

Public static variables

static const unsigned nt: static constant for getting the number of threads per block
static const unsigned vt: static constant for getting the number of work units per thread
static const unsigned nv: static constant for getting the number of elements to process per block

Public static functions

static auto num_blocks(unsigned N) -> unsigned: queries the number of blocks to accommodate N elements
template<typename T> static auto reduce_bufsz(unsigned count) -> unsigned: queries the buffer size in bytes needed to call reduce kernels
template<typename T> static auto min_element_bufsz(unsigned count) -> unsigned: queries the buffer size in bytes needed to call tf::cuda_min_element
template<typename T> static auto max_element_bufsz(unsigned count) -> unsigned: queries the buffer size in bytes needed to call tf::cuda_max_element
template<typename T> static auto scan_bufsz(unsigned count) -> unsigned: queries the buffer size in bytes needed to call scan kernels
static auto merge_bufsz(unsigned a_count, unsigned b_count) -> unsigned: queries the buffer size in bytes needed for CUDA merge algorithms
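To make the relationship between nt, vt, nv, and num_blocks concrete, here is a host-only sketch. PolicyModel is a hypothetical stand-in, not the Taskflow class; it assumes that nv equals nt * vt and that num_blocks rounds N / nv up to the next integer.

```cpp
#include <cassert>

// Hypothetical model of the static members (assumptions noted inline).
template <unsigned NT, unsigned VT>
struct PolicyModel {
  static constexpr unsigned nt = NT;       // threads per block
  static constexpr unsigned vt = VT;       // work units per thread
  static constexpr unsigned nv = NT * VT;  // assumed: elements per block

  // Assumed: ceiling division, so a partially filled last block is counted.
  // Note: N + nv - 1 wraps around for N close to the maximum unsigned value.
  static constexpr unsigned num_blocks(unsigned N) {
    return (N + nv - 1) / nv;
  }
};

// With 512 threads and 9 work units each, one block covers 4608 elements.
static_assert(PolicyModel<512, 9>::nv == 4608u, "512 * 9");
```

Under these assumptions, 10000 elements need three blocks: two full blocks cover 9216 elements and a third handles the remaining 784.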

Constructors, destructors, conversion operators

cudaExecutionPolicy() defaulted: constructs an execution policy object with the default stream
cudaExecutionPolicy(cudaStream_t s) explicit: constructs an execution policy object with the given stream

Public functions

auto stream() -> cudaStream_t noexcept: queries the associated stream
void stream(cudaStream_t stream) noexcept: assigns a stream

Function documentation

template<unsigned NT, unsigned VT> template<typename T>
static unsigned tf::cudaExecutionPolicy<NT, VT>::reduce_bufsz(unsigned count)

queries the buffer size in bytes needed to call reduce kernels

Template parameters
T	value type
Parameters
count	number of elements to reduce

The function is used to allocate a buffer for calling tf::cuda_reduce, tf::cuda_uninitialized_reduce, tf::cuda_transform_reduce, and tf::cuda_uninitialized_transform_reduce.
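As a rough illustration of why a reduction needs scratch space, the sketch below sizes a buffer under an assumed multi-pass scheme: each block writes one partial result, and the partials are reduced again until a single block remains. model_reduce_bufsz, ceil_div, and the nv parameter are hypothetical; the formula is an assumption for illustration and may differ from what reduce_bufsz actually returns, so the library function should always be used when allocating.

```cpp
#include <cassert>

// Hypothetical helper: ceiling division.
constexpr unsigned ceil_div(unsigned n, unsigned d) { return (n + d - 1) / d; }

// Hypothetical sizing model (NOT Taskflow's formula): sum the number of
// per-block partial results over all passes that use more than one block.
// nv is the assumed number of elements processed per block.
template <typename T>
unsigned model_reduce_bufsz(unsigned count, unsigned nv) {
  assert(nv >= 2);  // with nv == 1 the block count would never shrink
  unsigned blocks = ceil_div(count, nv);
  unsigned partials = 0;
  while (blocks > 1) {
    partials += blocks;             // one partial result per block this pass
    blocks = ceil_div(blocks, nv);  // the next pass reduces those partials
  }
  return partials * static_cast<unsigned>(sizeof(T));
}
```

In this model an input that fits in a single block needs no scratch space at all, and the size scales with sizeof(T), which is why the real function takes the value type as a template parameter.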

template<unsigned NT, unsigned VT> template<typename T>
static unsigned tf::cudaExecutionPolicy<NT, VT>::min_element_bufsz(unsigned count)

queries the buffer size in bytes needed to call tf::cuda_min_element

Template parameters
T	value type
Parameters
count	number of elements to search

The function is used to decide the buffer size in bytes for calling tf::cuda_min_element.

template<unsigned NT, unsigned VT> template<typename T>
static unsigned tf::cudaExecutionPolicy<NT, VT>::max_element_bufsz(unsigned count)

queries the buffer size in bytes needed to call tf::cuda_max_element

Template parameters
T	value type
Parameters
count	number of elements to search

The function is used to decide the buffer size in bytes for calling tf::cuda_max_element.

template<unsigned NT, unsigned VT> template<typename T>
static unsigned tf::cudaExecutionPolicy<NT, VT>::scan_bufsz(unsigned count)

queries the buffer size in bytes needed to call scan kernels

Template parameters
T	value type
Parameters
count	number of elements to scan

The function is used to allocate a buffer for calling tf::cuda_inclusive_scan, tf::cuda_exclusive_scan, tf::cuda_transform_inclusive_scan, and tf::cuda_transform_exclusive_scan.

template<unsigned NT, unsigned VT>
static unsigned tf::cudaExecutionPolicy<NT, VT>::merge_bufsz(unsigned a_count, unsigned b_count)

queries the buffer size in bytes needed for CUDA merge algorithms

Parameters
a_count	number of elements in the first vector to merge
b_count	number of elements in the second vector to merge

The buffer size of the merge algorithms does not depend on the data type. The buffer is used only for storing the temporary indices (of type unsigned) required during the merge process.

The function is used to allocate a buffer for calling tf::cuda_merge and tf::cuda_merge_by_key.
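The type independence described above can be sketched as follows. model_merge_bufsz and its nv parameter are hypothetical; the sketch assumes one partition index per block boundary over the combined input, which is an illustrative guess rather than the formula used by merge_bufsz.

```cpp
#include <cassert>

// Hypothetical sizing model (NOT Taskflow's formula): the buffer holds only
// unsigned indices, so the element type of the merged data never appears.
// nv is the assumed number of output elements produced per block.
unsigned model_merge_bufsz(unsigned a_count, unsigned b_count, unsigned nv) {
  assert(nv >= 1);
  // Blocks needed to produce all a_count + b_count merged outputs.
  unsigned blocks = (a_count + b_count + nv - 1) / nv;
  // Assumed: one index per block boundary, i.e. blocks + 1 entries.
  return (blocks + 1) * static_cast<unsigned>(sizeof(unsigned));
}
```

Whatever the exact count of indices is in the library, the point carried over from the text is that the result depends only on a_count and b_count, so the same buffer size serves a merge of int keys and a merge of double keys.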

#include <taskflow/cuda/cuda_execution_policy.hpp>