template<unsigned NT, unsigned VT>
tf::cudaExecutionPolicy class

class to define execution policy for CUDA standard algorithms

Template parameters
NT number of threads per block
VT number of work units per thread

An execution policy configures the kernel execution parameters of CUDA algorithms. The first template argument, NT, is the number of threads per block and should always be a power of two. The second template argument, VT, is the number of work units per thread; an odd number is recommended to avoid shared-memory bank conflicts.

See Execution Policy for details.
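
The two recommendations for NT and VT can be checked at compile time. The following is a minimal sketch in plain C++, not Taskflow code: is_pow2, is_odd, and follows_recommendation are hypothetical helpers introduced only for illustration, and the values 512, 7, 96, and 8 are arbitrary examples.

```cpp
// Hypothetical helpers (not part of Taskflow) that encode the two
// recommendations for the template arguments NT and VT.

// True when x is a power of two (1, 2, 4, 8, ...).
constexpr bool is_pow2(unsigned x) {
  return x != 0 && (x & (x - 1)) == 0;
}

// True when x is odd.
constexpr bool is_odd(unsigned x) {
  return (x & 1u) == 1u;
}

// True when NT is a power of two and VT is odd.
template <unsigned NT, unsigned VT>
constexpr bool follows_recommendation() {
  return is_pow2(NT) && is_odd(VT);
}

// Example values chosen for illustration only.
static_assert(follows_recommendation<512, 7>(), "512 threads, 7 units: OK");
static_assert(!follows_recommendation<96, 7>(), "96 is not a power of two");
static_assert(!follows_recommendation<512, 8>(), "8 is even");
```

A check of this kind catches a poorly chosen pair of arguments at compile time, before any kernel is launched.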

Public static variables

static const unsigned nt
static constant for getting the number of threads per block
static const unsigned vt
static constant for getting the number of work units per thread
static const unsigned nv
static constant for getting the number of elements to process per block

Public static functions

static auto num_blocks(unsigned N) -> unsigned
queries the number of blocks needed to accommodate N elements
template<typename T>
static auto reduce_bufsz(unsigned count) -> unsigned
queries the buffer size in bytes needed to call reduce kernels
template<typename T>
static auto min_element_bufsz(unsigned count) -> unsigned
queries the buffer size in bytes needed to call tf::cuda_min_element
template<typename T>
static auto max_element_bufsz(unsigned count) -> unsigned
queries the buffer size in bytes needed to call tf::cuda_max_element
template<typename T>
static auto scan_bufsz(unsigned count) -> unsigned
queries the buffer size in bytes needed to call scan kernels
static auto merge_bufsz(unsigned a_count, unsigned b_count) -> unsigned
queries the buffer size in bytes needed for CUDA merge algorithms
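
As a rough illustration of how the static members relate, the sketch below models a policy in plain C++. It assumes that nv equals nt * vt and that num_blocks rounds N / nv up to the next integer. The reference describes nv only as the number of elements to process per block, so both formulas are assumptions. PolicySketch and ExamplePolicy are hypothetical stand-ins, not the Taskflow class.

```cpp
// Hypothetical stand-in for illustration; this is NOT tf::cudaExecutionPolicy.
template <unsigned NT, unsigned VT>
struct PolicySketch {
  static constexpr unsigned nt = NT;       // threads per block
  static constexpr unsigned vt = VT;       // work units per thread
  static constexpr unsigned nv = NT * VT;  // elements per block (assumed nt * vt)

  // Assumed formula: ceiling division of N by the per-block element count.
  static constexpr unsigned num_blocks(unsigned N) {
    return (N + nv - 1) / nv;
  }
};

using ExamplePolicy = PolicySketch<512, 7>;  // example values; nv == 3584

static_assert(ExamplePolicy::nv == 3584, "512 threads * 7 units per thread");
static_assert(ExamplePolicy::num_blocks(0) == 0, "no elements, no blocks");
static_assert(ExamplePolicy::num_blocks(3584) == 1, "exactly one full block");
static_assert(ExamplePolicy::num_blocks(3585) == 2, "one extra element needs a second block");
```

Under these assumptions, a larger VT lowers the number of blocks for the same N, because each thread handles more work units.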

Constructors, destructors, conversion operators

cudaExecutionPolicy() defaulted
constructs an execution policy object with the default stream
cudaExecutionPolicy(cudaStream_t s) explicit
constructs an execution policy object with the given stream

Public functions

auto stream() -> cudaStream_t noexcept
queries the associated stream
void stream(cudaStream_t stream) noexcept
assigns a stream

Function documentation

template<unsigned NT, unsigned VT> template<typename T>
static unsigned tf::cudaExecutionPolicy<NT, VT>::reduce_bufsz(unsigned count)

queries the buffer size in bytes needed to call reduce kernels

Template parameters
T value type
Parameters
count number of elements to reduce

Use the returned size to allocate a buffer for calling tf::cuda_reduce, tf::cuda_uninitialized_reduce, tf::cuda_transform_reduce, and tf::cuda_uninitialized_transform_reduce.

template<unsigned NT, unsigned VT> template<typename T>
static unsigned tf::cudaExecutionPolicy<NT, VT>::min_element_bufsz(unsigned count)

queries the buffer size in bytes needed to call tf::cuda_min_element

Template parameters
T value type
Parameters
count number of elements to search

Use this function to determine the size, in bytes, of the buffer needed to call tf::cuda_min_element.

template<unsigned NT, unsigned VT> template<typename T>
static unsigned tf::cudaExecutionPolicy<NT, VT>::max_element_bufsz(unsigned count)

queries the buffer size in bytes needed to call tf::cuda_max_element

Template parameters
T value type
Parameters
count number of elements to search

Use this function to determine the size, in bytes, of the buffer needed to call tf::cuda_max_element.

template<unsigned NT, unsigned VT> template<typename T>
static unsigned tf::cudaExecutionPolicy<NT, VT>::scan_bufsz(unsigned count)

queries the buffer size in bytes needed to call scan kernels

Template parameters
T value type
Parameters
count number of elements to scan

Use the returned size to allocate a buffer for calling tf::cuda_inclusive_scan, tf::cuda_exclusive_scan, tf::cuda_transform_inclusive_scan, and tf::cuda_transform_exclusive_scan.

template<unsigned NT, unsigned VT>
static unsigned tf::cudaExecutionPolicy<NT, VT>::merge_bufsz(unsigned a_count, unsigned b_count)

queries the buffer size in bytes needed for CUDA merge algorithms

Parameters
a_count number of elements in the first vector to merge
b_count number of elements in the second vector to merge

The buffer size of the merge algorithms does not depend on the data type. The buffer is used only for storing temporary indices (of type unsigned) required during the merge process.

Use the returned size to allocate a buffer for calling tf::cuda_merge and tf::cuda_merge_by_key.
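
To illustrate why the size is independent of the element type, the following sketch models a buffer that holds one unsigned index per block plus one terminating index. That index count is an assumption borrowed from typical merge-path partitioning, not something this reference states, and the nv and num_blocks formulas are assumed as well. MergePolicySketch and MergeExample are hypothetical stand-ins rather than the Taskflow class.

```cpp
// Hypothetical stand-in for illustration; this is NOT tf::cudaExecutionPolicy.
template <unsigned NT, unsigned VT>
struct MergePolicySketch {
  static constexpr unsigned nv = NT * VT;  // assumed elements per block

  // Assumed formula: ceiling division of N by nv.
  static constexpr unsigned num_blocks(unsigned N) {
    return (N + nv - 1) / nv;
  }

  // Assumed layout: one partition index per block, plus one end marker.
  // Only sizeof(unsigned) appears; the element type of the inputs does not.
  static constexpr unsigned merge_bufsz(unsigned a_count, unsigned b_count) {
    return static_cast<unsigned>(
      sizeof(unsigned) * (num_blocks(a_count + b_count) + 1)
    );
  }
};

using MergeExample = MergePolicySketch<512, 7>;  // example values; nv == 3584

// 1000 + 2000 = 3000 elements fit in one block of 3584: 2 indices.
static_assert(MergeExample::merge_bufsz(1000, 2000) == 2 * sizeof(unsigned), "");
// 4000 + 4000 = 8000 elements need 3 blocks: 4 indices.
static_assert(MergeExample::merge_bufsz(4000, 4000) == 4 * sizeof(unsigned), "");
```

Because merge_bufsz takes no type parameter, the same call sizes the buffer whether the merged ranges hold int, float, or a user-defined type.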