We implement a parallel sparse matrix-vector multiplication (SpMV) using tf::Taskflow::for_each_by_index. This example highlights why irregular workloads like SpMV require careful partitioner selection and demonstrates how different partitioners affect load balance across workers.
Many scientific and engineering problems involve matrices where the vast majority of entries are zero. Examples include finite element stiffness matrices, graph adjacency matrices, network flow problems, and PageRank computations. Storing and operating on all N*N entries of such an N-by-N matrix is wasteful; instead, only the non-zero entries are stored and processed.
SpMV computes the product y = A*x, where A is a sparse matrix, x is a dense input vector, and y is a dense output vector. For each row i of A, the result is the dot product of that row with x: y[i] = sum over j of A[i][j]*x[j], where only the non-zero entries of row i contribute to the sum.
SpMV is one of the most important kernels in scientific computing. It appears in iterative linear solvers (conjugate gradient, GMRES), graph analytics (PageRank, label propagation), and machine learning (sparse neural networks, graph neural networks). Making SpMV fast on multi-core CPUs is therefore broadly valuable.
The most widely used sparse matrix format is Compressed Sparse Row (CSR). CSR stores only the non-zero values and their column positions, organized row by row. It uses three arrays:
- values: the non-zero values in row-major order
- col_idx: the column index of each non-zero value
- row_ptr: the starting position in values and col_idx for each row

The number of non-zeros in row i is row_ptr[i+1] - row_ptr[i], and the non-zeros themselves are at positions row_ptr[i] through row_ptr[i+1]-1 in values and col_idx. To make this concrete, consider the following 5x5 sparse matrix:
Its CSR representation is:
To compute y[2] for example, the non-zeros in row 2 start at row_ptr[2]=4 and end before row_ptr[3]=6. So y[2] = values[4]*x[col_idx[4]] + values[5]*x[col_idx[5]] = 1*x[0] + 4*x[2]. This format is efficient for SpMV because each row can be processed independently: row i reads from x at the column positions it needs and writes a single scalar to y[i]. There are no write conflicts between rows, making all rows safe to process in parallel.
The key challenge of parallelizing SpMV is that rows have different numbers of non-zeros. In real sparse matrices this variation can be extreme: a web graph has a few hub pages with millions of links and millions of ordinary pages with just a handful. A matrix from an adaptive mesh may have a few boundary rows with many neighbours and many interior rows with just six.
For matrices with uniform row lengths (banded matrices, regular meshes), static partitioning is perfectly fine and has lower overhead. For matrices with highly variable row lengths, guided or dynamic partitioning is the better choice.
The complete self-contained example below constructs a small CSR matrix by hand, runs SpMV in parallel with three different partitioners, and verifies that all three produce the same result:
The expected output for x = {1, 2, 3, 4, 5} is:
All three partitioners produce identical results; the partitioner only affects how rows are distributed among workers, not the computation itself. You can verify the first row manually: y[0] = 3*x[0] + 1*x[3] = 3*1 + 1*4 = 7.
There are a few important design points worth noting for this example, which also apply generally to parallel sparse linear algebra algorithms:
- Exclusive writes: row i writes exclusively to y[i]. Because two rows never write to the same output element, no atomic operations or locks are needed on y, regardless of which rows are processed simultaneously.
- Shared read-only x: multiple workers read from x concurrently but never write to it during the sweep. Concurrent reads from a std::vector are safe without synchronization.
- Memory access pattern: the inner loop over k accesses values and col_idx sequentially, which is cache-friendly within a single row. However, the gather access x[col_idx[k]] is irregular since column indices can point anywhere in x, and this is typically the memory-bandwidth bottleneck of SpMV on large sparse matrices.