Matrix Multiplication
We study the classic problem of 2D matrix multiplication. We start with a short introduction to the problem and then discuss how to solve it on parallel CPUs.
Problem Formulation
We are multiplying two matrices, A (MxK) and B (KxN). The number of columns of A must match the number of rows of B. The output matrix C has the shape (MxN), where M is the number of rows of A and N is the number of columns of B. The following example multiplies a 3x3 matrix with a 3x2 matrix to derive a 3x2 matrix.
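For instance, with arbitrarily chosen integer entries:

$$
\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{pmatrix}
=
\begin{pmatrix} 4 & 5 \\ 10 & 11 \\ 16 & 17 \end{pmatrix}
$$

Here, for example, the top-left entry of C is 1*1 + 2*0 + 3*1 = 4, the inner product of the first row of A and the first column of B.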
In general, to compute each element of C, we iterate over a complete row of A and a complete column of B, multiplying the corresponding elements and summing the products.
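In index form, with the same zero-based indexing as the code below, this is:

$$
C[m][n] = \sum_{k=0}^{K-1} A[m][k] \cdot B[k][n], \qquad 0 \le m < M,\; 0 \le n < N.
$$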
We can implement matrix multiplication using three nested loops.
```cpp
for(int m=0; m<M; m++) {
  for(int n=0; n<N; n++) {
    C[m][n] = 0;
    for(int k=0; k<K; k++) {
      C[m][n] += A[m][k] * B[k][n];
    }
  }
}
```
Parallel Patterns
At a fine-grained level, the computation of each element of C is independent of the others. Similarly, the computations of the rows of C, or of the columns of C, are independent of one another. With task parallelism, we prefer a coarse-grained model in which each task performs a reasonably large amount of computation to amortize the overhead of creating and scheduling tasks. In this case, we avoid tiny tasks that each work on only a single element and instead create one task per row of C that multiplies that row of A by every column of B, as shown in the code below.
```cpp
#include <taskflow/taskflow.hpp>

// C = A * B
// A is a MxK matrix, B is a KxN matrix, and C is a MxN matrix
// (C is assumed to be zero-initialized, since the loop accumulates with +=)
void matrix_multiplication(int** A, int** B, int** C, int M, int K, int N) {
  tf::Taskflow taskflow;
  tf::Executor executor;
  // create one task per row of C
  for(int m=0; m<M; ++m) {
    taskflow.emplace([&, m] () {
      for(int n=0; n<N; n++) {
        for(int k=0; k<K; k++) {
          C[m][n] += A[m][k] * B[k][n];  // inner product
        }
      }
    });
  }
  executor.run(taskflow).wait();
}
```
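A minimal driver might look like the following sketch; the allocation helper and the way A and B are filled are illustrative assumptions, not part of the example above:

```cpp
// Hypothetical usage of the matrix_multiplication function above.
int main() {
  const int M = 3, K = 3, N = 2;

  // allocate a rows-by-cols matrix whose entries are value-initialized to 0
  auto alloc = [](int rows, int cols) {
    int** p = new int*[rows];
    for(int i = 0; i < rows; ++i) {
      p[i] = new int[cols]();
    }
    return p;
  };

  int** A = alloc(M, K);
  int** B = alloc(K, N);
  int** C = alloc(M, N);   // C starts at zero, as the += accumulation expects

  // ... fill A and B with application data ...

  matrix_multiplication(A, B, C, M, K, N);

  // ... use C, then free A, B, and C ...
  return 0;
}
```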
Instead of creating tasks one by one in a loop, you can leverage tf::Taskflow::for_each_index to create a parallel-for task:
```cpp
// perform parallel iterations on the range [0, M) with a step size of 1
tf::Task task = taskflow.for_each_index(0, M, 1, [&] (int m) {
  for(int n=0; n<N; n++) {
    for(int k=0; k<K; k++) {
      C[m][n] += A[m][k] * B[k][n];
    }
  }
});
```
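Note that this call only creates the parallel-for task inside the taskflow; as in the previous example, you still need an executor to run it:

```cpp
// run the taskflow containing the parallel-for task (same pattern as above)
tf::Executor executor;
executor.run(taskflow).wait();
```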
Please visit Parallel Iterations for more details.
Benchmarking
Based on the discussion above, we compare the runtime of multiplying matrices A, B, and C of various sizes on a sequential CPU versus parallel CPUs, using a machine with 12 Intel i7-8700 CPUs at 3.2 GHz.
A | B | C | CPU Sequential | CPU Parallel |
---|---|---|---|---|
10x10 | 10x10 | 10x10 | 0.142 ms | 0.414 ms |
100x100 | 100x100 | 100x100 | 1.641 ms | 0.733 ms |
1000x1000 | 1000x1000 | 1000x1000 | 1532 ms | 504 ms |
2000x2000 | 2000x2000 | 2000x2000 | 25688 ms | 4387 ms |
3000x3000 | 3000x3000 | 3000x3000 | 104838 ms | 16170 ms |
4000x4000 | 4000x4000 | 4000x4000 | 250133 ms | 39646 ms |
The speed-up of parallel execution becomes clear as we increase the problem size. For example, at 4000x4000, the parallel execution is about 6.3 times faster than the sequential execution.
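For reference, wall-clock comparisons of this kind can be collected with std::chrono. The following is only a sketch (the helper name and the usage lines are illustrative, and the matrices are assumed to be set up as in the examples above), not necessarily how the numbers in the table were produced:

```cpp
#include <chrono>
#include <iostream>

// Illustrative helper: time a callable and return the elapsed milliseconds.
template <typename F>
long long measure_ms(F&& f) {
  auto beg = std::chrono::steady_clock::now();
  f();
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::milliseconds>(end - beg).count();
}

// usage (assuming A, B, C, M, K, N are set up as in the examples above):
//   auto ms = measure_ms([&]{ matrix_multiplication(A, B, C, M, K, N); });
//   std::cout << "parallel: " << ms << " ms\n";
```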