Learning from Examples » Matrix Multiplication

We study the classic problem of 2D matrix multiplication. We start with a short introduction to the problem and then discuss how to solve it on parallel CPUs.

Problem Formulation

We are multiplying two matrices, A (MxK) and B (KxN). The number of columns of A must match the number of rows of B. The output matrix C has the shape MxN, where M is the number of rows of A and N is the number of columns of B. The following example multiplies a 3x3 matrix with a 3x2 matrix to derive a 3x2 matrix.

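With illustrative values (any 3x3 and 3x2 matrices follow the same pattern), each entry of the result is the dot product of a row of the left matrix and a column of the right matrix:

$$
\begin{pmatrix}
1 & 2 & 3 \\
4 & 5 & 6 \\
7 & 8 & 9
\end{pmatrix}
\times
\begin{pmatrix}
1 & 0 \\
0 & 1 \\
1 & 1
\end{pmatrix}
=
\begin{pmatrix}
1{\cdot}1 + 2{\cdot}0 + 3{\cdot}1 & 1{\cdot}0 + 2{\cdot}1 + 3{\cdot}1 \\
4{\cdot}1 + 5{\cdot}0 + 6{\cdot}1 & 4{\cdot}0 + 5{\cdot}1 + 6{\cdot}1 \\
7{\cdot}1 + 8{\cdot}0 + 9{\cdot}1 & 7{\cdot}0 + 8{\cdot}1 + 9{\cdot}1
\end{pmatrix}
=
\begin{pmatrix}
4 & 5 \\
10 & 11 \\
16 & 17
\end{pmatrix}
$$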

In general, to compute each element of C we iterate over a complete row of A and a complete column of B, multiplying the corresponding elements and summing the products.

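Written as a formula, each element of C is an inner product of a row of A and a column of B:

$$
C_{mn} = \sum_{k=0}^{K-1} A_{mk} \, B_{kn},
\qquad 0 \le m < M, \; 0 \le n < N
$$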

We can implement matrix multiplication using three nested loops.

for(int m=0; m<M; m++) {
  for(int n=0; n<N; n++) {
    C[m][n] = 0;
    for(int k=0; k<K; k++) {
      C[m][n] += A[m][k] * B[k][n];
    }
  }
}

Parallel Patterns

At a fine-grained level, computing each element of C is independent of the others. Similarly, computing each row of C or each column of C is independent of one another. With task parallelism, we prefer a coarse-grained model in which each task performs a reasonably large amount of computation to amortize the overhead of creating and scheduling tasks. In this case, we avoid creating a task for every single element and instead create one task per row of C to multiply a row of A by every column of B.

#include <taskflow/taskflow.hpp>

// C = A * B
// A is a MxK matrix, B is a KxN matrix, and C is a MxN matrix
void matrix_multiplication(int** A, int** B, int** C, int M, int K, int N) {
  tf::Taskflow taskflow;
  tf::Executor executor;
  // create one task per row of C
  for(int m=0; m<M; ++m) {
    taskflow.emplace([&, m] () {
      for(int n=0; n<N; n++) {
        C[m][n] = 0;
        for(int k=0; k<K; k++) {
          C[m][n] += A[m][k] * B[k][n];  // inner product
        }
      }
    });
  }
  executor.run(taskflow).wait();
}
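For reference, a minimal driver could look like the sketch below. It assumes the matrix_multiplication function above is in scope; the heap-allocated row-pointer layout, the fill values, and the matrix sizes are illustrative assumptions chosen only to match the int** signature.

#include <cstdio>

int main() {
  const int M = 3, K = 3, N = 2;

  // allocate a rows x cols matrix as an array of zero-initialized row pointers
  auto alloc = [] (int rows, int cols) {
    int** mat = new int*[rows];
    for(int r=0; r<rows; r++) {
      mat[r] = new int[cols]();
    }
    return mat;
  };
  int** A = alloc(M, K);
  int** B = alloc(K, N);
  int** C = alloc(M, N);

  // fill A and B with some arbitrary values
  for(int m=0; m<M; m++) for(int k=0; k<K; k++) A[m][k] = m + k;
  for(int k=0; k<K; k++) for(int n=0; n<N; n++) B[k][n] = k * n;

  // run the parallel matrix multiplication defined above
  matrix_multiplication(A, B, C, M, K, N);

  // print the result
  for(int m=0; m<M; m++) {
    for(int n=0; n<N; n++) {
      std::printf("%d ", C[m][n]);
    }
    std::printf("\n");
  }
  // (deallocation omitted for brevity)
  return 0;
}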

Instead of creating tasks one by one in a loop, you can leverage Taskflow::for_each_index to create a parallel-for task. A parallel-for task spawns a subflow to perform parallel iterations over the given range.

// perform parallel iterations on the range [0, M) with the step size of 1
tf::Task task = taskflow.for_each_index(0, M, 1, [&] (int m) {
  for(int n=0; n<N; n++) {
    C[m][n] = 0;
    for(int k=0; k<K; k++) {
      C[m][n] += A[m][k] * B[k][n];
    }
  }
});
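Putting it together, the whole function could be rewritten with for_each_index as in the sketch below (same signature as before):

#include <taskflow/taskflow.hpp>

// C = A * B using a single parallel-for task over the rows of C
void matrix_multiplication(int** A, int** B, int** C, int M, int K, int N) {
  tf::Taskflow taskflow;
  tf::Executor executor;
  taskflow.for_each_index(0, M, 1, [&] (int m) {
    for(int n=0; n<N; n++) {
      C[m][n] = 0;
      for(int k=0; k<K; k++) {
        C[m][n] += A[m][k] * B[k][n];
      }
    }
  });
  executor.run(taskflow).wait();
}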

Please visit Parallel Iterations for more details.

Benchmarking

Based on the discussion above, we compare the runtime of multiplying matrices of various sizes of A, B, and C between a sequential CPU baseline and the parallel CPU implementation, on a machine with 12 Intel i7-8700 CPUs at 3.2 GHz.

A          B          C          CPU Sequential  CPU Parallel
10x10      10x10      10x10      0.142 ms        0.414 ms
100x100    100x100    100x100    1.641 ms        0.733 ms
1000x1000  1000x1000  1000x1000  1532 ms         504 ms
2000x2000  2000x2000  2000x2000  25688 ms        4387 ms
3000x3000  3000x3000  3000x3000  104838 ms       16170 ms
4000x4000  4000x4000  4000x4000  250133 ms       39646 ms

The speed-up of parallel execution becomes clear as we increase the problem size. For example, at 4000x4000, the parallel runtime is 6.3 times faster than the sequential runtime.
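The exact benchmarking harness is not shown here, but a minimal way to reproduce such a comparison is to time each version with std::chrono; the helper and the two function names in the usage comments below are hypothetical placeholders for the sequential and parallel implementations.

#include <chrono>
#include <cstdio>

// times a callable and returns the elapsed wall-clock time in milliseconds
template <typename F>
long long measure_ms(F&& f) {
  auto beg = std::chrono::steady_clock::now();
  f();
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::milliseconds>(end - beg).count();
}

// usage (assuming both versions are available):
// long long seq = measure_ms([&] () { matrix_multiplication_seq(A, B, C, M, K, N); });
// long long par = measure_ms([&] () { matrix_multiplication(A, B, C, M, K, N); });
// std::printf("sequential: %lld ms, parallel: %lld ms\n", seq, par);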