CS 5220

Applications of Parallel Computers

Parallel matrix multiply

Prof David Bindel


Matrix vector product

Simple \(y = Ax\) involves two indices: \[ y_i = \sum_{j} A_{ij} x_j \] Sums can go in any order!

Matrix vector product

Organize \(y = Ax\) around rows or columns:

% Row-oriented
for i = 1:n
  y(i) = A(i,:)*x;
end

% Col-oriented
y = 0;
for j = 1:n
  y = y + A(:,j)*x(j);
end

... or deal with index space in other ways!
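The two loop orders above can be checked against each other in a short NumPy sketch (a serial illustration, not the course's code; names like `y_row`/`y_col` are made up):

```python
import numpy as np

# The two orderings from the slide, translated from MATLAB to NumPy.
rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

# Row-oriented: y(i) = A(i,:)*x, one dot product per row
y_row = np.zeros(n)
for i in range(n):
    y_row[i] = A[i, :] @ x

# Column-oriented: y = y + A(:,j)*x(j), accumulate scaled columns
y_col = np.zeros(n)
for j in range(n):
    y_col += A[:, j] * x[j]

assert np.allclose(y_row, y_col)
assert np.allclose(y_row, A @ x)
```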

Parallel matvec: 1D row-blocked

Broadcast \(x\), compute rows independently.

Parallel matvec: 1D row-blocked

Allgather(xlocal, xall)
ylocal = Alocal * xall
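A serial simulation of the row-blocked scheme (hypothetical block sizes; the `Allgather` is modeled by concatenating the local pieces of `x`):

```python
import numpy as np

# Simulate p "processors", each owning nloc rows of A and nloc entries of x.
rng = np.random.default_rng(1)
p, nloc = 4, 3
n = p * nloc
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

xlocals = [x[r*nloc:(r+1)*nloc] for r in range(p)]
xall = np.concatenate(xlocals)        # Allgather(xlocal, xall)
ylocals = [A[r*nloc:(r+1)*nloc, :] @ xall for r in range(p)]  # ylocal = Alocal * xall

y = np.concatenate(ylocals)
assert np.allclose(y, A @ x)
```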

Parallel matvec: 1D col-blocked

Compute partial matvecs and reduce.

Parallel matvec: 1D col-blocked

z = Alocal * xlocal
for i = 1:p
    Reduce(sum, z[i], ylocal at proc i)
end
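A serial simulation of the column-blocked scheme (made-up block sizes; the `Reduce` is modeled by summing the partial products, then splitting `y` among processors):

```python
import numpy as np

# Processor j owns block column A(:,j) and block x(j); its partial
# matvec z_j has full length n, and the pieces of y come from a sum.
rng = np.random.default_rng(2)
p, nloc = 4, 3
n = p * nloc
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

zs = [A[:, j*nloc:(j+1)*nloc] @ x[j*nloc:(j+1)*nloc] for j in range(p)]
y = np.sum(zs, axis=0)                          # Reduce(sum, ...) over processors
ylocals = [y[i*nloc:(i+1)*nloc] for i in range(p)]  # ylocal at proc i

assert np.allclose(np.concatenate(ylocals), A @ x)
```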

Parallel matvec: 2D blocked

  • Involves broadcast and reduction
  • ... but with subsets of processors

Parallel matmul

  • Basic operation: \(C = C+AB\)
  • Computation: \(2n^3\) flops
  • Goal: \(2n^3/p\) flops per processor, minimal communication
  • Two main contenders: SUMMA and Cannon

1D layout

  • Block MATLAB notation: \(A(:,j)\) means \(j\)th block column
  • Processor \(j\) owns \(A(:,j)\), \(B(:,j)\), \(C(:,j)\)
  • \(C(:,j)\) depends on all of \(A\), but only \(B(:,j)\)
  • How do we communicate pieces of \(A\)?

1D layout on ring

  • Every process \(j\) can send data to \(j+1\) simultaneously
  • Pass slices of \(A\) around the ring until everyone sees the whole matrix (\(p-1\) phases).

1D layout on ring

tmp = A(:,myproc)
C(myproc) += tmp*B(myproc,myproc)
for j = 1 to p-1
  sendrecv tmp to myproc+1 mod p, 
           from myproc-1 mod p
  C(myproc) += tmp*B(myproc-j mod p, myproc)
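A serial NumPy sketch of the ring algorithm above (block width `nb` is a made-up parameter): the list `tmp[q]` stands for the block column of `A` currently held by processor `q`, and the list shift models the `sendrecv`.

```python
import numpy as np

rng = np.random.default_rng(3)
p, nb = 4, 2
n = p * nb
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
C = np.zeros((n, n))

def Bblk(i, q):                       # block B(i, q) in the slide's notation
    return B[i*nb:(i+1)*nb, q*nb:(q+1)*nb]

tmp = [A[:, q*nb:(q+1)*nb] for q in range(p)]   # tmp = A(:,myproc)
for q in range(p):
    C[:, q*nb:(q+1)*nb] += tmp[q] @ Bblk(q, q)
for j in range(1, p):
    tmp = [tmp[(q - 1) % p] for q in range(p)]  # pass tmp around the ring
    for q in range(p):
        C[:, q*nb:(q+1)*nb] += tmp[q] @ Bblk((q - j) % p, q)

assert np.allclose(C, A @ B)
```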

Performance model?

1D layout on ring

In a simple \(\alpha\)-\(\beta\) model, at each processor:

  • \(p-1\) message sends (and simultaneous receives)
  • Each message involves \(n^2/p\) data
  • Communication cost: \((p-1) \alpha + (1-1/p) n^2 \beta\)

Outer product algorithm

Recall outer product organization:

for k = 0:s-1
  C += A(:,k)*B(k,:);
end
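A quick check of the outer-product organization in NumPy (serial, illustrative):

```python
import numpy as np

# C = sum_k A(:,k) * B(k,:): accumulate one rank-1 update per step.
rng = np.random.default_rng(4)
n = 5
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

C = np.zeros((n, n))
for k in range(n):
    C += np.outer(A[:, k], B[k, :])   # C += A(:,k)*B(k,:)

assert np.allclose(C, A @ B)
```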

Outer product algorithm

Parallel: Assume \(p = s^2\) processors, block \(s \times s\) matrices.
For a \(2 \times 2\) example: \[\begin{bmatrix} C_{00} & C_{01} \\ C_{10} & C_{11} \end{bmatrix} = \begin{bmatrix} A_{00} B_{00} & A_{00} B_{01} \\ A_{10} B_{00} & A_{10} B_{01} \end{bmatrix} + \begin{bmatrix} A_{01} B_{10} & A_{01} B_{11} \\ A_{11} B_{10} & A_{11} B_{11} \end{bmatrix} \]

  • Processor for each \((i,j)\) \(\implies\) parallel work for each \(k\)!
  • Note everyone in row \(i\) uses \(A(i,k)\) at once,
    and everyone in column \(j\) uses \(B(k,j)\) at once.

Parallel outer product (SUMMA)

for k = 0:s-1
  for each i in parallel
    broadcast A(i,k) to row
  for each j in parallel
    broadcast B(k,j) to col
  On processor (i,j), C(i,j) += A(i,k)*B(k,j);
end
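A serial simulation of SUMMA on an \(s \times s\) block grid (block size `nb` is a made-up parameter; the broadcasts are modeled by reading the shared blocks directly):

```python
import numpy as np

rng = np.random.default_rng(5)
s, nb = 3, 2
n = s * nb
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
C = np.zeros((n, n))

for k in range(s):
    for i in range(s):
        Aik = A[i*nb:(i+1)*nb, k*nb:(k+1)*nb]      # broadcast A(i,k) along row i
        for j in range(s):
            Bkj = B[k*nb:(k+1)*nb, j*nb:(j+1)*nb]  # broadcast B(k,j) along column j
            C[i*nb:(i+1)*nb, j*nb:(j+1)*nb] += Aik @ Bkj  # C(i,j) += A(i,k)*B(k,j)

assert np.allclose(C, A @ B)
```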

Parallel outer product (SUMMA)

If we have tree along each row/column, then

  • \(\log(s)\) messages per broadcast
  • \(\alpha + \beta n^2/s^2\) per message
  • \(2 \log(s) (\alpha s + \beta n^2/s)\) total communication
  • Compare to 1D ring: \((p-1) \alpha + (1-1/p) n^2 \beta\)

Note: Same ideas work with block size \(b < n/s\)

Cannon’s algorithm

SUMMA + “pass it around?”

Cannon’s algorithm

Idea: Reindex products in block matrix multiply \[\begin{aligned} C(i,j) &= \sum_{k = 0}^{s-1} A(i,k)\, B(k,j) \\ &= \sum_{k = 0}^{s-1} A(i,\, k+i+j \mod s) \; B(k+i+j \mod s,\, j) \end{aligned}\] For a fixed \(k\), a given block of \(A\) (or \(B\)) is needed for the contribution to exactly one \(C(i,j)\) -- no block is wanted in two places at once.

Cannon’s algorithm

\[\begin{bmatrix} C_{00} & C_{01} \\ C_{10} & C_{11} \end{bmatrix} = \begin{bmatrix} A_{00} B_{00} & A_{01} B_{11} \\ A_{11} B_{10} & A_{10} B_{01} \end{bmatrix} + \begin{bmatrix} A_{01} B_{10} & A_{00} B_{01} \\ A_{10} B_{00} & A_{11} B_{11} \end{bmatrix}\]

Cannon’s algorithm

% Move A(i,j) to A(i,i+j)
for i = 0 to s-1
  cycle A(i,:) left by i

% Move B(i,j) to B(i+j,j)
for j = 0 to s-1
  cycle B(:,j) up by j

for k = 0 to s-1
  in parallel:
    C(i,j) = C(i,j) + A(i,j)*B(i,j);
  cycle A(i,:) left by 1
  cycle B(:,j) up by 1
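A serial simulation of Cannon's algorithm (block size `nb` is a made-up parameter): after the initial skew, processor \((i,j)\) holds \(A(i, i+j+k)\) and \(B(i+j+k, j)\) at step \(k\), so each step is one local multiply plus one unit shift.

```python
import numpy as np

rng = np.random.default_rng(6)
s, nb = 3, 2
n = s * nb
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
C = np.zeros((n, n))

# Local blocks held at processor (i,j)
Ablk = [[A[i*nb:(i+1)*nb, j*nb:(j+1)*nb] for j in range(s)] for i in range(s)]
Bblk = [[B[i*nb:(i+1)*nb, j*nb:(j+1)*nb] for j in range(s)] for i in range(s)]

# Initial skew: cycle row i of A left by i, column j of B up by j,
# so processor (i,j) now holds A(i, i+j) and B(i+j, j).
Ablk = [[Ablk[i][(j + i) % s] for j in range(s)] for i in range(s)]
Bblk = [[Bblk[(i + j) % s][j] for j in range(s)] for i in range(s)]

for k in range(s):
    for i in range(s):
        for j in range(s):
            C[i*nb:(i+1)*nb, j*nb:(j+1)*nb] += Ablk[i][j] @ Bblk[i][j]
    Ablk = [[Ablk[i][(j + 1) % s] for j in range(s)] for i in range(s)]  # cycle rows left by 1
    Bblk = [[Bblk[(i + 1) % s][j] for j in range(s)] for i in range(s)]  # cycle columns up by 1

assert np.allclose(C, A @ B)
```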

Cost of Cannon

  • Assume 2D torus topology
  • Initial cyclic shifts: \(\leq s\) messages each (\(\leq 2s\) total)
  • For each phase: \(2\) messages each (\(2s\) total)
  • Each message is size \(n^2/s^2\)
  • Communication cost: \(4s(\alpha + \beta n^2/s^2) = 4(\alpha s + \beta n^2/s)\)
  • This communication cost is optimal!
    ... but SUMMA is simpler, more flexible, almost as good
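For concreteness, the three communication-cost formulas above can be evaluated for sample machine parameters (the values of \(\alpha\), \(\beta\), \(n\), \(p\) below are made up for illustration; base-2 log is assumed for the broadcast trees):

```python
import math

alpha, beta = 1e-6, 1e-9     # assumed latency (s) and per-word cost (s)
n, p = 4096, 64
s = int(math.isqrt(p))       # s x s processor grid

ring   = (p - 1) * alpha + (1 - 1/p) * n**2 * beta
summa  = 2 * math.log2(s) * (alpha * s + beta * n**2 / s)
cannon = 4 * (alpha * s + beta * n**2 / s)

print(f"ring   {ring:.3e}\nsumma  {summa:.3e}\ncannon {cannon:.3e}")
```

With these (assumed) numbers, Cannon beats SUMMA by a constant factor, and both beat the 1D ring.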

Speedup and efficiency

Recall \[\begin{aligned} \mathrm{Speedup} & := t_{\mathrm{serial}} / t_{\mathrm{parallel}} \\ \mathrm{Efficiency} & := \mathrm{Speedup}/p \end{aligned}\]

Speedup and efficiency

Efficiencies:

1D layout: \(\left( 1+O\left( \frac{p}{n} \right) \right)^{-1}\)
SUMMA: \(\left( 1+O\left( \frac{\sqrt{p} \log p}{n} \right) \right)^{-1}\)
Cannon: \(\left( 1+O\left( \frac{\sqrt{p}}{n} \right) \right)^{-1}\)

Assuming no overlap of communication and computation.