Applications of Parallel Computers
Parallel matrix multiply
Prof David Bindel
Matrix vector product
Simple \(y = Ax\) involves two indices: \[
y_i = \sum_{j} A_{ij} x_j
\] Sums can go in any order!
All right. Let’s set the scene by considering how we might organize a dense matrix-vector product in a distributed memory setting. There are only two indices involved here, and one sum per row. But we can organize the computation of the terms, and the reduction within each row, in any way we want!
Matrix vector product
Organize \(y = Ax\) around rows or columns:
% Row-oriented
for i = 1:n
y(i) = A(i,:)*x;
end
% Col-oriented
y = 0;
for j = 1:n
y = y + A(:,j)*x(j);
end
... or deal with index space in other ways!
We typically think about matrix-vector products in two different ways, depending on whether the outer loop is across rows or across columns. The first approach (the one oriented around rows) views the result of a matrix-vector product as a series of dot products, one for each row. The second approach (the one oriented around columns) views the result as a linear combination of the columns of A.
Of course, we can do more complicated things that are somewhere in between! You might even have some ideas how to do this based on your matrix multiply project experience. But first, let’s think through how the row-oriented or column-oriented view of matrix-vector products might play out in a distributed memory setting.
Parallel matvec: 1D row-blocked
Broadcast \(x\), compute rows independently.
Maybe the first parallel algorithm that would come to mind involves partitioning everything (A, x, and y) by rows. In this case, we need everyone to get all of x, and then the remainder of the computation can be done independently at each processor.
Parallel matvec: 1D row-blocked
Allgather(xlocal, xall)
ylocal = Alocal * xall
Here’s how we might express this in MPI-like pseudocode. The first stage is a vector all-gather, where each processor shares its local part of x on input and gets out the overall x vector on output. Then we compute the local part of y as the local part of A times all of x.
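If you want something concrete to run, here is a minimal sketch of the same idea using mpi4py and NumPy. The variable names, the toy problem size, and the assumption that n is divisible by p are mine, not part of the slides.

# Hypothetical mpi4py sketch of the row-blocked matvec (launch with mpirun).
# Assumes n is divisible by p; each rank owns a block of rows of A and of x.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p, rank = comm.Get_size(), comm.Get_rank()

n = 4 * p                          # toy problem size (assumption)
rows = n // p
A_local = np.random.rand(rows, n)  # my block of rows of A
x_local = np.random.rand(rows)     # my block of x

x_all = np.empty(n)
comm.Allgather(x_local, x_all)     # everyone assembles all of x
y_local = A_local @ x_all          # the rest is purely local work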
Parallel matvec: 1D col-blocked
Compute partial matvecs and reduce.
Another way we might do things is to partition the matrix by columns. In this case, we start with independent matrix-vector products, where the columns owned by each processor are multiplied by that processor’s part of x. Then we do reductions in order to sum together all the contributions of these partial matvecs.
Parallel matvec: 1D col-blocked
z = Alocal * xlocal
for i = 1:p
Reduce(sum, z[i], ylocal at proc i)
end
Here’s that same idea, expressed in MPI-like pseudocode. Start with a local matrix vector product, and then sum up all the contributions. Note that we have one reduction per processor, since each processor gets a piece of the output vector.
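Again purely for concreteness, here is a hypothetical mpi4py version; the names and the assumption that n is divisible by p are mine. The loop issues one Reduce per processor, exactly as in the pseudocode (a reduce-scatter would combine them into a single call).

# Hypothetical mpi4py sketch of the column-blocked matvec.
# Rank j owns a block of columns of A and the matching block of x.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p, rank = comm.Get_size(), comm.Get_rank()

n = 4 * p                          # toy problem size (assumption)
cols = n // p
A_local = np.random.rand(n, cols)  # my block of columns of A
x_local = np.random.rand(cols)     # my block of x

z = A_local @ x_local              # full-length partial result
y_local = np.empty(cols)
for i in range(p):                 # one reduction per processor
    comm.Reduce(z[i*cols:(i+1)*cols],
                y_local if rank == i else None,
                op=MPI.SUM, root=i)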
Parallel matvec: 2D blocked
Involves broadcast and reduction
... but with subsets of processors
It will not surprise you that we can mix and match these concepts! For example, consider a case where we partition the matrix as a block 2-by-2 matrix, where each of the four blocks belongs to a different processor. Then we have to do both some broadcasts to make sure that everyone gets the pieces of x that they need, and some reductions in order to accumulate the contributions to y.
It’s maybe worth pausing here to think about the communication patterns needed. Not everyone needs all of x; and not everyone contributes to all of y. Book-keeping in this type of algorithm can be a bit fiddly!
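To see the bookkeeping concretely, here is a small serial NumPy sketch that mimics an s-by-s processor grid; the grid size, block size, and names are assumptions made for illustration. Processor (i,j) only needs the j-th piece of x (a broadcast within processor column j) and only contributes to the i-th piece of y (a reduction within processor row i).

# Serial NumPy sketch mimicking a 2D-blocked matvec on an s-by-s grid.
import numpy as np

s, nb = 2, 3                       # grid dimension and block size (assumptions)
n = s * nb
A = np.random.rand(n, n)
x = np.random.rand(n)

y_part = np.zeros((s, s, nb))      # contribution of processor (i,j) to y(i)
for i in range(s):
    for j in range(s):
        Aij = A[i*nb:(i+1)*nb, j*nb:(j+1)*nb]
        xj = x[j*nb:(j+1)*nb]      # the only piece of x that (i,j) needs
        y_part[i, j] = Aij @ xj

y = y_part.sum(axis=1).reshape(n)  # reduction within each processor row
assert np.allclose(y, A @ x)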
Parallel matmul
Basic operation: \(C = C+AB\)
Computation: \(2n^3\) flops
Goal: \(2n^3/p\) flops per processor, minimal communication
Two main contenders: SUMMA and Cannon
Having warmed up with the matrix-vector product case, let’s move now to matrix-matrix products. Actually, we usually do a matrix-matrix product with addition; this is the operation known as DGEMM. We know that we are going to do 2n^3 flops overall; our goal is to evenly divide this work across processors, and do as little communication as possible.
There are two main contenders for distributed memory matrix-matrix multiply: SUMMA and Cannon’s algorithm. But these are a little complicated, and maybe it’s worth first describing a simpler algorithm as a straw man.
1D layout
Block MATLAB notation: \(A(:,j)\) means the \(j\)th block column
Processor \(j\) owns \(A(:,j)\), \(B(:,j)\), \(C(:,j)\)
\(C(:,j)\) depends on all of \(A\), but only \(B(:,j)\)
How do we communicate pieces of \(A\)?
So, let’s consider a 1D layout of the problem. We want to compute C += A*B, where each processor owns column stripes of A, B, and C. We’ll use pseudo-MATLAB notation to indicate blocks of the matrices. In this layout, to compute the jth block column of C, we only need the jth block column of B; but we need all of A! How do we make sure that everyone can get that column?
1D layout on ring
Every process \(j\) can send data to \(j+1\) simultaneously
Pass slices of \(A\) around the ring until everyone sees the whole matrix (\(p-1\) phases).
Maybe the simplest thing we can do is to use the “pass it around” protocol that we discussed when talking about all-to-all communication. Each processor receives a piece of A from the previous processor in the ring and sends its current piece to the next processor in the ring. Ideally, these communications should overlap with the computation using the current piece of A.
1D layout on ring
tmp = A(:,myproc)
C(:,myproc) += tmp*B(myproc,myproc)
for j = 1 to p-1
sendrecv tmp to myproc+1 mod p,
from myproc-1 mod p
C(:,myproc) += tmp*B(myproc-j mod p, myproc)
end
Performance model?
Here’s that idea expressed in pseudocode. All right, the pseudocode isn’t exactly what I just said – when I spoke, I described overlapping communication and computation, and this pseudocode doesn’t do that yet. But it’s essentially the right idea.
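Here is a serial NumPy sketch that mimics the ring algorithm; the list tmp holds the block column of A currently sitting at each processor, and the index arithmetic mirrors the myproc-j mod p in the pseudocode. Sizes and names are illustrative assumptions.

# Serial NumPy sketch mimicking the 1D ring algorithm.
import numpy as np

p, nb = 4, 3                 # processors and block size (assumptions)
n = p * nb
A = np.random.rand(n, n)
B = np.random.rand(n, n)
C = np.zeros((n, n))

def Bblk(i, j):              # (i,j) block of B
    return B[i*nb:(i+1)*nb, j*nb:(j+1)*nb]

# tmp[q]: block column of A currently held by processor q
tmp = [A[:, q*nb:(q+1)*nb].copy() for q in range(p)]

for j in range(p):           # phase 0 is the initial local multiply
    for q in range(p):       # "in parallel" over the ring
        k = (q - j) % p      # tmp[q] is block column k of A right now
        C[:, q*nb:(q+1)*nb] += tmp[q] @ Bblk(k, q)
    tmp = [tmp[(q - 1) % p] for q in range(p)]  # pass it around the ring

assert np.allclose(C, A @ B)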
1D layout on ring
In a simple \(\alpha-\beta\) model, at each processor:
\(p-1\) message sends (and simultaneous receives)
Each message involves \(n^2/p\) data
Communication cost: \((p-1) \alpha + (1-1/p) n^2 \beta\)
If we ignore the computational cost, how much time are we spending just communicating data in this algorithm? Let’s use a simple alpha-beta type of model. Each message is passing n^2/p data, and we assume all the messages in a phase can overlap. There are p-1 communication phases, so the total cost should be p-1 latencies plus p-1 times n^2/p over the bandwidth.
Just paying attention to bandwidth, the total communication volume scales about like n^2. And the computation cost behaves like n^3/ p. So communication is proportional to p/n times the computation.
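If you want to see how that ratio behaves, here is a throwaway script that plugs made-up machine parameters into the model; alpha, beta, and the per-flop time are purely illustrative assumptions, not measurements.

# Evaluate the alpha-beta model for the 1D ring matmul (made-up parameters).
alpha = 1e-6    # per-message latency, seconds (assumption)
beta = 1e-9     # per-word transfer time, seconds (assumption)
tau = 1e-10     # time per flop, seconds (assumption)

for n, p in [(4096, 16), (4096, 256)]:
    t_comm = (p - 1) * alpha + (1 - 1/p) * n**2 * beta
    t_comp = 2 * n**3 / p * tau
    print(f"n={n}, p={p}: comm/comp ratio = {t_comm/t_comp:.3f}")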
How can we do better than this?
Outer product algorithm
Recall outer product organization:
for k = 0:s-1
C += A(:,k)*B(k,:);
end
Recall that for matrix-matrix products, there are three nested loops, and we get different ways of thinking about the algorithm based on the loop order. The “outer product” algorithm shown here comes from thinking of the contraction index (the one that sits between A and B) as the outer loop. This is equivalent to saying that the matrix-matrix product is a sum of outer products between columns of A and the corresponding rows of B.
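As a quick numerical check of that view (the array sizes here are arbitrary), the sum of rank-one outer products reproduces the ordinary product:

# Matrix-matrix product as a sum of outer products.
import numpy as np

m, s = 5, 4
A = np.random.rand(m, s)
B = np.random.rand(s, m)

C = np.zeros((m, m))
for k in range(s):
    C += np.outer(A[:, k], B[k, :])   # rank-one update from column k, row k

assert np.allclose(C, A @ B)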
Outer product algorithm
Parallel: Assume \(p = s^2\) processors, block \(s \times s\) matrices.
For a \(2 \times 2\) example: \[\begin{bmatrix}
C_{00} & C_{01} \\
C_{10} & C_{11}
\end{bmatrix} =
\begin{bmatrix}
A_{00} B_{00} & A_{00} B_{01} \\
A_{10} B_{00} & A_{10} B_{01}
\end{bmatrix} +
\begin{bmatrix}
A_{01} B_{10} & A_{01} B_{11} \\
A_{11} B_{10} & A_{11} B_{11}
\end{bmatrix}
\]
One processor for each \((i,j)\) \(\implies\) parallel work for each \(k\)!
Note everyone in row \(i\) uses \(A(i,k)\) at once,
and everyone in column \(j\) uses \(B(k,j)\) at once.
All right, now let’s take this to the parallel setting. Suppose we partition A, B, and C as block s-by-s matrices. For concreteness, we’ll do an example with s = 2. The parallel outer product algorithm involves s phases, where in phase k everyone is using a piece of block column k of A and a piece of block row k of B. And we can share the relevant pieces around by broadcasting them across processor rows and columns.
Parallel outer product (SUMMA)
for k = 0:s-1
for each i in parallel
broadcast A(i,k) to row
for each j in parallel
broadcast B(k,j) to col
On processor (i,j), C(i,j) += A(i,k)*B(k,j);
end
Here’s the pseudocode. At each phase, we have two sets of broadcasts: one to send pieces of column k of A across the rows in our processor grid, the other to send pieces of row k of B. Then each processor does a multiplication and accumulate of the piece that it received.
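Here is a serial NumPy sketch that mimics one SUMMA sweep on an s-by-s virtual grid; in the simulation, the row and column broadcasts just become every (i,j) reading A(i,k) and B(k,j) directly. The grid and block sizes are assumptions.

# Serial NumPy sketch mimicking SUMMA on an s-by-s virtual processor grid.
import numpy as np

s, nb = 3, 4                 # grid dimension and block size (assumptions)
n = s * nb
A = np.random.rand(n, n)
B = np.random.rand(n, n)
C = np.zeros((n, n))

def blk(M, i, j):            # (i,j) block of a matrix
    return M[i*nb:(i+1)*nb, j*nb:(j+1)*nb]

for k in range(s):           # s phases
    for i in range(s):       # "in parallel" over the processor grid
        for j in range(s):
            # In MPI, A(i,k) would be broadcast along processor row i
            # and B(k,j) along processor column j before this update.
            C[i*nb:(i+1)*nb, j*nb:(j+1)*nb] += blk(A, i, k) @ blk(B, k, j)

assert np.allclose(C, A @ B)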
Parallel outer product (SUMMA)
If we have tree along each row/column, then
\(\log(s)\) messages per broadcast
\(\alpha + \beta n^2/s^2\) per message
\(2 \log(s) (\alpha s + \beta n^2/s)\) total communication
Compare to 1D ring: \((p-1) \alpha + (1-1/p) n^2 \beta\)
Note: Same ideas work with block size \(b < n/s\)
How much does this cost? Well, we haven’t talked in detail about how broadcast is implemented in hardware, but it frequently ends up looking like point-to-point messages down a tree. If we do implement broadcast like this, there are log s messages per broadcast, each of which we might model with the usual alpha-beta approach. Adding up the message costs, we have something that involves 2 s log s latencies and a bandwidth cost proportional to n^2/s log s. Compare this to the bandwidth cost in the 1D case, which was almost proportional to n^2! So we’re apparently making better use of our network fabric.
Of course, this type of communication model assumes a particular type of implementation for broadcast. It’s about right for a lot of implementations.
Cannon’s algorithm
SUMMA + “pass it around?”
The one slightly annoying thing about SUMMA is that we are doing lots of broadcast operations. If we have a large number of processors, the log s multiplier on the number of point-to-point messages starts to hurt. But there’s a clever way of rearranging the products that lets us keep the advantages of SUMMA while using the broadcast-free “pass-it-around” communication pattern that we saw with the 1D layout. This is the heart of Cannon’s algorithm.
Cannon’s algorithm
Idea: Reindex products in block matrix multiply \[\begin{aligned}
C(i,j) &= \sum_{k = 0}^{s-1} A(i,k) B(k,j) \\
&= \sum_{k = 0}^{s-1} A(i,\, k+i+j \mod s) \; B(k+i+j \mod s, j)
\end{aligned}\] For a fixed \(k\), a given block of \(A\) (or \(B\)) is needed for the contribution to exactly one \(C(i,j)\).
The idea of Cannon’s algorithm is to reindex things, so that instead of working on inner loop index k all at once, the processor at (i,j) works on inner loop index k + i + j (mod s). The processor will still see all the contributing blocks from block row i of A and block column j of B, but it will see them at different times than the other processors that need those blocks.
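A one-line sanity check that this shift is just a permutation: for every fixed (i, j), the index k + i + j mod s still sweeps each value 0 through s-1 exactly once (s = 4 here is an arbitrary choice).

# The reindexed inner loop still visits every block index exactly once.
s = 4
for i in range(s):
    for j in range(s):
        assert sorted((k + i + j) % s for k in range(s)) == list(range(s))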
Cannon’s algorithm
\[\begin{bmatrix}
C_{00} & C_{01} \\
C_{10} & C_{11}
\end{bmatrix} =
\begin{bmatrix}
A_{00} B_{00} & A_{01} B_{11} \\
A_{11} B_{10} & A_{10} B_{01}
\end{bmatrix} +
\begin{bmatrix}
A_{01} B_{10} & A_{00} B_{01} \\
A_{10} B_{00} & A_{11} B_{11}
\end{bmatrix}\]
It’s again useful to be concrete, so here’s the 2-by-2 version of Cannon’s. Note that the processor at index (0,1) starts with B_11 and ends with B_01, while the processor at index (1,1) starts with B_01 and ends with B_11.
Cannon’s algorithm
% Skew: processor (i,j) now holds A(i,i+j)
for i = 0 to s-1
cycle A(i,:) left by i
end
% Skew: processor (i,j) now holds B(i+j,j)
for j = 0 to s-1
cycle B(:,j) up by j
end
for k = 0 to s-1
in parallel at each (i,j):
C(i,j) = C(i,j) + A(i,j)*B(i,j);
cycle A(i,:) left by 1
cycle B(:,j) up by 1
end
More generally, Cannon’s algorithm starts with a cyclic shift in each row of A (by the row index) and in each column of B (by the column index). After that, we do the “pass it around” protocol on rows and columns, and all the data gets to where it needs to go. Book-keeping!
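And here is a serial NumPy sketch of that bookkeeping: a[i][j] and b[i][j] are the blocks currently held by processor (i,j), the first two comprehensions do the initial skew, and the shifts inside the loop are the pass-it-around steps. All names and sizes are assumptions for illustration.

# Serial NumPy sketch mimicking Cannon's algorithm on an s-by-s grid.
import numpy as np

s, nb = 3, 4                  # grid dimension and block size (assumptions)
n = s * nb
A = np.random.rand(n, n)
B = np.random.rand(n, n)
C = np.zeros((n, n))

def blk(M, i, j):
    return M[i*nb:(i+1)*nb, j*nb:(j+1)*nb].copy()

a = [[blk(A, i, j) for j in range(s)] for i in range(s)]
b = [[blk(B, i, j) for j in range(s)] for i in range(s)]

# Initial skew: cycle row i of A left by i, column j of B up by j.
a = [[a[i][(j + i) % s] for j in range(s)] for i in range(s)]
b = [[b[(i + j) % s][j] for j in range(s)] for i in range(s)]

for k in range(s):            # s multiply-and-shift phases
    for i in range(s):
        for j in range(s):
            C[i*nb:(i+1)*nb, j*nb:(j+1)*nb] += a[i][j] @ b[i][j]
    a = [[a[i][(j + 1) % s] for j in range(s)] for i in range(s)]  # cycle left
    b = [[b[(i + 1) % s][j] for j in range(s)] for i in range(s)]  # cycle up

assert np.allclose(C, A @ B)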
Cost of Cannon
Assume 2D torus topology
Initial cyclic shifts: \(\leq s\) messages each (\(\leq 2s\) total)
For each phase: \(2\) messages each (\(2s\) total)
Each message is size \(n^2/s^2\)
Communication cost: \(4s(\alpha + \beta n^2/s^2) = 4(\alpha s + \beta n^2/s)\)
This communication cost is optimal!
... but SUMMA is simpler, more flexible, almost as good
All right, what is the communication cost here? If we have something like a 2D torus topology, we can overlap almost all the messages in Cannon. Because we’re sending point-to-point messages instead of broadcasting, we get a log s factor improvement over SUMMA. This turns out to be optimal in some sense. But Cannon’s algorithm requires that the processors form a square grid, whereas SUMMA allows a more flexible layout. And SUMMA is just simpler to think about! So SUMMA often gets used instead of Cannon.
Speedup and efficiency
Recall \[\begin{aligned}
\mathrm{Speedup} & := t_{\mathrm{serial}} / t_{\mathrm{parallel}} \\
\mathrm{Efficiency} & := \mathrm{Speedup}/p
\end{aligned}\]
What’s the punchline of all this? Well, we are ultimately interested in the parallel speedup, which we can also express in terms of the efficiency of our code - perfect speedup corresponds to 100 percent efficiency.
Speedup and efficiency
1D layout
\(\left( 1+O\left( \frac{p}{n} \right) \right)^{-1}\)
SUMMA
\(\left( 1+O\left( \frac{\sqrt{p} \log p}{n} \right) \right)^{-1}\)
Cannon
\(\left (1+O\left( \frac{\sqrt{p}}{n} \right) \right)^{-1}\)
Assuming no overlap of communication and computation.
Assuming that we are not allowed to overlap communication and computation, and that we have big matrices relative to the number of processors, the efficiency of the 1D layout is 1/(1+O(p/n)). For SUMMA, we replace the p by sqrt(p) log p; and for Cannon, we replace the p by sqrt(p).
Of course, these complexities should be taken with a big grain of salt. In particular, they assume something about the network topology. Nonetheless, the general idea is right.
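Just to see the trend in those expressions, here is a throwaway script that evaluates them with the hidden constants crudely set to one; treat the output as the shape of the curves, not as a performance prediction.

# Crude comparison of the efficiency models (all hidden constants set to 1).
from math import sqrt, log

def eff_1d(p, n):     return 1 / (1 + p / n)
def eff_summa(p, n):  return 1 / (1 + sqrt(p) * log(p) / n)
def eff_cannon(p, n): return 1 / (1 + sqrt(p) / n)

n = 4096
for p in [16, 256, 4096]:
    print(f"p={p}: 1D {eff_1d(p, n):.3f}, "
          f"SUMMA {eff_summa(p, n):.3f}, Cannon {eff_cannon(p, n):.3f}")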
If I may end with a pun: to sum up, use SUMMA!
Talk to you in another slide deck soon.