CS 5220

GEMV, GEMM, and LU

David Bindel

2024-10-24

Matrix multiply

Matrix vector product

Simple $y = Ax$ involves two indices: \[ y_i = \sum_{j} A_{ij} x_j \] Sums can go in any order!

Matrix vector product

Organize $y = Ax$ around rows or columns:

% Row-oriented
for i = 1:n
  y(i) = A(i,:)*x;
end

% Col-oriented
y = 0;
for j = 1:n
  y = y + A(:,j)*x(j);
end

... or deal with index space in other ways!

Parallel matvec: 1D row-blocked

Broadcast $x$, compute rows independently.

Parallel matvec: 1D row-blocked

Allgather(xlocal, xall)
ylocal = Alocal * xall

Parallel matvec: 1D col-blocked

Compute partial matvecs and reduce.

Parallel matvec: 1D col-blocked

z = Alocal * xlocal
for j = 1:p
    Reduce(sum, z[i], ylocal at proc i)
end

Parallel matvec: 2D blocked

Involves broadcast and reduction
... but with subsets of processors

Parallel matmul

Basic operation: $C = C+AB$
Computation: $2n^3$ flops
Goal: $2n^3/p$ flops per processor, minimal communication
Two main contenders: SUMMA and Cannon

1D layout

Column partitioning of matrices for C += A*B

Block MATLAB notation: $A(:,j)$ means $j$th block column
Processor $j$ owns $A(:,j)$, $B(:,j)$, $C(:,j)$
$C(:,j)$ depends on all of $A$, but only $B(:,j)$
How do we communicate pieces of $A$?

1D layout on ring

Every process $j$ can send data to $j+1$ simultaneously
Pass slices of $A$ around the ring until everyone sees the whole matrix ($p-1$ phases).

1D layout on ring

tmp = A(:,myproc)
C(myproc) += tmp*B(myproc,myproc)
for j = 1 to p-1
  sendrecv tmp to myproc+1 mod p, 
           from myproc-1 mod p
  C(myproc) += tmp*B(myproc-j mod p, myproc)

Performance model?

1D layout on ring

In a simple $\alpha-\beta$ model, at each processor:

$p-1$ message sends (and simultaneous receives)
Each message involves $n^2/p$ data
Communication cost: $(p-1) \alpha + (1-1/p) n^2 \beta$

Outer product algorithm

Recall outer product organization:

for k = 0:s-1
  C += A(:,k)*B(k,:);
end

Outer product algorithm

Parallel: Assume $p = s^2$ processors, block $s \times s$ matrices.
For a $2 \times 2$ example: \[\begin{bmatrix} C_{00} & C_{01} \\ C_{10} & C_{11} \end{bmatrix} = \begin{bmatrix} A_{00} B_{00} & A_{00} B_{01} \\ A_{10} B_{00} & A_{10} B_{01} \end{bmatrix} + \begin{bmatrix} A_{01} B_{10} & A_{01} B_{11} \\ A_{11} B_{10} & A_{11} B_{11} \end{bmatrix} \]

Processor for each $(i,j)$ $\implies$ parallel work for each $k$!
Note everyone in row $i$ uses $A(i,k)$ at once,
and everyone in row $j$ uses $B(k,j)$ at once.

Parallel outer product (SUMMA)

for k = 0:s-1
  for each i in parallel
    broadcast A(i,k) to row
  for each j in parallel
    broadcast A(k,j) to col
  On processor (i,j), C(i,j) += A(i,k)*B(k,j);
end

Parallel outer product (SUMMA)

If we have tree along each row/column, then

$\log(s)$ messages per broadcast
$\alpha + \beta n^2/s^2$ per message
$2 \log(s) (\alpha s + \beta n^2/s)$ total communication
Compare to 1D ring: $(p-1) \alpha + (1-1/p) n^2 \beta$

Note: Same ideas work with block size $b < n/s$

Cannon’s algorithm

SUMMA + “pass it around?”

Cannon’s algorithm

Idea: Reindex products in block matrix multiply \[\begin{aligned} C(i,j) &= \sum_{k = 0}^{p-1} A(i,k) B(k,j) \\ &= \sum_{k = 0}^{p-1} A(i,\, k+i+j \mod p) \; B(k+i+j \mod p, j) \end{aligned}\] For a fixed $k$, a given block of $A$ (or $B$) is needed for contribution to exactly one $C(i,j)$.

Cannon’s algorithm

\[\begin{bmatrix} C_{00} & C_{01} \\ C_{10} & C_{11} \end{bmatrix} = \begin{bmatrix} A_{00} B_{00} & A_{01} B_{11} \\ A_{11} B_{10} & A_{10} B_{01} \end{bmatrix} + \begin{bmatrix} A_{01} B_{10} & A_{00} B_{01} \\ A_{10} B_{00} & A_{11} B_{11} \end{bmatrix}\]

Cannon’s algorithm

% Move A(i,j) to A(i,i+j)
for i = 0 to s-1
  cycle A(i,:) left by i

% Move B(i,j) to B(i+j,j)
for j = 0 to s-1
  cycle B(:,j) up by j

for k = 0 to s-1
  in parallel;
    C(i,j) = C(i,j) + A(i,j)*B(i,j);
  cycle A(:,i) left by 1
  cycle B(:,j) up by 1

Cost of Cannon

Assume 2D torus topology
Initial cyclic shifts: $\leq s$ messages each ($\leq 2s$ total)
For each phase: $2$ messages each ($2s$ total)
Each message is size $n^2/s^2$
Communication cost: $4s(\alpha + \beta n^2/s^2) = 4(\alpha s + \beta n^2/s)$
This communication cost is optimal!
... but SUMMA is simpler, more flexible, almost as good

Speedup and efficiency

Recall \[\begin{aligned} \mathrm{Speedup} & := t_{\mathrm{serial}} / t_{\mathrm{parallel}} \\ \mathrm{Efficiency} & := \mathrm{Speedup}/p \end{aligned}\]

Speedup and efficiency

1D layout	$\left( 1+O\left( \frac{p}{n} \right) \right)^{-1}$
SUMMA	$\left( 1+O\left( \frac{\sqrt{p} \log p}{n} \right) \right)^{-1}$
Cannon	$\left (1+O\left( \frac{\sqrt{p}}{n} \right) \right)^{-1}$

Assuming no overlap of communication and computation.

LU

Reminder: Evolution of LU

On board... or not.

LU by example

\[ \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 10 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ 13 \\ 22 \end{bmatrix} \]

LU by example

\[ \begin{bmatrix} 1 & 0 & 0 \\ -4 & 1 & 0 \\ -7 & 0 & 1 \end{bmatrix} \left( \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 10 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ 13 \\ 22 \end{bmatrix} \right) \]

LU by example

\[ \begin{bmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 0 & -6 & -11 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ -3 \\ -6 \end{bmatrix} \]

LU by example

\[ \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -2 & 1 \\ \end{bmatrix} \left( \begin{bmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 0 & -6 & -11 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ -3 \\ -6 \end{bmatrix} \right) \]

LU by example

\[ \begin{bmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ -3 \\ 0 \end{bmatrix} \]

LU by example

\[ \begin{bmatrix} 1 & 0 & 0 \\ 4 & 1 & 0 \\ 7 & 2 & 1 \end{bmatrix} \begin{bmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 10 \end{bmatrix} \]

Simple LU

Overwrite $A$ with $L$ and $U$

for j = 1:n-1
  for i = j+1:n
    A(i,j) = A(i,j) / A(j,j);    % Compute multiplier
    for k = j+1:n
      A(i,k) -= A(i,j) * A(j,k); % Update row
    end
  end
end

Simple LU

Overwrite $A$ with $L$ and $U$

for j = 1:n-1
  A(j+1:n,j) = A(j+1:n,j)/A(j,j);              % Compute multipliers
  A(j+1:n,j+1:n) -= A(j+1:n,j) * A(j, j+1:n);  % Trailing update
end

Pivoting

Stability is a problem! Compute $PA = LU$

p = 1:n;
for j = 1:n-1
  [~,jpiv] = max(abs(A(j+1:n,j)));             % Find pivot
  A([j, j+jpiv-1],:) = A([j+jpiv-1, j]);       % Swap pivot row
  p([j, j+jpiv-1],:) = p([j+jpiv-1, j]);       % Save pivot info
  A(j+1:n,j) = A(j+1:n,j)/A(j,j);              % Compute multipliers
  A(j+1:n,j+1:n) -= A(j+1:n,j) * A(j, j+1:n);  % Trailing update
end

Blocking

Think in a way that uses level 3 BLAS \[ \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{bmatrix} \begin{bmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{bmatrix} \]

Blocking

Think in a way that uses level 3 BLAS \[ \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} L_{11} U_{11} & L_{11} U_{12} \\ L_{21} U_{11} & L_{21} U_{12} + L_{22} U_{22} \end{bmatrix} \]

Blocking

Think in a way that uses level 3 BLAS \[\begin{aligned} L_{11} U_{11} &= A_{11} \\ U_{12} &= L_{11}^{-1} A_{12} \\ L_{21} &= A_{21} U_{11}^{-1} \\ L_{22} U_{22} &= A_{22} - L_{21} U_{12} \end{aligned}\]

Enter pictures

Still haven’t showed how to do pivoting!
Easier to draw diagrams from here
Take 6210 or 4220 if you want more on LU!

Blocked GEPP

Sketch of data accessed for pivot search

Find pivot

Blocked GEPP

Swap pivot row

Blocked GEPP

Update within block column

Blocked GEPP

Delayed update (at end of block)

Big idea

Delayed update strategy lets us do LU fast
- Could have also delayed application of pivots
Same idea with other one-sided factorizations (QR)
Decent multi-core speedup with parallel BLAS!
... assuming $n$ sufficiently large.

Issues left over (block size?)...

Explicit parallelization of GE

What to do:

Decompose into work chunks
Assign work to threads in a balanced way
Orchestrate communication + synchronization
Map which processors execute which threads

Possible matrix layouts

How should we share the matrix across ranks?

1D col blocked

\[\begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \end{bmatrix}\]

1D col cyclic

\[\begin{bmatrix} 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \end{bmatrix}\]

1D col block cyclic

\[\begin{bmatrix} 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \end{bmatrix}\]

Block skewed

\[\begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 2 & 2 & 2 & 0 & 0 & 0 & 1 & 1 & 1 \\ 2 & 2 & 2 & 0 & 0 & 0 & 1 & 1 & 1 \\ 2 & 2 & 2 & 0 & 0 & 0 & 1 & 1 & 1 \\ 1 & 1 & 1 & 2 & 2 & 2 & 0 & 0 & 0 \\ 1 & 1 & 1 & 2 & 2 & 2 & 0 & 0 & 0 \\ 1 & 1 & 1 & 2 & 2 & 2 & 0 & 0 & 0 \end{bmatrix}\]

2D block cyclic

\[\begin{bmatrix} 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 2 & 2 & 3 & 3 & 2 & 2 & 3 & 3 \\ 2 & 2 & 3 & 3 & 2 & 2 & 3 & 3 \\ 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 2 & 2 & 3 & 3 & 2 & 2 & 3 & 3 \\ 2 & 2 & 3 & 3 & 2 & 2 & 3 & 3 \\ \end{bmatrix}\]

Possible matrix layouts

1D col blocked: bad load balance
1D col cyclic: hard to use BLAS2/3
1D col block cyclic: factoring col a bottleneck
Block skewed (a la Cannon): just complicated
2D row/col block: bad load balance
2D row/col block cyclic: win!

Distributed GEPP

Find pivot (column broadcast)

Distributed GEPP

Swap pivot row within block column + broadcast pivot

Distributed GEPP

Update within block column

Distributed GEPP

At end of block, broadcast swap info along rows

Distributed GEPP

Apply all row swaps to other columns

Distributed GEPP

Broadcast block $L_{II}$ right

Distributed GEPP

Update remainder of block row

Distributed GEPP

Broadcast rest of block row down

Distributed GEPP

Broadcast rest of block col right

Distributed GEPP

Update of trailing submatrix

Cost of ScaLAPACK GEPP

Communication costs:

Lower bound: $O(n^2/\sqrt{P})$ words, $O(\sqrt{P})$ messages
ScaLAPACK:
- $O(n^2 \log P / \sqrt{P})$ words sent
- $O(n \log p)$ messages
- Problem: reduction to find pivot in each column
Tournaments for stability without partial pivoting

If you don’t care about dense LU?
Let’s review some ideas in a different setting...

Floyd-Warshall

Goal: All pairs shortest path lengths.
Idea: Dynamic programming! Define \[d_{ij}^{(k)} = \mbox{shortest path $i$ to $j$ with intermediates in $\{1, \ldots, k\}$}.\] Then \[d_{ij}^{(k)} = \min\left( d_{ij}^{(k-1)}, d_{ik}^{(k-1)} + d_{kj}^{(k-1)} \right)\] and $d_{ij}^{(n)}$ is the desired shortest path length.

The same and different

Floyd’s algorithm for all-pairs shortest paths:

for k=1:n
  for i = 1:n
    for j = 1:n
      D(i,j) = min(D(i,j), D(i,k)+D(k,j));

Unpivoted Gaussian elimination (overwriting $A$):

for k=1:n
  for i = k+1:n
    A(i,k) = A(i,k) / A(k,k);
    for j = k+1:n
      A(i,j) = A(i,j)-A(i,k)*A(k,j);

The same and different

The same: $O(n^3)$ time, $O(n^2)$ space
The same: can’t move $k$ loop (data dependencies)
- ... at least, can’t without care!
- Different from matrix multiplication
The same: $x_{ij}^{(k)} = f\left(x_{ij}^{(k-1)}, g\left(x_{ik}^{(k-1)}, x_{kj}^{(k-1)}\right)\right)$
- Same basic dependency pattern in updates!
- Similar algebraic relations satisfied
Different: Update to full matrix vs trailing submatrix

How far can we get?

How would we write

Cache-efficient (blocked) serial implementation?
Message-passing parallel implementation?

The full picture could make a fun class project...

Onward!

Next up: Sparse linear algebra and iterative solvers!

1D layout	\(\left( 1+O\left( \frac{p}{n} \right) \right)^{-1}\)
SUMMA	\(\left( 1+O\left( \frac{\sqrt{p} \log p}{n} \right) \right)^{-1}\)
Cannon	\(\left (1+O\left( \frac{\sqrt{p}}{n} \right) \right)^{-1}\)