Applications of Parallel Computers
Parallel LU
Prof David Bindel
Please click the play button below.
Reminder: Evolution of LU
On board... or not.
Normally, I work through LU factorization on the board. It slows me down and keeps me honest. If you ask, I will do the same this time during our meetings. Alas, this style of slides doesn’t encourage boardwork, so let’s also try sketching our way through it in slides.
LU by example
\[
\begin{bmatrix}
1 & 2 & 3 \\
4 & 5 & 6 \\
7 & 8 & 10
\end{bmatrix}
\begin{bmatrix} x \\ y \\ z \end{bmatrix} =
\begin{bmatrix} 4 \\ 13 \\ 22 \end{bmatrix}
\]
We’ll start with a three-by-three linear system. That is, we are writing three equations in three unknowns, using matrix notation. When you first saw this, you were taught an algorithm for solving such systems: Gaussian elimination. Let’s go through the elimination algorithm with maybe a slightly different notation from the one you are used to.
LU by example
\[
\begin{bmatrix}
1 & 0 & 0 \\
-4 & 1 & 0 \\
-7 & 0 & 1
\end{bmatrix}
\left(
\begin{bmatrix}
1 & 2 & 3 \\
4 & 5 & 6 \\
7 & 8 & 10
\end{bmatrix}
\begin{bmatrix} x \\ y \\ z \end{bmatrix} =
\begin{bmatrix} 4 \\ 13 \\ 22 \end{bmatrix}
\right)
\]
The first step of Gaussian elimination is subtracting multiples of the first row from rows two and three in order to introduce zeros in the first column. We have to apply this operation to both the matrix rows and to the right hand side vector. We’ve written the operation here in matrix form: we leave the first row alone, replace the second row with itself minus four times the first row, and replace the third row with itself minus seven times the first row. The numbers 4 = 4/1 and 7 = 7/1 are the multipliers in the algorithm.
LU by example
\[
\begin{bmatrix}
1 & 2 & 3 \\
0 & -3 & -6 \\
0 & -6 & -11
\end{bmatrix}
\begin{bmatrix} x \\ y \\ z \end{bmatrix} =
\begin{bmatrix} 4 \\ -3 \\ -6 \end{bmatrix}
\]
After we apply the transformation, we are left with this linear system. Now we have zeros below the first diagonal, but there is still a nonzero in the (3, 2) position that we need to clear.
LU by example
\[
\begin{bmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & -2 & 1 \\
\end{bmatrix}
\left(
\begin{bmatrix}
1 & 2 & 3 \\
0 & -3 & -6 \\
0 & -6 & -11
\end{bmatrix}
\begin{bmatrix} x \\ y \\ z \end{bmatrix} =
\begin{bmatrix} 4 \\ -3 \\ -6 \end{bmatrix}
\right)
\]
So, we apply another transformation, subtracting twice the second row off the third row. 2 = -6/-3 is the multiplier in this case.
LU by example
\[
\begin{bmatrix}
1 & 2 & 3 \\
0 & -3 & -6 \\
0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} x \\ y \\ z \end{bmatrix} =
\begin{bmatrix} 4 \\ -3 \\ 0 \end{bmatrix}
\]
After this second step, we have an upper triangular linear system, which we can solve by back-substitution. The solution is z = 0, then y = 1, and then x = 2.
LU by example
\[
\begin{bmatrix}
1 & 0 & 0 \\
4 & 1 & 0 \\
7 & 2 & 1
\end{bmatrix}
\begin{bmatrix}
1 & 2 & 3 \\
0 & -3 & -6 \\
0 & 0 & 1
\end{bmatrix} =
\begin{bmatrix}
1 & 2 & 3 \\
4 & 5 & 6 \\
7 & 8 & 10
\end{bmatrix}
\]
We could get from our upper triangular matrix back to the original matrix by “undoing” the operations that occurred during elimination. We got the third row of U by taking the last row of A and subtracting seven times the first row in the first step, and twice the second row in the second step. To recover the last row of A, we take the last row of U and add back twice the second row of U and seven times the first row of U. Et voilà! Similarly, we can reconstruct the second row of A from the first two rows of U; and the first row of A is the same as the first row of U.
What we have just said in words is that Gaussian elimination can really be seen as factoring A into a unit lower triangular matrix L containing the multipliers in the algorithm (remember, this means lower triangular with ones on the main diagonal) times an upper triangular matrix U. This is the so-called LU decomposition.
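It takes only a few lines of MATLAB (or Octave) to sanity-check the factorization from our worked example. This little script is just an illustration, not part of the algorithm itself:

L = [1 0 0; 4 1 0; 7 2 1];   % Unit lower triangular factor (the multipliers)
U = [1 2 3; 0 -3 -6; 0 0 1]; % Upper triangular factor
A = [1 2 3; 4 5 6; 7 8 10];
b = [4; 13; 22];
norm(L*U - A)                % Should print 0
x = U \ (L \ b)              % Forward then back substitution: [2; 1; 0]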
Simple LU
Overwrite \(A\) with \(L\) and \(U\)
for j = 1:n-1
  for i = j+1:n
    A(i,j) = A(i,j) / A(j,j);          % Compute multiplier
    for k = j+1:n
      A(i,k) = A(i,k) - A(i,j)*A(j,k); % Update row
    end
  end
end
We can write the LU decomposition as three nested loops, incrementally overwriting the matrix A with the multipliers in L and the upper triangular matrix U. This is basically the reduction to upper triangular form; we’re just using the structural zeros that we introduce as spaces where we can stash the associated multipliers.
This whole business with three nested loops is pretty common in dense linear algebra, by the way. It’s a business where almost everything costs O(n^3) time on O(n^2) data, and we know by now that this ratio of work to data is a promising sign!
Simple LU
Overwrite \(A\) with \(L\) and \(U\)
for j = 1:n-1
  A(j+1:n,j) = A(j+1:n,j)/A(j,j);                           % Compute multipliers
  A(j+1:n,j+1:n) = A(j+1:n,j+1:n) - A(j+1:n,j)*A(j,j+1:n); % Trailing update
end
We can write the LU decomposition more compactly using MATLAB notation for slices of matrices. This says we store the multipliers in rows j+1 through n of column j of A, and then update the trailing submatrix in rows and columns j+1 through n by subtracting off the multipliers times row j.
Pivoting
Stability is a problem! Compute \(PA = LU\)
p = 1:n;
for j = 1:n-1
  [~,jpiv] = max(abs(A(j:n,j)));            % Find pivot
  A([j, j+jpiv-1],:) = A([j+jpiv-1, j],:);  % Swap pivot row
  p([j, j+jpiv-1])   = p([j+jpiv-1, j]);    % Save pivot info
  A(j+1:n,j) = A(j+1:n,j)/A(j,j);           % Compute multipliers
  A(j+1:n,j+1:n) = A(j+1:n,j+1:n) - A(j+1:n,j)*A(j,j+1:n); % Trailing update
end
You may recall that Gaussian elimination needs a tweak when we hit a zero pivot. That’s in exact arithmetic; it turns out that even close-to-zero pivots are a problem in floating point. So we usually do Gaussian elimination with partial pivoting; that means that at each step, before subtracting multiples of one row from another to introduce zeros in column j, we permute the rows in order to move the biggest entry in column j up to the diagonal. This ensures all the multipliers are at most one in magnitude, and helps a lot with stabilizing the method.
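As an aside, once the factorization finishes, the permutation vector p tells us how to solve linear systems with the overwritten factors. Here is a minimal sketch in my own notation, continuing the code above:

L = tril(A,-1) + eye(n);  % Unit lower triangular factor from overwritten A
U = triu(A);              % Upper triangular factor
x = U \ (L \ b(p));       % P*A = L*U, so solve L*U*x = b(p)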
Blocking
Think in a way that uses level 3 BLAS \[
\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} =
\begin{bmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{bmatrix}
\begin{bmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{bmatrix}
\]
Let’s put aside pivoting for a moment to bring in one final idea, that of blocking. This is essentially the same concept we saw when we were talking about matrix-matrix multiply. It’s easiest to show how it works by writing out the equations that a block 2-by-2 version of LU factorization should satisfy.
Blocking
Think in a way that uses level 3 BLAS \[
\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} =
\begin{bmatrix}
L_{11} U_{11} & L_{11} U_{12} \\
L_{21} U_{11} & L_{21} U_{12} + L_{22} U_{22}
\end{bmatrix}
\]
Now we explicitly multiply out the two triangular matrices, and we get equations that match submatrices of A with products of pieces of L and U.
Blocking
Think in a way that uses level 3 BLAS \[\begin{aligned}
L_{11} U_{11} &= A_{11} \\
U_{12} &= L_{11}^{-1} A_{12} \\
L_{21} &= A_{21} U_{11}^{-1} \\
L_{22} U_{22} &= A_{22} - L_{21} U_{12}
\end{aligned}\]
Ordering the equations in the previous slide from top left to bottom right gives us a block algorithm for computing the LU factorization. First, do LU factorization of the leading submatrix A11. Then do a triangular solve with L11 in order to get the leading block row of U, and one with U11 in order to get the leading block column of L. Next, subtract the outer product of that leading block column and block row from the trailing submatrix; this is sometimes called the Schur complement update. Finally, recursively do LU factorization (possibly with blocking) on the Schur complement.
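In code, the block algorithm might look something like the sketch below. This is a minimal unpivoted version of my own, assuming for simplicity that the diagonal blocks are factored by a helper (here called simple_lu) that wraps the loop from a few slides back:

function A = block_lu(A, b)
  % Sketch: unpivoted block LU with block size b, overwriting A.
  n = size(A,1);
  for j = 1:b:n
    J = j:min(j+b-1,n);  rest = J(end)+1:n;
    A(J,J) = simple_lu(A(J,J));              % Factor A11 = L11*U11
    L11 = tril(A(J,J),-1) + eye(length(J));
    U11 = triu(A(J,J));
    A(J,rest) = L11 \ A(J,rest);             % U12 = L11^{-1} A12
    A(rest,J) = A(rest,J) / U11;             % L21 = A21 U11^{-1}
    A(rest,rest) = A(rest,rest) - A(rest,J)*A(J,rest); % Schur update
  end
end

function A = simple_lu(A)
  % The earlier unpivoted loop, wrapped as a function.
  n = size(A,1);
  for j = 1:n-1
    A(j+1:n,j) = A(j+1:n,j)/A(j,j);
    A(j+1:n,j+1:n) = A(j+1:n,j+1:n) - A(j+1:n,j)*A(j,j+1:n);
  end
end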
Enter pictures
Still haven’t showed how to do pivoting!
Easier to draw diagrams from here
Take 6210 or 4220 if you want more on LU!
That’s the lightning version of Gaussian elimination with partial pivoting, also known as GEPP, also known as pivoted LU factorization. Of course, I haven’t really written out the code to do both pivoting and blocking simultaneously, but this isn’t really the class for that. If you want to see me write something like this out in detail, wait for a semester when I’m teaching CS 6210 or CS 4220!
I’d like to tell you now how to do Gaussian elimination fast, both on a single core or shared memory machine and in a distributed memory environment. But rather than trying to write out the details from here, let me draw pictures and wave my hands. Of course, you can’t see me wave my hands, so I’d better add that to the narration script. Wave, wave. Wave, wave.
Blocked GEPP
Find pivot
So suppose we’ve gotten partway through Gaussian elimination, and are handling a step somewhere in the middle. What do we do? The first thing we have to do is search the current column for a pivot.
Blocked GEPP
Swap pivot row
Once we have found the largest entry in the column, we swap its row with the current row.
Blocked GEPP
Update within block column
Now we’re set to do an update, subtracting multiples of the current row from the remaining rows. Here, though, we’re going to do something a little different from the simple codes we wrote out explicitly. Rather than applying the updates all the way across the matrix, we will just update within the block column that we are currently processing – which involves touching less memory.
Blocked GEPP
Delayed update (at end of block)
Of course, we can’t put off the full updates forever. Once we get to the end of the block, we will apply them all at once. This is a level 3 BLAS routine, since most of the work is going into multiplying a block of L by a block of U. And we know how to make matrix multiply fast!
Big idea
Delayed update strategy lets us do LU fast
Could have also delayed application of pivots
Same idea with other one-sided factorizations (QR)
Decent multi-core speedup with parallel BLAS!
... assuming \(n\) sufficiently large.
Issues left over (block size?)...
The delayed update strategy is fantastic for letting us accelerate LU factorization. In the algorithm I’ve sketched out, we don’t delay the application of the row exchanges for pivoting, but we could certainly do that as well. Something like this strategy works with other so-called “one-sided” factorizations as well, though we need additional ideas to tackle eigenvalue computations.
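For the curious, here is a rough serial sketch of the delayed-update strategy in the same MATLAB-ish style as before. It is my own illustration rather than LAPACK’s actual code, and it applies the pivot swaps immediately rather than delaying them:

function [A, p] = block_gepp(A, b)
  % Sketch: blocked GEPP with delayed trailing updates, overwriting A.
  n = size(A,1);  p = 1:n;
  for j = 1:b:n
    J = j:min(j+b-1,n);  rest = J(end)+1:n;
    for k = J                                  % Panel factorization
      [~,kp] = max(abs(A(k:n,k)));  kp = kp + k-1;
      A([k kp],:) = A([kp k],:);               % Swap rows (applied immediately)
      p([k kp])   = p([kp k]);
      A(k+1:n,k) = A(k+1:n,k) / A(k,k);
      cols = k+1:J(end);                       % Update within block column only
      A(k+1:n,cols) = A(k+1:n,cols) - A(k+1:n,k)*A(k,cols);
    end
    L11 = tril(A(J,J),-1) + eye(length(J));    % Delayed level-3 updates
    A(J,rest) = L11 \ A(J,rest);               % Block row of U
    A(rest,rest) = A(rest,rest) - A(rest,J)*A(J,rest);
  end
end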
Explicit parallelization of GE
What to do:
Decompose into work chunks
Assign work to threads in a balanced way
Orchestrate communication + synchronization
Map which processors execute which threads
So far, we’ve drawn pictures where we might get parallelism through the BLAS, but otherwise the process is basically sequential. What should we do to explicitly parallelize the algorithm? The answer is we have to figure out how to break the problem into chunks, assign those chunks to different processors in a work-balanced way, and orchestrate the whole process with communication between the processors and any necessary synchronization.
We’ll start with how we decompose the problem.
Possible matrix layouts
How should we share the matrix across ranks?
Or, rather, we’ll start with how to decompose the matrix data! Data layout is itself not such a simple thing.
1D col blocked
\[\begin{bmatrix}
0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\
0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\
0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\
0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\
0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\
0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\
0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\
0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\
0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2
\end{bmatrix}\]
Maybe the simplest data layout is 1D column blocked. We saw this layout in our initial straw-man for how to do matrix-matrix multiply. It wasn’t a great idea there, and it is even worse for Gaussian elimination. The problem is that it has terrible load balance: when we are updating within a column panel, all the work may be falling on one processor; and when we are doing the Schur complement update, we will generally only engage a subset of the processors, leaving the early processors idle as soon as we have eliminated all the variables associated with their columns.
1D col cyclic
\[\begin{bmatrix}
0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\
0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\
0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\
0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\
0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\
0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\
0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\
0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\
0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2
\end{bmatrix}\]
In a 1D cyclic layout, we assign column j to processor j mod p. A 1D cyclic layout gets rid of some of the load balance issues, but it causes new issues. Remember how we wanted to deal with a block of columns at a time for effective cache use? That’s hard with 1D cyclic layouts! It’s hard to use even BLAS 2 with this layout, let alone BLAS 3.
1D col block cyclic
\[\begin{bmatrix}
0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\
0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\
0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\
0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\
0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\
0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\
0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\
0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\
0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\
0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1
\end{bmatrix}\]
A 1D column block cyclic layout has some of the advantages of the blocked layout and some of the advantages of the cyclic layout. It’s still not perfect, though, as the block column factorization becomes a serial bottleneck.
Block skewed
\[\begin{bmatrix}
0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\
0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\
0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\
2 & 2 & 2 & 0 & 0 & 0 & 1 & 1 & 1 \\
2 & 2 & 2 & 0 & 0 & 0 & 1 & 1 & 1 \\
2 & 2 & 2 & 0 & 0 & 0 & 1 & 1 & 1 \\
1 & 1 & 1 & 2 & 2 & 2 & 0 & 0 & 0 \\
1 & 1 & 1 & 2 & 2 & 2 & 0 & 0 & 0 \\
1 & 1 & 1 & 2 & 2 & 2 & 0 & 0 & 0
\end{bmatrix}\]
We could also do something with block skewed indexing. This is a 2D layout where blocks on the same diagonal are assigned to the same processor. It’s effectively the layout we have after doing the cyclic permutations in Cannon’s matrix multiply algorithm. Unfortunately, as we mentioned in our discussion of Cannon’s algorithm, it’s a little restrictive and the indexing gets quite messy.
2D block cyclic
\[\begin{bmatrix}
0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\
0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\
2 & 2 & 3 & 3 & 2 & 2 & 3 & 3 \\
2 & 2 & 3 & 3 & 2 & 2 & 3 & 3 \\
0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\
0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\
2 & 2 & 3 & 3 & 2 & 2 & 3 & 3 \\
2 & 2 & 3 & 3 & 2 & 2 & 3 & 3 \\
\end{bmatrix}\]
A simpler thing to do is a 2D block cyclic layout. This is like the 1D block cyclic layout, except that we cycle assignment of row blocks as well as of column blocks.
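To make the bookkeeping concrete, here is one conventional way to compute which rank owns entry (i, j) under a 2D block cyclic layout. The function name and the row-major rank numbering are my own conventions, not anything mandated by ScaLAPACK; with b = 2 on a 2-by-2 processor grid, it reproduces the picture above.

function r = bc2d_owner(i, j, b, Pr, Pc)
  % Owner rank for entry (i,j) (1-based indices) with square b-by-b
  % blocks on a Pr-by-Pc logical processor grid.
  pr = mod(floor((i-1)/b), Pr);  % Processor row
  pc = mod(floor((j-1)/b), Pc);  % Processor column
  r = pr*Pc + pc;                % Row-major rank numbering
end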
Possible matrix layouts
1D col blocked: bad load balance
1D col cyclic: hard to use BLAS2/3
1D col block cyclic: factoring col a bottleneck
Block skewed (a la Cannon): just complicated
2D row/col block: bad load balance
2D row/col block cyclic: win!
And, indeed, ScaLAPACK uses a 2D block cyclic layout for matrices. And I assume SLATE will do the same!
So, supposing we’ve used a 2D block cyclic layout to partition our matrix A across processors, how should we think about Gaussian elimination?
Distributed GEPP
Find pivot (column broadcast)
If we’re going to do partial pivoting, the first step for eliminating column j remains the pivot search. This involves communicating across all processors involved in column j. I’ve written broadcast here, but really I should have probably said “reduction,” since finding the max of a set of elements is a type of reduction operation.
Distributed GEPP
Swap pivot row within block column + broadcast pivot
Once we’ve found the pivot row, we apply a swap, at least within the block column. We’re going to want that pivot row for the update step, so this really is a broadcast operation, not just a point-to-point between the processor that owns row j and the one that owns the pivot row.
Distributed GEPP
Update within block column
With pivot in hand, everyone can independently update their piece of the current block column.
Distributed GEPP
At end of block, broadcast swap info along rows
When we get to the end of a block, we broadcast the set of swaps that we’ve made to be applied across the rest of the rows involved; that is, we basically always delay the pivot application in the distributed memory case, even if we might not bother in the single-core case.
Distributed GEPP
Apply all row swaps to other columns
Once everyone knows what row swaps have to be made, we do whatever communication is needed to actually apply them.
Distributed GEPP
Broadcast block \(L_{II}\) right
Now we want to compute a block row of U. To do this, all the processors that own the corresponding block row of A need to get the lower triangular block that we just computed.
Distributed GEPP
Update remainder of block row
Once we have the diagonal block of L, everyone can compute their part of the block row of U without further communication.
This finishes the computation of the block row of U and block column of L; the next step is the Schur complement update.
Distributed GEPP
Broadcast rest of block row down
We start the Schur complement computation by broadcasting the row of U down to the processors that own the trailing submatrix.
Distributed GEPP
Broadcast rest of block col right
Similarly, we broadcast the block column of L rightward to the processors that own the trailing submatrix.
Distributed GEPP
Update of trailing submatrix
And, finally, everyone has the data they need to apply the update without further communication.
Cost of ScaLAPACK GEPP
Communication costs:
Lower bound: \(\Omega(n^2/\sqrt{P})\) words, \(\Omega(\sqrt{P})\) messages
ScaLAPACK:
\(O(n^2 \log P / \sqrt{P})\) words sent
\(O(n \log P)\) messages
Problem: reduction to find pivot in each column
Tournaments for stability without partial pivoting
If you don’t care about dense LU?
Let’s review some ideas in a different setting...
The algorithm that we just described is within a log factor of optimal in the amount of data sent, but it ends up sending a lot of messages. Part of the reason for the number of messages is that at every step, we have to apply a collective op in order to find the pivot row! Alternate pivoting strategies have been developed that involve less communication, using tournaments to find a “good enough” pivot at each step; but I’m not going to go into that now.
Some of you may be asking at this point “why have I just listened to this many slides about dense Gaussian elimination?” You may feel that you don’t care that much about linear algebra, sad as it makes me to admit such a possibility. But maybe you still care about this computational pattern, because it shows up in other places, too.
Floyd-Warshall
Goal: All pairs shortest path lengths.
Idea: Dynamic programming! Define \[d_{ij}^{(k)} =
\mbox{shortest path $i$ to $j$ with intermediates in $\{1, \ldots, k\}$}.\] Then \[d_{ij}^{(k)} =
\min\left( d_{ij}^{(k-1)}, d_{ik}^{(k-1)} + d_{kj}^{(k-1)} \right)\] and \(d_{ij}^{(n)}\) is the desired shortest path length.
Let’s consider a problem that looks completely different: that of finding the shortest distances between every pair of vertices in a graph. One of the standard methods for this problem is the Floyd-Warshall algorithm. Floyd-Warshall is a dynamic programming approach that involves a recurrence on shortest paths that involve only nodes 1 through k for increasing values of k.
It’s OK if you want to stare at this slide for a moment. Or if you want to look at the Wikipedia page for Floyd-Warshall. I’ll wait.
The same and different
Floyd’s algorithm for all-pairs shortest paths:
for k = 1:n
  for i = 1:n
    for j = 1:n
      D(i,j) = min(D(i,j), D(i,k)+D(k,j));
    end
  end
end
Unpivoted Gaussian elimination (overwriting \(A\)):
for k = 1:n
  for i = k+1:n
    A(i,k) = A(i,k) / A(k,k);
    for j = k+1:n
      A(i,j) = A(i,j) - A(i,k)*A(k,j);
    end
  end
end
Let’s compare these two unrelated-looking algorithms: Floyd-Warshall and unpivoted Gaussian elimination.
You may notice that there are a lot of similarities!
The same and different
The same: \(O(n^3)\) time, \(O(n^2)\) space
The same: can’t move \(k\) loop (data dependencies)
... at least, can’t without care!
Different from matrix multiplication
The same: \(x_{ij}^{(k)} = f\left(x_{ij}^{(k-1)}, g\left(x_{ik}^{(k-1)}, x_{kj}^{(k-1)}\right)\right)\)
Same basic dependency pattern in updates!
Similar algebraic relations satisfied
Different: Update to full matrix vs trailing submatrix
What is the same? Both algorithms are n^3 time and n^2 space. Both involve a k loop that can’t be freely interchanged with the other loops, because of data dependencies; this is different from the case of matrix multiplication, for example. The basic pattern of the updates is the same, too. The main difference is really that Floyd-Warshall updates the full matrix where Gaussian elimination only updates a trailing submatrix.
How far can we get?
How would we write
Cache-efficient (blocked) serial implementation?
Message-passing parallel implementation?
The full picture could make a fun class project...
In fact, we could take almost all of the tricks that we just described for blocking and parallelization of Gaussian elimination and apply them to Floyd-Warshall! I have done exactly that as a project for this class in the past; you are welcome to play with it for a final project if you are still looking.
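To give a flavor of what that project involves, here is a serial sketch of blocked Floyd-Warshall, with a min-plus product playing the role that the matrix multiply kernel played in blocked LU. The function names are mine, and I assume for simplicity that the block size b divides n and that D(i,i) = 0.

function D = block_fw(D, b)
  % Sketch: blocked Floyd-Warshall on an n-by-n distance matrix D
  % (inf where there is no edge, zeros on the diagonal).
  n = size(D,1);
  for k = 1:b:n
    K = k:k+b-1;
    D(K,K) = fw(D(K,K));                       % Diagonal block first
    for j = 1:b:n                              % Then row and column panels
      J = j:j+b-1;
      if j ~= k
        D(K,J) = min(D(K,J), minplus(D(K,K), D(K,J)));
        D(J,K) = min(D(J,K), minplus(D(J,K), D(K,K)));
      end
    end
    for i = 1:b:n                              % Then everything else
      I = i:i+b-1;
      for j = 1:b:n
        J = j:j+b-1;
        if i ~= k && j ~= k
          D(I,J) = min(D(I,J), minplus(D(I,K), D(K,J)));
        end
      end
    end
  end
end

function D = fw(D)
  % Unblocked Floyd-Warshall kernel, vectorized over i and j.
  for k = 1:size(D,1)
    D = min(D, D(:,k) + D(k,:));  % Implicit expansion (R2016b+/Octave)
  end
end

function C = minplus(A, B)
  % Min-plus "matrix multiply": C(i,j) = min over k of A(i,k) + B(k,j).
  C = inf(size(A,1), size(B,2));
  for k = 1:size(A,2)
    C = min(C, A(:,k) + B(k,:));
  end
end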
Onward!
Next up: Sparse linear algebra and iterative solvers!
There is a great deal more that I could say about dense linear algebra. Indeed, I teach a graduate course that is almost entirely dense linear algebra, and without the HPC angle! But 5220 is not that class, and I’ve used up the time that I planned to spend on dense matrix computations. We still have a week to talk about sparse linear algebra, though, and we will take that topic up next.
Until next time!