GEMV, GEMM, and LU
2024-10-24
Simple \(y = Ax\) involves two indices: \[ y_i = \sum_{j} A_{ij} x_j \] Sums can go in any order!
Organize \(y = Ax\) around rows or columns:
% Row-oriented
for i = 1:n
y(i) = A(i,:)*x;
end
% Col-oriented
y = 0;
for j = 1:n
y = y + A(:,j)*x(j);
end
... or deal with index space in other ways!
Broadcast \(x\), compute rows independently.
Allgather(xlocal, xall)
ylocal = Alocal * xall
Compute partial matvecs and reduce.
z = Alocal * xlocal
for j = 1:p
Reduce(sum, z[i], ylocal at proc i)
end
tmp = A(:,myproc)
C(myproc) += tmp*B(myproc,myproc)
for j = 1 to p-1
sendrecv tmp to myproc+1 mod p,
from myproc-1 mod p
C(myproc) += tmp*B(myproc-j mod p, myproc)
Performance model?
In a simple \(\alpha-\beta\) model, at each processor:
Recall outer product organization:
for k = 0:s-1
C += A(:,k)*B(k,:);
end
Parallel: Assume \(p = s^2\) processors, block \(s \times s\) matrices.
For a \(2 \times 2\) example: \[\begin{bmatrix}
C_{00} & C_{01} \\
C_{10} & C_{11}
\end{bmatrix} =
\begin{bmatrix}
A_{00} B_{00} & A_{00} B_{01} \\
A_{10} B_{00} & A_{10} B_{01}
\end{bmatrix} +
\begin{bmatrix}
A_{01} B_{10} & A_{01} B_{11} \\
A_{11} B_{10} & A_{11} B_{11}
\end{bmatrix}
\]
for k = 0:s-1
for each i in parallel
broadcast A(i,k) to row
for each j in parallel
broadcast A(k,j) to col
On processor (i,j), C(i,j) += A(i,k)*B(k,j);
end
If we have tree along each row/column, then
Note: Same ideas work with block size \(b < n/s\)
SUMMA + “pass it around?”
Idea: Reindex products in block matrix multiply \[\begin{aligned} C(i,j) &= \sum_{k = 0}^{p-1} A(i,k) B(k,j) \\ &= \sum_{k = 0}^{p-1} A(i,\, k+i+j \mod p) \; B(k+i+j \mod p, j) \end{aligned}\] For a fixed \(k\), a given block of \(A\) (or \(B\)) is needed for contribution to exactly one \(C(i,j)\).
\[\begin{bmatrix} C_{00} & C_{01} \\ C_{10} & C_{11} \end{bmatrix} = \begin{bmatrix} A_{00} B_{00} & A_{01} B_{11} \\ A_{11} B_{10} & A_{10} B_{01} \end{bmatrix} + \begin{bmatrix} A_{01} B_{10} & A_{00} B_{01} \\ A_{10} B_{00} & A_{11} B_{11} \end{bmatrix}\]
% Move A(i,j) to A(i,i+j)
for i = 0 to s-1
cycle A(i,:) left by i
% Move B(i,j) to B(i+j,j)
for j = 0 to s-1
cycle B(:,j) up by j
for k = 0 to s-1
in parallel;
C(i,j) = C(i,j) + A(i,j)*B(i,j);
cycle A(:,i) left by 1
cycle B(:,j) up by 1
Recall \[\begin{aligned} \mathrm{Speedup} & := t_{\mathrm{serial}} / t_{\mathrm{parallel}} \\ \mathrm{Efficiency} & := \mathrm{Speedup}/p \end{aligned}\]
1D layout | \(\left( 1+O\left( \frac{p}{n} \right) \right)^{-1}\) |
SUMMA | \(\left( 1+O\left( \frac{\sqrt{p} \log p}{n} \right) \right)^{-1}\) |
Cannon | \(\left (1+O\left( \frac{\sqrt{p}}{n} \right) \right)^{-1}\) |
Assuming no overlap of communication and computation.
On board... or not.
\[ \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 10 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ 13 \\ 22 \end{bmatrix} \]
\[ \begin{bmatrix} 1 & 0 & 0 \\ -4 & 1 & 0 \\ -7 & 0 & 1 \end{bmatrix} \left( \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 10 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ 13 \\ 22 \end{bmatrix} \right) \]
\[ \begin{bmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 0 & -6 & -11 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ -3 \\ -6 \end{bmatrix} \]
\[ \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -2 & 1 \\ \end{bmatrix} \left( \begin{bmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 0 & -6 & -11 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ -3 \\ -6 \end{bmatrix} \right) \]
\[ \begin{bmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ -3 \\ 0 \end{bmatrix} \]
\[ \begin{bmatrix} 1 & 0 & 0 \\ 4 & 1 & 0 \\ 7 & 2 & 1 \end{bmatrix} \begin{bmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 10 \end{bmatrix} \]
Overwrite \(A\) with \(L\) and \(U\)
for j = 1:n-1
for i = j+1:n
A(i,j) = A(i,j) / A(j,j); % Compute multiplier
for k = j+1:n
A(i,k) -= A(i,j) * A(j,k); % Update row
end
end
end
Overwrite \(A\) with \(L\) and \(U\)
for j = 1:n-1
A(j+1:n,j) = A(j+1:n,j)/A(j,j); % Compute multipliers
A(j+1:n,j+1:n) -= A(j+1:n,j) * A(j, j+1:n); % Trailing update
end
Stability is a problem! Compute \(PA = LU\)
p = 1:n;
for j = 1:n-1
[~,jpiv] = max(abs(A(j+1:n,j))); % Find pivot
A([j, j+jpiv-1],:) = A([j+jpiv-1, j]); % Swap pivot row
p([j, j+jpiv-1],:) = p([j+jpiv-1, j]); % Save pivot info
A(j+1:n,j) = A(j+1:n,j)/A(j,j); % Compute multipliers
A(j+1:n,j+1:n) -= A(j+1:n,j) * A(j, j+1:n); % Trailing update
end
Think in a way that uses level 3 BLAS \[ \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{bmatrix} \begin{bmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{bmatrix} \]
Think in a way that uses level 3 BLAS \[ \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} L_{11} U_{11} & L_{11} U_{12} \\ L_{21} U_{11} & L_{21} U_{12} + L_{22} U_{22} \end{bmatrix} \]
Think in a way that uses level 3 BLAS \[\begin{aligned} L_{11} U_{11} &= A_{11} \\ U_{12} &= L_{11}^{-1} A_{12} \\ L_{21} &= A_{21} U_{11}^{-1} \\ L_{22} U_{22} &= A_{22} - L_{21} U_{12} \end{aligned}\]
Find pivot
Swap pivot row
Update within block column
Delayed update (at end of block)
Issues left over (block size?)...
What to do:
How should we share the matrix across ranks?
\[\begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \end{bmatrix}\]
\[\begin{bmatrix} 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \end{bmatrix}\]
\[\begin{bmatrix} 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \end{bmatrix}\]
\[\begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 2 & 2 & 2 & 0 & 0 & 0 & 1 & 1 & 1 \\ 2 & 2 & 2 & 0 & 0 & 0 & 1 & 1 & 1 \\ 2 & 2 & 2 & 0 & 0 & 0 & 1 & 1 & 1 \\ 1 & 1 & 1 & 2 & 2 & 2 & 0 & 0 & 0 \\ 1 & 1 & 1 & 2 & 2 & 2 & 0 & 0 & 0 \\ 1 & 1 & 1 & 2 & 2 & 2 & 0 & 0 & 0 \end{bmatrix}\]
\[\begin{bmatrix} 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 2 & 2 & 3 & 3 & 2 & 2 & 3 & 3 \\ 2 & 2 & 3 & 3 & 2 & 2 & 3 & 3 \\ 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 2 & 2 & 3 & 3 & 2 & 2 & 3 & 3 \\ 2 & 2 & 3 & 3 & 2 & 2 & 3 & 3 \\ \end{bmatrix}\]
Find pivot (column broadcast)
Swap pivot row within block column + broadcast pivot
Update within block column
At end of block, broadcast swap info along rows
Apply all row swaps to other columns
Broadcast block \(L_{II}\) right
Update remainder of block row
Broadcast rest of block row down
Broadcast rest of block col right
Update of trailing submatrix
Communication costs:
If you don’t care about dense LU?
Let’s review some ideas in a different setting...
Goal: All pairs shortest path lengths.
Idea: Dynamic programming! Define \[d_{ij}^{(k)} =
\mbox{shortest path $i$ to $j$ with intermediates in $\{1, \ldots, k\}$}.\] Then \[d_{ij}^{(k)} =
\min\left( d_{ij}^{(k-1)}, d_{ik}^{(k-1)} + d_{kj}^{(k-1)} \right)\] and \(d_{ij}^{(n)}\) is the desired shortest path length.
Floyd’s algorithm for all-pairs shortest paths:
for k=1:n
for i = 1:n
for j = 1:n
D(i,j) = min(D(i,j), D(i,k)+D(k,j));
Unpivoted Gaussian elimination (overwriting \(A\)):
for k=1:n
for i = k+1:n
A(i,k) = A(i,k) / A(k,k);
for j = k+1:n
A(i,j) = A(i,j)-A(i,k)*A(k,j);
How would we write
The full picture could make a fun class project...
Next up: Sparse linear algebra and iterative solvers!