Prof David Bindel
Two flavors: dense and sparse
Common structures, no complicated indexing
Stuff not stored in dense form!
Level-1 BLAS: 15 ops (mostly) on vectors
Level-2 BLAS: 25 ops (mostly) on matrix/vector pairs
Level-3 BLAS: 9 ops (mostly) on matrix/matrix
Efficient cache utilization!
LU for \(2 \times 2\): \[ \begin{bmatrix} a & b \\ c & d \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ c/a & 1 \end{bmatrix} \begin{bmatrix} a & b \\ 0 & d-bc/a \end{bmatrix} \]
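A quick numerical check of the \(2 \times 2\) formula, with illustrative values \(a = 2\), \(b = 1\), \(c = 4\), \(d = 5\) (chosen here only as an example), so that \(c/a = 2\) and \(d - bc/a = 3\): \[ \begin{bmatrix} 2 & 1 \\ 4 & 5 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} 2 & 1 \\ 0 & 3 \end{bmatrix} \]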
Block elimination \[ \begin{bmatrix} A & B \\ C & D \end{bmatrix} = \begin{bmatrix} I & 0 \\ CA^{-1} & I \end{bmatrix} \begin{bmatrix} A & B \\ 0 & D-CA^{-1}B \end{bmatrix} \]
Block LU \[\begin{split} \begin{bmatrix} A & B \\ C & D \end{bmatrix} &= \begin{bmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{bmatrix} \begin{bmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{bmatrix} \\ &= \begin{bmatrix} L_{11} U_{11} & L_{11} U_{12} \\ L_{21} U_{11} & L_{21} U_{12} + L_{22} U_{22} \end{bmatrix} \end{split}\]
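Matching blocks on the two sides of the block LU identity gives one equation per block, which is what the algorithm solves in order. Writing the lower-left block of the \(L\) factor as \(L_{21}\): \[\begin{aligned} A &= L_{11} U_{11} \\ B &= L_{11} U_{12} &&\Rightarrow\; U_{12} = L_{11}^{-1} B \\ C &= L_{21} U_{11} &&\Rightarrow\; L_{21} = C U_{11}^{-1} \\ D - L_{21} U_{12} &= L_{22} U_{22} \end{aligned}\]
Since \(L_{21} U_{12} = C U_{11}^{-1} L_{11}^{-1} B = C A^{-1} B\), the matrix factored in the last step is the same Schur complement \(D - CA^{-1}B\) that appears in block elimination.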
Think of \(A\) as \(k \times k\), \(k\) moderate:
[L11,U11] = small_lu(A); % Small block LU
U12 = L11\B; % Triangular solve
L21 = C/U11; % Triangular solve
S = D-L21*U12; % Rank k update
[L22,U22] = lu(S); % Finish factoring
Three level-3 BLAS calls: two triangular solves and one rank-k update!
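The steps above can be turned into a loop that applies the same three operations to each successive diagonal block. The following is a minimal sketch rather than a production routine: it assumes no pivoting is needed (for example, a strictly diagonally dominant matrix), and the names `block_lu` and the body of `small_lu` are illustrative choices, since the slides do not define `small_lu`.

```matlab
function [L, U] = block_lu(M, k)
% BLOCK_LU  Blocked LU factorization without pivoting (illustrative sketch).
%   Assumes every leading principal submatrix of M is nonsingular.
%   k is the block size.
  n = size(M, 1);
  L = eye(n);
  U = zeros(n);
  for j = 1:k:n
    I = j:min(j+k-1, n);                   % current diagonal block
    J = I(end)+1:n;                        % trailing rows and columns
    [L(I,I), U(I,I)] = small_lu(M(I,I));   % small block LU
    U(I,J) = L(I,I) \ M(I,J);              % triangular solve
    L(J,I) = M(J,I) / U(I,I);              % triangular solve
    M(J,J) = M(J,J) - L(J,I) * U(I,J);     % rank-k update (Schur complement)
  end
end

function [L, U] = small_lu(A)
% SMALL_LU  Unblocked LU without pivoting (hypothetical helper).
  n = size(A, 1);
  L = eye(n);
  U = A;
  for j = 1:n-1
    L(j+1:n, j) = U(j+1:n, j) / U(j, j);                      % multipliers
    U(j+1:n, j:n) = U(j+1:n, j:n) - L(j+1:n, j) * U(j, j:n);  % eliminate column j
  end
  U = triu(U);   % clear roundoff left below the diagonal
end
```

Saved as `block_lu.m`, it could be tried with something like `M = rand(6) + 6*eye(6); [L,U] = block_lu(M, 2); norm(L*U - M)`, which should give a residual near machine precision. Most of the arithmetic is in the two solves and the update, which is where a tuned level-3 BLAS would do the work.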
PLASMA: Parallel LA Software for Multicore Architectures
MAGMA: Matrix Algebra for GPU and Multicore Architectures
SLATE???
Much of this is housed at UTK ICL (Innovative Computing Laboratory, University of Tennessee, Knoxville)