CS 5220

Dense LA

David Bindel

2024-10-22

Logistics

Special talk!

CS colloquium / Salton lecture
Thursday, Oct 22, 11:45-12:45
Gates G01

Speaker: Jack Dongarra
Title: An Overview of High Performance Computing and Responsibly Reckless Algorithms

Please come out!

Participation

Trying a few things:

  • More small-group discussion
  • Poll Everywhere
  • CMS quizzes

Open to suggestions! Help me make this time most useful to you.

Projects 1 and 2

  • Project 1 is graded
    • NB re ChatGPT - doesn’t seem to have helped
  • Project 2 extended to 10/24
    • Wrapping earlier is good!
    • Do spend time on the write-up
    • P2 peer evals are due 10/25

Homework 3

Project 3: Shallow water

Final project

Final project basics

  • Scope
    • Analogous to other projects
    • Groups of 1-5
    • Must be related to performance
  • Platform
    • Does not need to be on Perlmutter
    • Or even parallel
    • Or even involve new code!

Final project ideas

From my own work:

  • Parallel optimization (variable evaluation time)
  • Smart GPU memory in a GPU-accelerated time stepper
  • Blocked relaxation methods for kernel systems
  • Bayesian optimization and auto-tuning
  • Mixed-precision linear algebra operations

Final project ideas

  • From this class
    • Previous semester projects
    • Current projects on other platforms
  • PDE solves, dynamic programming (shortest path), etc
  • Tree operations
  • Things from your research

Final project ideas

“But I want to parallelize ML workloads!”

  • Knock yourself out
    • I’m interested in training constitutive models…
  • Be aware of what course compute resources we have
  • Compare to a fair baseline

Final project timeline

  • Proposal due 10/31
  • Final outline 11/26 (just before Thanksgiving)
  • Project presentations 12/3-4
  • Final deliverable due 12/19 at 12 PM

Dense LA

Where we are

  • This week: dense linear algebra
  • Next week: sparse linear algebra

Nutshell NLA

  • Linear systems: \(Ax = b\)
  • Least squares: minimize \(\|Ax-b\|_2^2\)
  • Eigenvalues: \(Ax = \lambda x\)

Nutshell NLA

  • Basic paradigm: matrix factorization
    • \(A = LU\), \(A = LL^T\)
    • \(A = QR\)
    • \(A = V \Lambda V^{-1}\), \(A = Q T Q^T\)
    • \(A = U \Sigma V^T\)
  • Factorization \(\equiv\) switch to basis that makes problem easy
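
For example, once \(A = LU\) is in hand, solving \(Ax = b\) reduces to two triangular solves in that convenient basis: \[ Ly = b \quad \text{(forward substitution)}, \qquad Ux = y \quad \text{(back substitution)}. \]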

Dense and sparse

Two flavors: dense and sparse

Dense

Common structures, no complicated indexing

  • General dense (all entries nonzero)
  • Banded (zero below/above some diagonal)
  • Symmetric/Hermitian
  • Standard, robust algorithms (LAPACK)

Sparse

Stuff not stored in dense form!

  • Maybe few nonzeros
    • e.g. compressed sparse row formats
  • May be implicit (e.g. via finite differencing)
  • May be “dense”, but with a compact representation (e.g. via FFT)
  • Mostly iterations; varied and subtle
  • Build on dense ideas
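
As a preview of the sparse side, here is a minimal sketch of compressed sparse row (CSR) storage in C; the struct and field names are illustrative, not any particular library's API:

/* CSR storage for an m-by-n matrix with nnz stored entries (illustrative) */
typedef struct {
    int m, n, nnz;
    int    *row_ptr;  /* length m+1: row i occupies [row_ptr[i], row_ptr[i+1]) */
    int    *col_ind;  /* length nnz: column index of each stored entry */
    double *val;      /* length nnz: value of each stored entry */
} csr_matrix;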

History: BLAS 1 (1973–1977)

15 ops (mostly) on vectors

  • BLAS 1 == \(O(n^1)\) ops on \(O(n^1)\) data
  • Up to four versions of each: S/D/C/Z
  • Example: DAXPY
    • Double precision (real)
    • Computes \(y \leftarrow \alpha x + y\) for scalar \(\alpha\) (sketch below)
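
A minimal sketch of a DAXPY call through the CBLAS interface, assuming a CBLAS implementation (e.g. OpenBLAS) is linked; the vector values are made up for illustration:

#include <stdio.h>
#include <cblas.h>

int main(void) {
    double x[3] = {1.0, 2.0, 3.0};
    double y[3] = {4.0, 5.0, 6.0};

    /* y <- 2*x + y with unit strides: O(n) work on O(n) data */
    cblas_daxpy(3, 2.0, x, 1, y, 1);

    printf("%g %g %g\n", y[0], y[1], y[2]);  /* prints 6 9 12 */
    return 0;
}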

BLAS 1 goals

  • Raise level of programming abstraction
  • Robust implementation (e.g. avoid over/underflow)
  • Portable interface, efficient machine-specific implementation
  • Used in LINPACK (and EISPACK?)

History: BLAS 2 (1984–1986)

25 ops (mostly) on matrix/vector pairs

  • BLAS2 == \(O(n^2)\) ops on \(O(n^2)\) data
  • Different data types and matrix types
  • Example: DGEMV
    • Double precision
    • GEneral matrix
    • Matrix-Vector product
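
A corresponding CBLAS sketch for DGEMV, under the same assumptions (CBLAS linked, values made up):

#include <stdio.h>
#include <cblas.h>

int main(void) {
    /* y <- 1.0*A*x + 0.0*y for a 2-by-3 row-major A:
       O(n^2) work on O(n^2) data */
    double A[6] = {1, 2, 3,
                   4, 5, 6};
    double x[3] = {1, 1, 1};
    double y[2];

    cblas_dgemv(CblasRowMajor, CblasNoTrans, 2, 3,
                1.0, A, 3, x, 1, 0.0, y, 1);

    printf("%g %g\n", y[0], y[1]);  /* prints 6 15 */
    return 0;
}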

BLAS 2 goals

  • BLAS1 insufficient
  • BLAS2 for better vectorization (when vector machines roamed)

History: BLAS 3 (1987–1988)

9 ops (mostly) on matrix/matrix

  • BLAS3 == \(O(n^3)\) ops on \(O(n^2)\) data
  • Different data types and matrix types
  • Example: DGEMM
    • Double precision
    • GEneral matrix
    • Matrix-Matrix product
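
And a CBLAS sketch of the level-3 workhorse, DGEMM (again, values are illustrative):

#include <stdio.h>
#include <cblas.h>

int main(void) {
    /* C <- 1.0*A*B + 0.0*C for 2-by-2 row-major matrices:
       O(n^3) work on O(n^2) data */
    double A[4] = {1, 2,
                   3, 4};
    double B[4] = {5, 6,
                   7, 8};
    double C[4];

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);

    printf("%g %g %g %g\n", C[0], C[1], C[2], C[3]);  /* prints 19 22 43 50 */
    return 0;
}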

BLAS 3 goals

Efficient cache utilization!

Why BLAS?

LU for \(2 \times 2\): \[ \begin{bmatrix} a & b \\ c & d \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ c/a & 1 \end{bmatrix} \begin{bmatrix} a & b \\ 0 & d-bc/a \end{bmatrix} \]
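
For instance, with \(a = 2\), \(b = 1\), \(c = 4\), \(d = 3\): \[ \begin{bmatrix} 2 & 1 \\ 4 & 3 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} 2 & 1 \\ 0 & 1 \end{bmatrix} \]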

Why BLAS?

Block elimination \[ \begin{bmatrix} A & B \\ C & D \end{bmatrix} = \begin{bmatrix} I & 0 \\ CA^{-1} & I \end{bmatrix} \begin{bmatrix} A & B \\ 0 & D-CA^{-1}B \end{bmatrix} \]

Why BLAS?

Block LU \[\begin{split} \begin{bmatrix} A & B \\ C & D \end{bmatrix} &= \begin{bmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{bmatrix} \begin{bmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{bmatrix} \\ &= \begin{bmatrix} L_{11} U_{11} & L_{11} U_{12} \\ L_{21} U_{11} & L_{21} U_{12} + L_{22} U_{22} \end{bmatrix} \end{split}\]

Why BLAS?

Think of \(A\) as \(k \times k\), \(k\) moderate:

[L11,U11] = small_lu(A);   % Small block LU
U12 = L11\B;               % Triangular solve
L21 = C/U11;               % Triangular solve
S   = D-L21*U12;           % Rank k update
[L22,U22] = lu(S);         % Finish factoring

Three level-3 BLAS calls!

  • Two triangular solves
  • One rank-\(k\) update
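
In CBLAS terms, those three calls might look like the sketch below. It assumes the blocks live in separate row-major arrays (the leading block and its factors are \(k \times k\); B is \(k \times m\), C is \(m \times k\), D is \(m \times m\)), and that the small LU has already produced a unit lower triangular L11 and an upper triangular U11; the function name is illustrative.

#include <cblas.h>

/* Finish one block LU step: overwrite B with U12, C with L21, and
   D with the Schur complement S = D - L21*U12 (sketch, not library code). */
void block_lu_step(int k, int m, const double *L11, const double *U11,
                   double *B, double *C, double *D)
{
    /* U12 = L11 \ B : triangular solve from the left (DTRSM) */
    cblas_dtrsm(CblasRowMajor, CblasLeft, CblasLower, CblasNoTrans,
                CblasUnit, k, m, 1.0, L11, k, B, m);

    /* L21 = C / U11 : triangular solve from the right (DTRSM) */
    cblas_dtrsm(CblasRowMajor, CblasRight, CblasUpper, CblasNoTrans,
                CblasNonUnit, m, k, 1.0, U11, k, C, k);

    /* S = D - L21*U12 : rank-k update (DGEMM) */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, m, k, -1.0, C, k, B, m, 1.0, D, m);
}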

LAPACK (1989-present)

  • Supersedes earlier LINPACK and EISPACK
  • High performance through BLAS
    • Parallel to the extent BLAS are parallel (on SMP)
    • Linear systems, least squares – near 100% BLAS 3
    • Eigenproblems, SVD – only about 50% BLAS 3
  • Careful error bounds on everything
  • Lots of variants for different structures
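
A small usage sketch through the LAPACKE C interface (assuming LAPACKE is installed; the system below is made up):

#include <stdio.h>
#include <lapacke.h>

int main(void) {
    /* Solve Ax = b by LU with partial pivoting (DGESV), row-major storage */
    double A[4] = {2, 1,
                   4, 3};
    double b[2] = {3, 7};       /* exact solution is x = (1, 1) */
    lapack_int ipiv[2];

    lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 2, 1, A, 2, ipiv, b, 1);
    printf("info = %d, x = (%g, %g)\n", (int) info, b[0], b[1]);
    return 0;
}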

ScaLAPACK (1995-present)

  • Distributed-memory (MPI) implementations of LAPACK-style routines
  • Only a small subset of LAPACK functionality

PLASMA (2008-present)

Parallel Linear Algebra Software for Multicore Architectures

  • Target: Shared memory multiprocessors
  • Stacks on LAPACK/BLAS interfaces
  • Tile algorithms, tile data layout, dynamic scheduling
  • Other algorithmic ideas, too (randomization, etc)

MAGMA (2008-present)

Matrix Algebra on GPU and Multicore Architectures

  • Target: CUDA, OpenCL, Xeon Phi
  • Still stacks (e.g. on CUDA BLAS)
  • Again: tile algorithms + data, dynamic scheduling
  • Mixed precision algorithms (+ iterative refinement)
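
The iterative refinement idea in one line: factor \(A\) cheaply in low precision, then correct the solution using residuals computed in higher precision, \[ r_k = b - A x_k, \qquad A d_k \approx r_k \;\text{(solve with the low-precision factors)}, \qquad x_{k+1} = x_k + d_k. \]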

And beyond!

SLATE (Software for Linear Algebra Targeting Exascale)?

Much of this software is housed at the UTK Innovative Computing Laboratory (ICL)