CS 5220

Applications of Parallel Computers

Krylov subspace methods

Prof David Bindel


Goal

Solve \[ Ax = b, \] where \(A\) is sparse (or data sparse).

Krylov Subspace Methods

What if we only know how to multiply by \(A\)?
About all you can do is keep multiplying! \[ \mathcal{K}_k(A,b) = \operatorname{span}\left\{ b, A b, A^2 b, \ldots, A^{k-1} b \right\}. \] Gives surprisingly useful information!
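
A minimal sketch of the idea in C, assuming only a user-supplied matvec routine (the names krylov_basis, matvec, and ctx are illustrative). In practice one orthogonalizes as one goes (Lanczos/Arnoldi) rather than storing raw powers, which quickly become numerically dependent:

#include <string.h>

/* Build the columns of K = [b, A*b, ..., A^(k-1)*b], each of length n,
 * by repeated application of a user-supplied matvec.  The raw power
 * basis is only for illustration; real codes orthogonalize on the fly. */
void krylov_basis(void (*matvec)(const double* x, double* y, void* ctx),
                  void* ctx, const double* b, int n, int k, double* K)
{
    memcpy(K, b, n * sizeof(double));      /* first column is b           */
    for (int j = 1; j < k; ++j)            /* column j = A * (column j-1) */
        matvec(K + (j-1)*n, K + j*n, ctx);
}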

Example: Conjugate Gradients

If \(A\) is symmetric and positive definite,
\(Ax = b\) solves a minimization: \[\begin{aligned} \phi(x) &= \frac{1}{2} x^T A x - x^T b\\ \nabla \phi(x) &= Ax - b. \end{aligned}\] Idea: Minimize \(\phi(x)\) over \(\mathcal{K}_k(A,b)\).
Basis for the method of conjugate gradients
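
One small reasoning step worth making explicit: with \(x_* = A^{-1} b\) and \(\|v\|_A^2 = v^T A v\), completing the square gives \[ \phi(x) = \frac{1}{2} \|x - x_*\|_A^2 - \frac{1}{2} x_*^T A x_*, \] so minimizing \(\phi\) over \(\mathcal{K}_k(A,b)\) is the same as minimizing the \(A\)-norm of the error over that subspace.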

Example: GMRES

Idea: Minimize \(\|Ax-b\|^2\) over \(\mathcal{K}_k(A,b)\).
Yields Generalized Minimum RESidual (GMRES)
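
In the standard formulation (not specific to these slides), one builds an orthonormal basis \(Q_k\) of \(\mathcal{K}_k(A,b)\) with the Arnoldi process, giving \(A Q_k = Q_{k+1} \tilde{H}_k\) with \(\tilde{H}_k\) a small upper Hessenberg matrix. Writing \(x = Q_k y\), \[ \min_{x \in \mathcal{K}_k(A,b)} \|Ax - b\|_2 = \min_{y \in \mathbb{R}^k} \left\| \tilde{H}_k y - \beta e_1 \right\|_2, \qquad \beta = \|b\|_2, \] a small least-squares problem solved at each step.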

Convergence of Krylov Subspace Methods

  • KSPs are not stationary
    (no constant fixed-point iteration)
  • Convergence is surprisingly subtle!
  • CG convergence upper bound via condition number
    • Large condition number iff the level surfaces of \(\phi\) are long and narrow
    • True for Poisson and company

Convergence of Krylov Subspace Methods

  • Preconditioned problem \(M^{-1} A x = M^{-1} b\)
  • Whence \(M\)?
    • From a stationary method?
    • From a simpler/coarser discretization?
    • From approximate factorization?
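
Whatever its source, the PCG code on the next slide only touches \(M\) through a solve (Msolve). A minimal sketch in C of the simplest stationary-method choice, diagonal (Jacobi) scaling; the names msolve_jacobi and diagA are just for illustration, and as a later slide notes, plain Jacobi buys essentially nothing for Poisson:

/* z = M^{-1} r for M = diag(A).  The simplest possible preconditioner;
 * it illustrates the only interface PCG needs: a solve with M. */
void msolve_jacobi(const double* diagA, const double* r, double* z, int n)
{
    for (int i = 0; i < n; ++i)
        z[i] = r[i] / diagA[i];
}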

PCG

% Preconditioned CG for A*x = b; Msolve applies the preconditioner M^{-1}.
r = b-A*x;                     % initial residual
p = 0; beta = 0;               % so the first search direction is p = z
z = Msolve(r);                 % preconditioned residual z = M\r
rho = dot(r, z);
for i=1:nsteps
    p = z + beta*p;            % new search direction
    q = A*p;
    alpha = rho/dot(p, q);     % step length
    x += alpha*p;              % update solution ...
    r -= alpha*q;              % ... and residual
    if norm(r) < tol, break; end
    z = Msolve(r);             % apply preconditioner
    rho_prev = rho;
    rho = dot(r, z);
    beta = rho/rho_prev;       % coefficient for next direction
end

PCG parallel work

  • Solve with \(M\)
  • Product with \(A\)
  • Dot products and axpys
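
A minimal sketch, assuming MPI with each rank owning nlocal entries of every vector (pdot and paxpy are illustration names): the dot product needs a global reduction, which is the synchronization point in each PCG step, while the axpy is purely local.

#include <mpi.h>

/* Global dot product: local partial sum, then an all-reduce. */
double pdot(const double* x, const double* y, int nlocal, MPI_Comm comm)
{
    double local = 0.0, global = 0.0;
    for (int i = 0; i < nlocal; ++i)
        local += x[i] * y[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}

/* axpy (y += alpha*x) touches only local data -- no communication. */
void paxpy(double alpha, const double* x, double* y, int nlocal)
{
    for (int i = 0; i < nlocal; ++i)
        y[i] += alpha * x[i];
}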

Pushing PCG

  • Rearrange if \(M = LL^T\) is available
  • Or build around “powers kernel”
    • Old “s-step” approach of Chronopoulos and Gear
    • CA-Krylov of Hoemmen and Demmel
    • Hard to keep stable

Pushing PCG

Two real application levers:

  • Better preconditioning
  • Faster matvecs

PCG bottlenecks

Key: fast solve with \(M\), product with \(A\)

  • Some preconditioners parallelize better!
    (Jacobi vs Gauss-Seidel)
  • Balance speed against effectiveness.
    • Speed to set up \(M\)?
    • Speed to apply \(M\) after setup?
  • Cheaper to do two multiplies/solves at once...
    • Can’t exploit in obvious way — lose stability
    • Variants allow multiple products (CA-Krylov)
  • Lots of fiddling possible with \(M\); matvec with \(A\)?

Thinking on (basic) CG convergence

Consider 5-point stencil on an \(n \times n\) mesh.

  • Information moves one grid cell per matvec.
  • Cost per matvec is \(O(n^2)\).
  • At least \(O(n^3)\) work to get information across mesh!

Convergence by counting

  • Time to converge \(\geq\) time to move info across
  • For a 2D mesh (\(N = n^2\) unknowns): \(O(n)\) matvecs, \(O(n^3) = O(N^{3/2})\) cost
  • For a 3D mesh (\(N = n^3\) unknowns): \(O(n)\) matvecs, \(O(n^4) = O(N^{4/3})\) cost
  • “Long” meshes yield slow convergence

Convergence by counting

3D beats 2D because everything is closer!

  • Advice: sparse direct for 2D, CG for 3D.
  • Better advice: use a preconditioner!

Eigenvalue approach

Define the condition number for \(L\) s.p.d.: \[\kappa(L) = \frac{\lambda_{\max}(L)}{\lambda_{\min}(L)}\] Describes how elongated the level surfaces of \(\phi\) are.

Eigenvalue approach

  • For Poisson, \(\kappa(L) = O(h^{-2})\)
  • Steps to halve error: \(O(\sqrt{\kappa}) = O(h^{-1})\).
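
The standard bound behind this estimate (CG in exact arithmetic, error measured in the \(A\)-norm): \[ \|x_k - x_*\|_A \leq 2 \left( \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} \right)^k \|x_0 - x_*\|_A, \] so reducing the error by a fixed factor takes \(O(\sqrt{\kappa})\) steps, i.e. \(O(h^{-1})\) for Poisson.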

Similar back-of-the-envelope estimates hold for some other PDEs. But they are not always that useful: they can be pessimistic if there are only a few extreme eigenvalues.

Frequency-domain approach

[Figure: FFT of \(e_0\) vs. FFT of \(e_{10}\)]

Error \(e_k\) after \(k\) steps of CG gets smoother!

Preconditioning Poisson

  • CG already handles high-frequency error
  • Want something to deal with lower frequency!
  • Jacobi useless
    • Doesn’t even change Krylov subspace!
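
Why Jacobi does not even change the subspace: for the standard 5-point Poisson matrix the diagonal is constant, \(M = \operatorname{diag}(A) = (4/h^2) I\), and scaling \(A\) and \(b\) by the same constant leaves the span alone: \[ \mathcal{K}_k(M^{-1}A, M^{-1}b) = \mathcal{K}_k\left( \tfrac{h^2}{4} A, \tfrac{h^2}{4} b \right) = \mathcal{K}_k(A,b). \]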

Preconditioning Poisson

Better idea: block Jacobi?

  • Q: How should things split up?
  • A: Minimize the number of blocks needed to cross the domain.
  • Compatible with minimizing communication!

Multiplicative Schwarz

Generalizes block Gauss-Seidel

Restricted Additive Schwarz (RAS)

  • Get ghost cell data from neighbors (green)
  • Solve the local problem, ghost region included
  • Update only locally owned values for the next step
  • Default strategy in PETSc
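
A sketch of one RAS application \(z = M^{-1} r\) in C, with the communication and local solver as hypothetical placeholders (exchange_ghosts, local_subdomain_solve, and the array layout are assumptions for illustration, not PETSc's API):

#include <stdlib.h>

void exchange_ghosts(const double* r_owned, double* r_ext);     /* hypothetical */
void local_subdomain_solve(const double* r_ext, double* z_ext); /* hypothetical */

/* Owned entries come first in the extended (owned + ghost) numbering. */
void ras_apply(const double* r, double* z, int n_owned, int n_ghost)
{
    int n_ext = n_owned + n_ghost;
    double* r_ext = malloc(n_ext * sizeof(double));
    double* z_ext = malloc(n_ext * sizeof(double));

    exchange_ghosts(r, r_ext);            /* 1. get ghost cell data        */
    local_subdomain_solve(r_ext, z_ext);  /* 2. solve local + ghost region */
    for (int i = 0; i < n_owned; ++i)     /* 3. keep only owned values     */
        z[i] = z_ext[i];

    free(r_ext);
    free(z_ext);
}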

Multilevel Ideas

  • RAS moves info one processor per step
  • For scalability, still need to get around this!
  • Basic idea: use multiple grids
    • Fine grid gives lots of work, kills high-freq error
    • Coarse grid cheaply gets info across mesh, kills low freq
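
A schematic two-grid cycle, the simplest version of this idea: here \(S\) is a smoother (e.g. a few Jacobi or Gauss-Seidel sweeps), \(R\) restricts to the coarse grid, \(P\) prolongs back (often \(P = R^T\)), and \(A_c\) is the coarse-grid operator. Applied to a residual \(r\), it produces a correction \(z\): \[\begin{aligned} z &\leftarrow S(r) && \text{(pre-smooth: damp high-frequency error)}\\ r_c &\leftarrow R\,(r - Az) && \text{(restrict the remaining residual)}\\ z &\leftarrow z + P A_c^{-1} r_c && \text{(coarse solve: move info across the mesh)}\\ z &\leftarrow z + S(r - Az) && \text{(post-smooth)} \end{aligned}\]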

Tuning matmul

Can also tune matrix multiply

  • Represented implicitly (regular grids)
    • Example: Optimizing stencil operations (Datta)
  • Or explicitly (e.g. compressed sparse column)
    • Sparse matrix blocking and reordering
    • Packages: Sparsity (Im), OSKI (Vuduc)
    • Available as PETSc extension

Or further rearrange algorithm (Hoemmen, Demmel).

Reminder: Compressed sparse row

/* Compute y = A*x with A in compressed sparse row (CSR) form:
 * ptr[i]..ptr[i+1]-1 index the nonzeros of row i (values in A,
 * column indices in col). */
for (int i = 0; i < n; ++i) {
  y[i] = 0;
  for (int jj = ptr[i]; jj < ptr[i+1]; ++jj)
    y[i] += A[jj]*x[col[jj]];
}

Problem: y[i] += A[jj]*x[col[jj]];

Memory traffic in CSR multiply

Memory access patterns:

  • Elements of \(y\) accessed sequentially
  • Elements of \(A\) accessed sequentially
  • Accesses to \(x\) are scattered all over!

Can help by switching to block CSR.
Switching to single precision or shorter indices can also help memory traffic!
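
A sketch of the inner kernel for 2-by-2 block CSR (the names and layout are assumptions: one column index per block, blocks stored contiguously in row-major order). Each index load now covers four nonzeros, and \(x\) is touched two entries at a time:

/* y = A*x with 2x2 block CSR: block row I covers rows 2I and 2I+1;
 * bptr/bcol index blocks, bval holds 4 values per block (row-major). */
void bcsr2_matvec(int nb, const int* bptr, const int* bcol,
                  const double* bval, const double* x, double* y)
{
    for (int I = 0; I < nb; ++I) {
        double y0 = 0, y1 = 0;
        for (int jj = bptr[I]; jj < bptr[I+1]; ++jj) {
            const double* a  = bval + 4*jj;
            const double* xj = x + 2*bcol[jj];
            y0 += a[0]*xj[0] + a[1]*xj[1];
            y1 += a[2]*xj[0] + a[3]*xj[1];
        }
        y[2*I]   = y0;
        y[2*I+1] = y1;
    }
}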

Parallelizing matvec

  • Each processor gets a piece
  • Many partitioning strategies
  • Idea: re-order so one of these strategies is “good”
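
One simple (not communication-optimal) instance, assuming MPI: each rank owns a contiguous block of rows and the matching slice of \(x\), gathers the full vector, then runs the CSR loop from before on its rows. The routine and argument names are illustrative; a good partition or reordering tries to shrink the gather to just the ghost entries a rank actually needs.

#include <mpi.h>

/* Row-partitioned y = A*x.  This rank owns nlocal rows (CSR arrays ptr,
 * col, A use global column indices) and nlocal entries of x; counts and
 * displs describe every rank's slice of x so MPI_Allgatherv can assemble
 * the full vector into xfull. */
void par_csr_matvec(int nlocal, const int* ptr, const int* col,
                    const double* A, const double* xlocal, double* ylocal,
                    const int* counts, const int* displs, double* xfull,
                    MPI_Comm comm)
{
    MPI_Allgatherv(xlocal, nlocal, MPI_DOUBLE,
                   xfull, counts, displs, MPI_DOUBLE, comm);
    for (int i = 0; i < nlocal; ++i) {
        ylocal[i] = 0;
        for (int jj = ptr[i]; jj < ptr[i+1]; ++jj)
            ylocal[i] += A[jj] * xfull[col[jj]];
    }
}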

Reordering for matvec

SpMV performance goals:

  • Balance load?
  • Balance storage?
  • Minimize communication?
  • Good cache re-use?

Reordering also comes up for GE!