Applications of Parallel Computers
Krylov subspace methods
Prof David Bindel
Please click the play button below.
Goal
Solve \[
Ax = b,
\] where \(A\) is sparse (or data sparse).
Our goal for both of today’s lectures will be solving the linear system Ax = b where A is assumed to be sparse or data sparse. Really, the only thing that we need to assume for most of this slide deck is that we can do fast matrix-vector products with A.
There’s a lot in this deck, and I always go through slides faster in narration than I would in an ordinary lecture. Do take frequent breaks to stare at the slides and think a bit.
Krylov Subspace Methods
What if we only know how to multiply by \(A\) ?
About all you can do is keep multiplying! \[
\mathcal{K}_k(A,b) = \operatorname{span}\left\{
b, A b, A^2 b, \ldots, A^{k-1} b \right\}.
\] Gives surprisingly useful information!
If the only thing we assume is that we can multiply by A, and we want to solve Ax = b, what can we do? We take b and start multiplying by A, because there isn’t much else to do! This lets us build up a so-called Krylov subspace spanned by powers of A times b. Another way to think of this is that we’re computing the space of all possible products p(A) b for polynomials p with degree less than k. This looks like a somewhat contrived space the first time you see it, but it contains surprisingly useful information!
Once we have a Krylov subspace, our goal is to find an approximate solution in that space.
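For concreteness, here is a minimal MATLAB/Octave sketch (mine, not from the slides) of building an orthonormal basis for the Krylov subspace. The model matrix from gallery('poisson') and the sizes are just illustrative. The raw vectors b, Ab, A^2 b, ... become numerically dependent very quickly, so in practice we orthogonalize as we go; that is essentially the Arnoldi (or, for symmetric A, Lanczos) process.
A = gallery('poisson', 32);         % model 5-point Laplacian (sparse, SPD)
b = randn(size(A,1), 1);
k = 10;
V = zeros(length(b), k);            % columns will span K_k(A,b)
V(:,1) = b / norm(b);
for j = 2:k
  w = A * V(:,j-1);                 % one matvec per new basis vector
  for i = 1:j-1                     % modified Gram-Schmidt orthogonalization
    w = w - (V(:,i)' * w) * V(:,i);
  end
  V(:,j) = w / norm(w);
end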
Example: Conjugate Gradients
If \(A\) is symmetric and positive definite,
\(Ax = b\) solves a minimization: \[\begin{aligned}
\phi(x) &= \frac{1}{2} x^T A x - x^T b\\
\nabla \phi(x) &= Ax - b.
\end{aligned}\] Idea: Minimize \(\phi(x)\) over \(\mathcal{K}_k(A,b)\) .
Basis for the method of conjugate gradients
Recall that a matrix A is symmetric and positive definite if A = A^T and if x^T A x > 0 whenever x is nonzero. If A is symmetric and positive definite, then we can write the solution to Ax = b as the location of the minimizer for a quadratic function phi(x) = x^T A x/2 - x^T b.
The idea behind the conjugate gradient method is to minimize phi over a Krylov subspace. At least, this is one way of phrasing the method! There are other ways to approach it as well. For more details, I recommend to you my course notes from the last time I taught CS 6210 (Matrix Computations).
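One small piece of reasoning worth making explicit: completing the square shows that \[
\phi(x) - \phi(A^{-1} b) = \frac{1}{2} (x - A^{-1} b)^T A (x - A^{-1} b) = \frac{1}{2} \|x - A^{-1} b\|_A^2,
\] so minimizing \(\phi\) over \(\mathcal{K}_k(A,b)\) is exactly the same as minimizing the \(A\)-norm of the error over that space.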
Example: GMRES
Idea: Minimize \(\|Ax-b\|^2\) over \(\mathcal{K}_k(A,b)\) .
Yields Generalized Minimum RESidual (GMRES)
Not every system is symmetric and positive definite. For general linear systems, the method of choice is usually GMRES, which minimizes the residual norm (a least-squares criterion) over the Krylov subspace. At the level we’re discussing things, though, many of the issues are the same between CG and GMRES.
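As a concrete (and deliberately naive) sketch of the GMRES idea, not how production GMRES is implemented: build a basis for the Krylov subspace, orthonormalize it, and solve a small least-squares problem for the coefficients. Real GMRES builds the basis with Arnoldi and solves a small Hessenberg least-squares problem incrementally; the model matrix and sizes below are illustrative.
A = gallery('poisson', 32);     % model problem (any square A with fast matvecs works)
b = ones(size(A,1), 1);
k = 10;
K = b;                          % raw power basis for K_k(A,b)
for j = 2:k
  K(:,j) = A * K(:,j-1);
end
Q = orth(K);                    % orthonormalize (crude; Arnoldi in practice)
y = (A * Q) \ b;                % small dense least-squares problem
x = Q * y;                      % minimizer of ||A*x - b|| over the subspace
norm(A*x - b) / norm(b)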
Convergence of Krylov Subspace Methods
KSPs are not stationary
(no constant fixed-point iteration)
Convergence is surprisingly subtle!
CG convergence upper bound via condition number
Large condition number iff \(\phi(x)\) is narrow
True for Poisson and company
The convergence of Krylov subspace methods is surprisingly subtle. These aren’t fixed point iterations, and so we can’t use that theory. Moreover, iterations like CG converge to the true solution in a finite number of steps — in exact arithmetic, at least, the behavior differs in floating point — and so asymptotic statements have to be treated with some care.
The usual way we talk about convergence of methods like CG is via the condition number of the problem: the ratio of the largest to the smallest eigenvalue of the matrix. In the optimization formulation, having a large condition number corresponds to finding the bottom of an elongated bowl rather than a round one. Problems like our model Poisson problem tend to be ill-conditioned, and the conditioning actually gets worse as the discretization is refined. But this isn’t necessarily the most intuitive approach to reasoning about convergence in this case, so we’ll talk about a couple of ways that we can think about these concepts.
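For reference, the standard bound reads \[
\|x_k - x_*\|_A \le 2 \left( \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} \right)^k \|x_0 - x_*\|_A,
\qquad \kappa = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)},
\] which is why the number of iterations needed for a fixed error reduction is usually quoted as \(O(\sqrt{\kappa})\).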
Convergence of Krylov Subspace Methods
Preconditioned problem \(M^{-1} A x = M^{-1} b\)
Whence \(M\) ?
From a stationary method?
From a simpler/coarser discretization?
From approximate factorization?
Part of the reason we care about the convergence theory for methods like CG is that it helps us think through ways that we could transform the problem in order to get faster convergence. This type of transformation is called preconditioning.
A good preconditioner involves a matrix M for which we can do linear solves quickly, such that inv(M)*A is “close to” the identity. Examples of preconditioning strategies include sweeps of stationary methods like Jacobi, Gauss-Seidel, SOR, or their block variants; coarser discretizations of the PDE in question (if A comes from a fine discretization); approximate LU or Cholesky factorizations; and so forth. The best preconditioners often rely on knowledge of the “physics” of the problem. For example, you might drop some terms in a complicated PDE that correspond to physical effects that are nontrivial, but also don’t dominate. Or you might precondition by solving the problem on a geometrically simpler domain where you can apply transform methods.
Preconditioning is usually the single most important thing that you, as a user of iterative methods, can do to get the methods to converge fast. It’s also one of the things that requires the most insight and experimentation to get right.
PCG
% Preconditioned CG: solve A*x = b, where Msolve(r) applies the preconditioner,
% i.e. Msolve(r) approximates M\r.  Assumes x holds an initial guess on entry.
r = b - A*x;
p = 0; beta = 0;
z = Msolve(r);
rho = dot(r, z);
for i = 1:nsteps
  p = z + beta*p;                 % new search direction
  q = A*p;
  alpha = rho / dot(p, q);        % step to the minimum of phi along p
  x = x + alpha*p;
  r = r - alpha*q;                % update residual
  if norm(r) < tol, break; end
  z = Msolve(r);                  % preconditioner solve
  rho_prev = rho;
  rho = dot(r, z);
  beta = rho / rho_prev;
end
So far, we’ve been talking about Krylov subspace methods in the abstract. Let’s make it a little more concrete now.
This is the code for the preconditioned conjugate gradient method. Unless you’ve seen it before, it is probably utterly non-obvious how this minimizes a quadratic over a Krylov subspace! But it does.
At each step, we apply the preconditioner solve to the residual b-Ax. The resulting vector z is combined with the previous step direction in order to get a new step direction p. Then we move in the p direction from the current point until we get to the minimum along that ray; that’s alpha times p. Finally, we update the residual and check whether it’s small enough that we’re willing to declare convergence.
Feel free to stop and stare at this for a moment before moving on. One of the things to notice is the data dependencies in the code. There are several things that could be run concurrently or in a pipelined fashion, but there are at least two synchronization points per step (in the computation of alpha and the computation of rho, both involving dot products, both of which must be complete before we can do much else).
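To make the code above runnable, here is a usage sketch (mine, not from the slides): it sets up the variables the loop expects (A, b, x, Msolve, tol, nsteps) for a model Poisson problem with an incomplete Cholesky preconditioner, then compares against MATLAB’s built-in pcg with the same preconditioner. The problem size and tolerances are illustrative.
n = 64;
A = gallery('poisson', n);          % 5-point Laplacian on an n-by-n grid
b = ones(n^2, 1);
x = zeros(n^2, 1);                  % initial guess
L = ichol(A);                       % incomplete Cholesky factor, M = L*L'
Msolve = @(r) L' \ (L \ r);         % preconditioner solve
tol = 1e-8 * norm(b);               % the loop above uses an absolute test on norm(r)
nsteps = 500;
% ... run the PCG loop from the previous slide here ...
xref = pcg(A, b, 1e-8, nsteps, L, L');   % built-in PCG, roughly comparable stopping rule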
PCG parallel work
Solve with \(M\)
Product with \(A\)
Dot products and axpys
So where do we have opportunities for parallel work in this code? The big opportunities are in the preconditioner solve and the matrix-vector product. We can also get some parallelism from the dot products and the scale-and-sum (axpy) operations, but those don’t involve all that much work.
Pushing PCG
Rearrange if \(M = LL^T\) is available
Or build around “powers kernel”
Old “s-step” approach of Chronopoulos and Gear
CA-Krylov of Hoemmen and Demmel
Hard to keep stable
What I wrote down is a standard way of writing PCG. There’s an equivalent rearrangement if we have a factorization of M available (e.g. if there is a Cholesky factorization) that gives us a little more room per step for parallelism, which you can look up in the Templates book or similar resources. But these types of tricks at best may get us down to a single synchronization per step.
There are potentially more impressive opportunities that come from overlapping several steps of the method with each other. This is an old idea, but tricky to get right, as rearrangements that look equivalent in exact arithmetic can nonetheless behave differently in floating point. Unfortunately, those differences are huge when it comes to Krylov methods. Chronopoulos and Gear were pioneers in this idea, but Mark Hoemmen (my academic younger brother) helped push things further in his thesis work by using a different basis for the Krylov subspace. Needless to say, there’s some subtlety here.
Apart from the subtlety around their error behavior, one of the problems with the communication-avoiding Krylov ideas is that they aren’t easy to precondition.
Pushing PCG
Two real application levers:
Better preconditioning
Faster matvecs
So, if you use the standard preconditioned CG approach rather than a communication-avoiding version — and there are still good reasons why you might not use the communication-avoiding variants — you have two real levers. You can either use a better preconditioner, where better might mean “more amenable to parallelism” or “better tuned” or just “better at making the iteration converge”; or you can speed up the matrix-vector products.
But the biggest wins typically come through the preconditioner.
PCG bottlenecks
Key: fast solve with \(M\) , product with \(A\)
Some preconditioners parallelize better!
(Jacobi vs Gauss-Seidel)
Balance speed with performance.
Speed for set up of \(M\) ?
Speed to apply \(M\) after setup?
Cheaper to do two multiplies/solves at once...
Can’t exploit in obvious way — lose stability
Variants allow multiple products (CA-Krylov)
Lots of fiddling possible with \(M\) ; matvec with \(A\) ?
When we think about the speed of an algorithm like CG, there are two distinct effects that come into play: the cost per step, and the number of steps to convergence. Preconditioning affects both of these. Moreover, preconditioners often have a fixed setup cost that may be significant, particularly for the preconditioners that work best.
Needless to say, a lot of fiddling with preconditioners is possible. But to understand what preconditioners might be useful, and why, we need to understand a little about the convergence of the iteration without preconditioning.
Thinking on (basic) CG convergence
Consider 5-point stencil on an \(n \times n\) mesh.
Information moves one grid cell per matvec.
Cost per matvec is \(O(n^2)\) .
At least \(O(n^3)\) work to get information across mesh!
Our first approach to CG convergence involves thinking about information propagation. Suppose I wanted to solve a Poisson problem using the five-point stencil, where the right-hand side is one at the corner and zero everywhere else. The solution to the problem is nonzero on the whole domain. What about the approximations from CG? After one step, every vector in the space is zero outside a ball of radius 1 (in the Manhattan distance) around the corner. So there might be nonzeros at the locations in black and light blue, but not elsewhere. One step later, the nonzeros get to the green line, and so forth. So on an n-by-n mesh, it takes about 2n steps before we get vectors that are nonzero everywhere. Each step takes n^2 time, so we’re talking about O(n^3) time before we can even get any signal from one end of the mesh to the other!
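If you want to see this numerically, here is a small sketch (mine, not from the slides) that applies the five-point matrix repeatedly to a vector supported at one corner and counts how far the nonzeros have spread.
n = 32;
A = gallery('poisson', n);       % n^2-by-n^2 five-point matrix
v = zeros(n^2, 1);
v(1) = 1;                        % impulse at one corner of the grid
for k = 1:5
  v = A * v;                     % nonzeros spread one grid cell per matvec
  fprintf('after %d matvecs: %d nonzero entries\n', k, nnz(v));
end
% Any vector in K_k(A,b) is zero beyond Manhattan distance k-1 from the corner.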
Convergence by counting
Time to converge \(\geq\) time to move info across
For a 2D mesh: \(O(n)\) matvecs, \(O(n^3) = O(N^{3/2})\) cost
For a 3D mesh: \(O(n)\) matvecs, \(O(n^4) = O(N^{4/3})\) cost
“Long” meshes yield slow convergence
It’s pretty intuitive that we’d need at least as long to converge as we need to move information across the mesh. For this particular problem, the converse also turns out to be true: we can cut the error by a constant factor in time proportional to the time it takes to move information across the mesh. For a 2D mesh, that means that it takes O(N^1.5) time to reduce the error by a constant factor on a box; in 3D, it is O(N^(4/3)). Of course, not every domain is shaped like a box! Convergence is worse for long skinny meshes than it is for box-like meshes.
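Making the substitution \(N = n^d\) explicit: \[
\text{2D: } N = n^2, \quad O(n) \text{ matvecs} \times O(n^2) \text{ per matvec} = O(n^3) = O(N^{3/2}),
\] \[
\text{3D: } N = n^3, \quad O(n) \text{ matvecs} \times O(n^3) \text{ per matvec} = O(n^4) = O(N^{4/3}).
\]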
Convergence by counting
3D beats 2D because everything is closer!
Advice: sparse direct for 2D, CG for 3D.
Better advice: use a preconditioner!
You could take from the argument that I just made that it’s better to use CG in 3D and sparse direct methods in 2D, just because that’s what the connectivity favors. But really, we should say that in either 2D or 3D, we want preconditioners that can move data globally across the domain, or this type of “count the steps to cross” argument will put a lower bound on the time needed to converge.
Eigenvalue approach
Define the condition number \(\kappa(L)\) for s.p.d. \(L\): \[\kappa(L) = \frac{\lambda_{\max}(L)}{\lambda_{\min}(L)}\] Describes how elongated the level surfaces of \(\phi\) are.
Another way to reason about convergence, and the way we usually do it in a numerical linear algebra class, is to talk about the condition number of the problem — the ratio between the largest and smallest eigenvalues of the matrix.
Eigenvalue approach
For Poisson, \(\kappa(L) = O(h^{-2})\)
Steps to halve error: \(O(\sqrt{\kappa}) = O(h^{-1})\) .
Similar back-of-the-envelope estimates for some other PDEs. But these are not always that useful... can be pessimistic if there are only a few extreme eigenvalues.
It turns out that we can reason out how the condition number scales as a function of mesh density for many PDE discretizations. For Poisson, the condition number scales like 1/h^2 where h is the mesh spacing. The number of steps to halve the error is bounded on the order of the square root of the condition number, which gives us something proportional to n = 1/h steps — the same estimate we got before from counting the steps to move information across the mesh.
However, this type of argument based on the condition number can be pessimistic.
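For the concrete case of the standard 5-point Laplacian on an \(n \times n\) grid with spacing \(h = 1/(n+1)\), the eigenvalues are known in closed form: \[
\lambda_{i,j} = 4\left( \sin^2 \frac{i\pi h}{2} + \sin^2 \frac{j\pi h}{2} \right), \qquad 1 \le i,j \le n,
\] so \(\lambda_{\min} \approx 2\pi^2 h^2\), \(\lambda_{\max} \approx 8\), and \(\kappa \approx 4/(\pi^2 h^2) = O(h^{-2})\).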
Frequency-domain approach
Error \(e_k\) after \(k\) steps of CG gets smoother!
It turns out that there’s a very distinct pattern to how CG gets rid of the error, which we can see if we look at the Fourier transform of the error step by step. It turns out that the “high frequency” components of the error are suppressed very quickly; it is the low-frequency part of the error that takes longer for us to kill off. This sort of fits our intuition from the information-propagation arguments.
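Here is a small sketch (mine, not from the slides) that makes this visible on a 1D Poisson problem: run CG for a few different iteration counts and look at the frequency content of the error. The exact eigenbasis for this matrix is the discrete sine transform; a plain fft of the error gives a rough but adequate picture, and the specific sizes are illustrative.
n = 200;
A = gallery('tridiag', n, -1, 2, -1);    % 1D Poisson, Dirichlet boundary conditions
b = rand(n, 1);
xstar = A \ b;
for k = [1 5 20 50]
  [xk, ~] = pcg(A, b, 1e-15, k);         % runs exactly k CG steps (tolerance never met)
  e = xstar - xk;
  ehat = abs(fft(e));
  hf = norm(ehat(round(n/4):round(3*n/4))) / norm(ehat);   % rough high-frequency fraction
  fprintf('k = %2d:  ||e|| = %8.2e,  high-freq fraction = %5.2f\n', k, norm(e), hf);
end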
Preconditioning Poisson
CG already handles high-frequency error
Want something to deal with lower frequency!
Jacobi useless
Doesn’t even change Krylov subspace!
All right. Given these different ways of thinking about convergence of CG on the Poisson problem, what should we do to effectively precondition the iteration? What we really want is something to move information long distances across the mesh, or to deal with the smooth part of the error — different views of the same thing, really.
It turns out that Jacobi, and even Gauss-Seidel, are most effective at reducing the oscillatory (high-frequency) part of the error, which CG already handles well, so they aren’t particularly effective preconditioners here. Jacobi is particularly useless; for the 5-point stencil, the diagonal part of the matrix is constant, and so Jacobi preconditioning literally does nothing to the sequence of iterates chosen by CG!
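A quick way to see the last claim: if the diagonal is a constant \(c\), then Jacobi preconditioning uses \(M = cI\), and \[
\mathcal{K}_k(M^{-1}A, M^{-1}b) = \mathcal{K}_k(c^{-1}A, c^{-1}b) = \mathcal{K}_k(A, b),
\] so the subspace is unchanged; and since \(\phi\) depends only on \(A\) and \(b\), the CG iterates are unchanged too.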
Preconditioning Poisson
Better idea: block Jacobi?
Q: How should things split up?
A: Minimize blocks across domain.
Compatible with minimizing communication!
A better idea is to use block Jacobi. Then we could at least hope to move information across the width of a block. The question, of course, is what the blocks should look like. A natural thing to do in a distributed-memory setting is to have each block correspond to the part of the vector (or the part of the domain) owned by one processor. If the time to convergence is bounded by the number of block steps across the domain, we’d like to minimize the diameter of the blocked graph. It turns out that this is very compatible with our goal of cutting things up in a way that minimizes the amount of communication we need to do at processor boundaries!
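Here is a serial sketch (mine, not from the slides) of a block-Jacobi preconditioner solve, with contiguous index blocks standing in for per-processor subdomains; the names block_jacobi_setup and apply are illustrative.
function Msolve = block_jacobi_setup(A, nb)
% Build a block-Jacobi preconditioner solve for SPD A, with nb index blocks.
  n = size(A, 1);
  edges = round(linspace(0, n, nb+1));
  blocks = cell(nb, 1);
  R = cell(nb, 1);
  for k = 1:nb
    blocks{k} = (edges(k)+1):edges(k+1);
    R{k} = chol(A(blocks{k}, blocks{k}));   % factor each diagonal block once (setup cost)
  end
  Msolve = @(r) apply(r, blocks, R);
end

function z = apply(r, blocks, R)
% One application: independent small solves, one per block (parallel-friendly).
  z = zeros(size(r));
  for k = 1:length(blocks)
    I = blocks{k};
    z(I) = R{k} \ (R{k}' \ r(I));           % solve A(I,I) * z(I) = r(I)
  end
end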
Multiplicative Schwarz
Generalizes block Gauss-Seidel
Block Jacobi — or block Gauss-Seidel — are OK starting points for preconditioning this problem and related problems. But it turns out that there is a tweak that we can make that can often do much better than block Jacobi or block Gauss-Seidel.
The idea here goes back to Schwarz, who was motivated by theoretical questions about PDEs. Consider something like the Poisson equation on a domain described by a union of simple shapes, like the rectangle and the circle here. We know how to solve the PDE on a rectangle or a circle using separation of variables, but what about the combination? Schwarz proposed an iterative approach to getting a solution: solve the PDE on the rectangular part with fake data on the interior part of the boundary; then solve on the circle part, taking boundary data from the solve on the rectangle; then solve on the rectangle again, taking boundary data from the solve on the circle. Iterating back and forth between these two domains converges fairly quickly.
Schwarz wasn’t doing this as a numerical method, but the same idea works well numerically. Instead of solving on disjoint subsets of variables, block by block, we can solve on overlapping subsets of variables! The Gauss-Seidel like iteration described above, where we solve on one subset of variables, then the next, then the next, is known as multiplicative Schwarz. There is also a Jacobi-like iteration where we solve on each subdomain independently and then add up all the corrections; this is known as additive Schwarz.
Restrictive Additive Schwarz (RAS)
Get ghost cell data (green)
Solve everything local (including neighbor data)
Update local values for next step (local)
Default strategy in PETSc
On the face of it, additive Schwarz sounds a bit stupid. Why make more than one independently-computed correction to the same set of variables within an overlap region? It turns out that computing the corrections based on overlapping regions but then only applying one correction to any single unknown works rather better. This is known as restrictive additive Schwarz.
Funny story: restrictive additive Schwarz was supposedly invented because of a programming error in an implementation of additive Schwarz! Remember that as you’re grumbling about the errors in your project work…
As a brief aside: restrictive additive Schwarz is not a symmetric and positive definite preconditioner, so it cannot be used with CG. It works fine with GMRES, though.
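A serial sketch of one restricted additive Schwarz application (my illustration, not from the slides): owned{k} is the set of indices a given “processor” owns, and extended{k} is that set plus an overlap/ghost layer; the name ras_apply is illustrative.
function z = ras_apply(A, r, owned, extended)
% One RAS preconditioner application: solve on each overlapped subdomain,
% but keep only the correction on the owned (non-overlapping) indices.
  z = zeros(size(r));
  for k = 1:length(owned)
    E = extended{k};
    zE = A(E, E) \ r(E);                % local solve, including neighbor (ghost) data
    [in, loc] = ismember(owned{k}, E);  % positions of owned indices inside E
    z(owned{k}(in)) = zE(loc(in));      % discard corrections in the overlap
  end
end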
Multilevel Ideas
RAS moves info one processor per step
For scalability, still need to get around this!
Basic idea: use multiple grids
Fine grid gives lots of work, kills high-freq error
Coarse grid cheaply gets info across mesh, kills low freq
Methods like restrictive additive Schwarz move information faster than one grid cell per step, but they are still limited to moving by one block per step. When we have many processors, that can still be an issue. For scalability, we really need to be able to move data across multiple processors at a time. We usually would do this by using RAS together with a “coarse grid solve.” The idea is that RAS on the fine grid kills off the high-frequency error, and the coarse grid solve kills off the smooth part of the error.
Of course, there’s no need to stop with just a fine grid and a coarse grid. We could have grids at multiple resolutions, which is the key idea behind multigrid methods.
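For reference, one common way to write a two-level preconditioner of this flavor in additive form (the notation varies across the literature) is \[
M^{-1} = \sum_i \tilde{R}_i^T \left( R_i A R_i^T \right)^{-1} R_i \;+\; P A_c^{-1} P^T,
\qquad A_c = P^T A P,
\] where \(R_i\) restricts to subdomain \(i\) (with \(\tilde{R}_i\) the restricted version used in RAS) and \(P\) interpolates from the coarse grid to the fine grid.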
Tuning matmul
Can also tune matrix multiply
Represented implicitly (regular grids)
Example: Optimizing stencil operations (Datta)
Or explicitly (e.g. compressed sparse column)
Sparse matrix blocking and reordering
Packages: Sparsity (Im), OSKI (Vuduc)
Available as PETSc extension
Or further rearrange algorithm (Hoemmen, Demmel).
We’ve focused for a few slides on the preconditioner. What about making the matrix-vector product run fast? After all, that’s the other big part of the work in CG.
Making sparse matrix-vector products fast is actually pretty hard. There has been work on it, of course. For problems like 2D Poisson, where the computation is very regular and we don’t need to explicitly store the matrix, Kaushik Datta worked on tuned algorithms as part of his PhD thesis work. And for explicitly-represented sparse matrices, there are other packages that block and re-order matrix representations in order to get better cache performance.
Reminder: Compressed sparse row
/* Sparse matrix-vector product y = A*x with A in compressed sparse row form:
   row i's nonzero values sit in A[ptr[i]..ptr[i+1]-1], with columns in col[]. */
for (int i = 0; i < n; ++i) {
    y[i] = 0;
    for (int jj = ptr[i]; jj < ptr[i+1]; ++jj)
        y[i] += A[jj]*x[col[jj]];   /* indirect, irregular access to x */
}
Problem: y[i] += A[jj]*x[col[jj]];
So why do I say that optimizing this type of operation is hard? Well, let’s consider matrix-vector products implemented in compressed sparse row format. It’s a pretty simple algorithm with two nested loops, shown here. Unfortunately, the only place where we might hope to get any cache re-use is in the accesses to x. And the access to the x array is not regularly strided; it’s all over the place! So what can we do?
Memory traffic in CSR multiply
Memory access patterns:
Elements of \(y\) accessed sequentially
Elements of \(A\) accessed sequentially
Access to \(x\) are all over!
Can help by switching to block CSR.
Switching to single precision, short indices can help memory traffic, too!
Well, mostly we can change the data structure. We might switch to a blocked representation that keeps dense submatrices; then we can fetch several entries of x in sequence and work on all of them together. Or we could switch to shorter integers to represent the indices, thus taking less memory traffic to read in the indexing data. But given that we have to at least read an element of A at every step, we’re limited in the amount of cache re-use we can hope to get.
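A rough arithmetic-intensity estimate makes the point, assuming 8-byte values and 4-byte column indices: each nonzero costs 2 flops (one multiply, one add) but requires streaming at least 12 bytes of matrix data, so \[
\text{intensity} \le \frac{2 \text{ flops}}{12 \text{ bytes}} \approx 0.17 \text{ flops/byte}
\] even before counting traffic for \(x\) and \(y\), which is far below what most machines need to keep their floating-point units busy.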
Parallelizing matvec
Each processor gets a piece
Many partitioning strategies
Idea: re-order so one of these strategies is “good”
What if we want to parallelize the matvec? A natural idea is to partition the rows of A, giving each processor a piece. There are a number of possible partitioning strategies, and the key idea is to re-order things so that one of these strategies is “good.”
Reordering for matvec
SpMV performance goals:
Balance load?
Balance storage?
Minimize communication?
Good cache re-use?
Reordering also comes up for GE!
What do we mean by good? Maybe we mean that each processor should require the same amount of computation per matvec, or use the same amount of storage. Maybe we’d like to minimize interprocessor communication, or strive for good cache re-use. It turns out that all these things are fairly complementary.
It also turns out that this type of re-ordering is critical to sparse Gaussian elimination, which is the subject of our next slide deck. So we’ll pick up with this topic next time!