Applications of Parallel Computers
Sparse direct methods
Prof David Bindel
Reminder: Sparsity and reordering
Permute unknowns for better SpMV or
Stability of Gaussian elimination,
Fill reduction in Gaussian elimination,
Improved performance of preconditioners...
Let’s start not with Gaussian elimination, but with reordering of the equations and unknowns in a linear system. This is a necessary part of a number of sparse linear algebra problems, whether to achieve better stability for Gaussian elimination (as in partial pivoting), to reduce the cost of Gaussian elimination (as we’ll discuss shortly), or to improve the performance of various preconditioners.
Reminder: Sparsity and partitioning
Want to partition sparse graphs so that
Subgraphs are same size (load balance)
Cut size is minimal (minimize communication)
Matrices that are “almost” diagonal are good?
When we think about reordering operations, we almost invariably think about the graphs that are associated with the matrices. We’re thinking about square matrices, so we have the same number of rows and columns. The idea is that for each row or column, we have a node in the graph; and we include an edge from node i to node j whenever A_ij is nonzero. For today, we’ll deal with the (structurally) symmetric case, which gives us undirected graphs.
Closely related to the problem of reordering is the problem of partitioning a graph into blocks that are about the same size, but with few edges between them. Then we can assign those blocks to different processors, and have minimal communication between processors for either matrix multiplication or (ideally) for Gaussian elimination. The simplest version of this is reordering matrices to be almost diagonal, a structure that is good for both sparse matrix-vector products and for Gaussian elimination.
Reordering for bandedness
Reverse Cuthill-McKee
Select “peripheral” vertex \(v\)
Order according to breadth first search from \(v\)
Reverse ordering
One way to try to order unknowns to make things as narrowly banded as possible is to order by (reverse) breadth-first search from a peripheral vertex. This is known in this business as reverse Cuthill-McKee ordering.
The plots at the top show the nonzero pattern for the five-point stencil discretization of the Laplacian on a grid. This is sometimes called a spy plot: the idea is that each dot is a nonzero, and blank space corresponds to zero elements. The first plot is in “natural” (i.e. column major) order. Each row associated with an interior point has five nonzeros: the diagonal and the first super- and subdiagonals (corresponding to the up and down neighbors), and two nonzeros further out that correspond to the left and right neighbors. The second plot is the reordered version using reverse Cuthill-McKee. As you can see, the maximum bandwidth is about the same either way, but reverse Cuthill-McKee gives us a narrower bandwidth at the top and bottom.
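If you want to reproduce a picture like this yourself, here is a minimal MATLAB sketch (not the exact code behind the slides; it assumes the built-in gallery and symrcm functions):

A = gallery('poisson', 20);          % 5-point stencil on a 20-by-20 grid
p = symrcm(A);                       % reverse Cuthill-McKee permutation
subplot(1,2,1); spy(A);      title('natural (column-major) order');
subplot(1,2,2); spy(A(p,p)); title('reverse Cuthill-McKee');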
From iterative to direct
RCM ordering is great for SpMV
But isn’t narrow banding good for solvers, too?
LU takes \(O(nb^2)\) where \(b\) is bandwidth.
Great if there’s an ordering where \(b\) is small!
Reverse Cuthill-McKee ordering is good for matrix-vector multiply, but it is also great for direct factorization. In particular, the LAPACK banded matrix solvers take O(nb^2) time – rather than O(n^3) time – to compute the factorization of a band matrix with bandwidth b.
This is great if we’ve got a narrowly banded matrix, but what if that’s not the case? And in the previous slide, we saw that reverse Cuthill-McKee ordering makes the matrix “more narrowly banded” in some average sense, but it does not do much for the maximum bandwidth. So what can we do with this type of “lumpy” band structure?
Skylines and profiles
Profile solvers generalize band solvers
Skyline storage for lower triangle: for each row \(i\),
Start and end of storage for nonzeros in row.
Contiguous nonzero list up to main diagonal.
In each column, first nonzero defines a profile.
All fill-in confined to profile.
RCM is again a good ordering.
The LAPACK solvers only deal with a fixed bandwidth, but it turns out that there is a not-so-different class of solvers called profile solvers that deal with matrices where the bandwidth varies. In this case, we typically store both the original matrix and the factors in skyline storage format, which is a generalization of the band format: in skyline format, we keep a count of the number of nonzeros per row (of the lower triangle, say, in the symmetric case), and then store those nonzeros in a contiguous block of memory. It turns out that unless we do pivoting, which is not needed in the symmetric positive definite case, all the fill-in gets confined to the profile.
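To make the “fill stays inside the profile” claim concrete, here is a small sketch (it uses the SPD model problem and MATLAB’s chol; the profile count is just the area under the skyline of the upper triangle):

A = gallery('poisson', 10);                 % SPD 5-point Laplacian, 100 unknowns
p = symrcm(A);  B = A(p, p);                % RCM to shrink the profile
n = size(B, 1);
firstnz = zeros(n, 1);
for j = 1:n
    firstnz(j) = find(B(:, j), 1, 'first'); % first nonzero row in column j
end
profile_size = sum((1:n)' - firstnz + 1);   % entries under the skyline
R = chol(B);                                % upper-triangular factor, no pivoting
fprintf('nnz(R) = %d <= profile = %d\n', nnz(R), profile_size);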
Beyond bandedness
Minimum bandwidth for 2D model problem? 3D?
Skyline only gets us so much farther
Even with a reordering like reverse Cuthill-McKee, it’s hard to get away from the fact that we only get so far with banded structure. For a 2D or 3D block, the minimum of the max bandwidth scales like n or n^2, respectively, where n is the number of mesh points per side. In particular, the complexity of a band solver in 2D or 3D looks worse than the time for un-preconditioned CG to reduce the error by a constant factor.
All those back-of-the-envelope estimates are in the vanilla band solver case. Profile solvers with skyline storage will get us a little farther, but not so much.
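If you want to see that bandwidth growth for yourself, here is a quick sketch (it assumes the built-in bandwidth function from reasonably recent MATLAB releases, plus gallery and symrcm):

for n = [10 20 40 80]
    A = gallery('poisson', n);              % 2D model problem, N = n^2 unknowns
    p = symrcm(A);                          % reverse Cuthill-McKee
    [b, ~] = bandwidth(A(p,p));             % half-bandwidth after reordering
    fprintf('n = %3d: N = %5d, RCM half-bandwidth = %d\n', n, n^2, b);
end
% Band LU costs O(N b^2); with b growing like n, that is O(n^4) = O(N^2).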
Beyond bandedness
But more general solvers have similar structure
Ordering (minimize fill)
Symbolic factorization (where will fill be?)
Numerical factorization (pivoting?)
... and triangular solves
So we don’t want to stop with band solvers. But it turns out that more general sparse direct solvers involve a similar set of ingredients. First, we want to reorder the matrix in order to minimize the number of nonzeros that will appear in our factorization, just like we try to minimize the area under the skyline before running a profile solver. Then we want to figure out where the nonzeros in the factors will appear, so we can allocate storage appropriately – trivial in the skyline case, not so trivial more generally. Then we need to actually do the factorization, though maybe we’re rightfully worried that we ought to pivot for stability in the general case. And, finally, to solve linear systems we’ll end up doing some triangular solves with the computed factors, and the complexity of those triangular solves is directly proportional to the number of nonzeros in the triangular factors.
All right, this is pretty abstract. Let’s make the picture concrete with a couple of examples.
Troublesome Trees
One step of Gaussian elimination completely fills this matrix!
First, let’s consider an “arrow” matrix with the nonzero pattern shown here. What happens if we do one step of Gaussian elimination? The trailing submatrix – the Schur complement – fills in completely! This looks like bad news.
We can actually think about the fill in terms of a graph operation: when we eliminate a variable, we remove it from consideration and add edges between all of its neighbors. In this case, removing the root yields a residual graph in which we have to add connections between all the leaves.
Terrific Trees
Full Gaussian elimination generates no fill in this matrix!
Now let’s consider looking at the same problem, but just ordering the unknowns differently, working from the leaves up to the root in order to get a “downward pointing” arrow. What happens when we run elimination on this matrix? Try walking through the algorithm by hand! You will find that we don’t have to deal with any nonzeros that lie outside the nonzero structure of the original matrix – that is, there are no “fill” elements.
Again, we can see the lack of fill in terms of the residual graph operation we described a moment ago. When we get rid of the first leaf — the green one — we have to connect all its neighbors to each other. But the only neighbor is the root, so we get no new edges! And something similar happens for eliminating all the other leaves, until the root is the last thing left.
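If you would rather check this numerically than by hand, here is a tiny sketch (the matrix is a made-up SPD “arrow,” not literally the one on the slides):

n = 6;
A = n*speye(n);  A(1, 2:n) = 1;  A(2:n, 1) = 1;     % hub in row/column 1
R_bad  = chol(A);              % hub first: dense upper triangle
p = [2:n, 1];                  % leaves first, hub last
R_good = chol(A(p, p));        % no fill outside the original pattern
fprintf('nnz: hub first = %d, leaves first = %d\n', nnz(R_bad), nnz(R_good));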
Graphic Elimination
Consider first steps of GE
A(2:end,1) = A(2:end,1)/A(1,1);       % multipliers: first column of L
A(2:end,2:end) = A(2:end,2:end)-...   % rank-1 update of the trailing
    A(2:end,1)*A(1,2:end);            % submatrix (the Schur complement)
Nonzero in the outer product at \((i,j)\) if A(i,1) and A(j,1) both nonzero — that is, if \(i\) and \(j\) are both connected to 1.
General: Eliminate variable, connect remaining neighbors.
Let’s try to connect the Gaussian elimination picture and the residual graph picture a little more carefully. Suppose A is symmetric, and we do one step of Gaussian elimination. What happens in the Schur complement? The only entries that are modified are those where A(i,1) and A(j,1) are both nonzero; that is, entries corresponding to nodes that share node 1 as a mutual neighbor in the original graph.
Terrific Trees Redux
Order leaves to root \(\implies\)
on eliminating \(i\), parent of \(i\) is only remaining neighbor.
With this residual graph picture in our head, we can see (with some thinking) that for a matrix whose graph is a tree, an elimination order that moves from the leaves up — e.g., using something like a post-order traversal — generates no fill. Moreover, if we wanted to find opportunities for parallelism, we might note that we can deal with disjoint subtrees on different processors, and only need to communicate as we get close to the root.
Of course, you might object that tree-structured matrices seem like they’d be pretty rare in practice, and you’d be right. Fortunately, there are lots of interesting matrices that look a little like trees when we take off our glasses.
Nested Dissection
Idea: Think of block tree structures.
Eliminate block trees from bottom up.
Can recursively partition at leaves.
What do I mean when I say that lots of graphs look like trees when we take off our glasses? Well, a less colloquial way of putting it is to say that it is OK not to have a tree-structured matrix if we can reorder in such a way that a blocked version of the matrix is structured like a tree. For example, consider this five-by-five mesh, which has the great advantage that it fits on a single slide. If we reorder the blue nodes first, then the green nodes, and then the black nodes, the resulting matrix has a block arrow structure, with the corresponding graph shown on the right. And while we can see fill within these blocks, we will not see any fill outside of the block structure.
Of course, one of the key techniques in computer science is divide and conquer. So if we can partition the matrix into a block form in a way that tells us how to do elimination well, we want to keep applying the idea to do partitioning within blocks until we eventually run out of steam. This idea of recursively partitioning a graph in order to get a deep tree-like structure is called nested dissection.
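Here is one way to mimic that picture (a sketch of a single level of dissection; it assumes the column-major numbering that gallery('poisson', 5) uses for the 5-by-5 grid, and the block colors are mine, not the slide’s):

A = gallery('poisson', 5);               % 25 unknowns, column-major numbering
idx = reshape(1:25, 5, 5);               % idx(i,j) = unknown at grid point (i,j)
left = idx(:, 1:2);  right = idx(:, 4:5);  sep = idx(:, 3);
p = [left(:); right(:); sep(:)]';        % separator ordered last
spy(chol(A(p, p)));                      % fill confined to the block arrow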
Nested Dissection
Rough cost estimate: how much just to factor dense Schur complements associated with separators?
Notice graph partitioning appears again!
And again we want small separators!
It turns out that when we do nested dissection, the cost is largely dominated by the time to solve dense Schur complement systems on each of the separators. That is, in our five-by-five mesh example, after we get rid of all the blue and green unknowns, we will be left with a fully dense trailing submatrix for the black unknowns (all five of them), which we would solve by ordinary dense factorization. More generally, the top separator for an n-by-n mesh has size n, and factoring the resulting dense matrix takes O(n^3) time (or N^(1.5)). And it turns out that this captures the asymptotic complexity of the algorithm overall.
You might notice in all this that the key to our complexity result is the existence of small separators that partition the graph at various levels. So graph partitioning appears again! This idea of having small separators is not unique to Gaussian elimination by the way; it shows up in other graph algorithms, too. Those of you who have taken an algorithms class where treewidth is discussed might recognize our argument as saying that the treewidth for our n-by-n mesh graph is just n.
As an aside: you might notice also that the top-level separator factorization is O(n^3), but it’s going to be a pretty efficient O(n^3) – this is all level 3 BLAS stuff. So while the asymptotic complexity of a sparse direct solve with nested dissection on our model problem is no better than the asymptotic complexity of unpreconditioned CG on the same problem, there’s a big difference in the constants, and you’re generally better off using the sparse direct solver (or using a good preconditioner for CG).
In comparison, band LU for this matrix would take O(n^4) (or N^2) time.
Nested Dissection
Model problem: Laplacian with 5 point stencil (for 2D)
ND gives optimal complexity in exact arithmetic
(George 73, Hoffman/Martin/Rose)
2D: \(O(N \log N)\) memory, \(O(N^{3/2})\) flops
3D: \(O(N^{4/3})\) memory, \(O(N^2)\) flops
It turns out that this nested dissection strategy is asymptotically optimal for our model problem, and for a wide set of related problems coming from local discretization of elliptic PDEs by finite elements or finite differences. In 2D, the cost is O(N^1.5); in 3D, the cost is O(N^2). The more prohibitive cost is the memory cost, though; in particular, solving a big 3D problem with a sparse direct method is likely to run you out of memory on lots of machines!
When I say that this is an optimal complexity algorithm, I mean that it’s optimal complexity for solving the problem exactly in exact arithmetic, looking only at the graph structure. But there’s some wiggle room in that statement. Those dense separator factorizations that I described in the previous slide have a lot of structure to them; in fact, for problems like the model problem, these separator problems tend to look a lot like discretizations of boundary integral equations. These discretizations have the property that interactions between nodes that are far away from each other are “smooth” in a way that lends itself to compression. From a linear algebra perspective, we get nearly low-rank submatrices in these separator subproblems, and that’s enough to get very good — though not exact — solvers for the separator problems in sub-cubic time. There has been a lot of work over the past decade on “super-fast” nested dissection and related algorithms based on these ideas.
Minimum Degree
Locally greedy strategy
Want to minimize upper bound on fill-in
Fill \(\leq\) (degree in remaining graph)\(^2\)
At each step
Eliminate vertex with smallest degree
Update degrees of neighbors
Problem: Expensive to implement!
But better variants via quotient graphs
Variants often used in practice
Nested dissection seems reasonable if we are given a matrix that comes from a regular mesh. But what about more general sparse matrices? We can still try to use nested dissection as a strategy, but there are other possible options for getting low-fill orderings, too. One of the more popular ones is minimum degree ordering, which is a greedy strategy that essentially says that we should eliminate the minimal-degree vertex in our residual graph at each step. Approximate minimum degree algorithms and variants are often used in practice — this is the default that you get when you use the sparse LU solver in MATLAB, for example.
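If you want to compare orderings without forming any factors, MATLAB’s symbfact does the symbolic factorization for you; here is a small sketch on the 2D model problem (the grid size and the particular orderings compared are my choice):

A = gallery('poisson', 60);                         % N = 3600 unknowns
orderings = {1:size(A,1), symrcm(A), symamd(A)};    % natural, RCM, approx. min degree
names = {'natural', 'RCM', 'AMD'};
for k = 1:numel(orderings)
    p = orderings{k};
    fprintf('%-8s  nnz in Cholesky factor = %d\n', names{k}, sum(symbfact(A(p,p))));
end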
Elimination Tree
Variables (columns) are nodes in trees
\(j\) a descendant of \(k\) if eliminating \(j\) updates \(k\)
Can eliminate disjoint subtrees in parallel!
As we briefly mentioned before, for a tree-structured matrix, we can eliminate disjoint subtrees in parallel. We get a generalization of this approach with elimination trees, where the idea is that j is a descendant of k if eliminating j updates k in the Schur complement. And whether the tree comes from nested dissection or we’ve used some other ordering, we can always eliminate disjoint subtrees in parallel.
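MATLAB exposes the elimination tree directly through etree, so a small sketch of this picture looks like the following (the ordering choice is arbitrary; any fill-reducing ordering gives a reasonable tree):

A = gallery('poisson', 5);
p = symamd(A);                  % any fill-reducing ordering will do
parent = etree(A(p, p));        % parent(j) = first column that column j updates
treeplot(parent);               % disjoint subtrees can be factored in parallel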
Cache locality
Basic idea: exploit “supernodal” (dense) structures in factor
e.g. arising from elimination of separator Schur complements in ND
Other alternatives exist (multifrontal solvers)
Of course, we also care about getting good cache locality during elimination. One possible organization for cache locality is the so-called “supernodal” approach, which organizes the factorization around columns of L or U that have very similar sparsity patterns (and so can be stored together in a compressed dense format). The other major organizational pattern is called the multifrontal approach.
Do I really want you to understand supernodal and multifrontal approaches in detail? Not really. I mostly want to get across that these methods have a lot of level 3 BLAS work internally, and so they make pretty efficient use of modern processors.
Pivoting
Pivoting is painful, particularly in distributed memory!
Cholesky — no need to pivot!
Threshold pivoting — pivot when things look dangerous
Static pivoting — try to decide up front
What if things go wrong with threshold/static pivoting?
Common theme: Clean up sloppy solves with good residuals
So far, I’ve been talking about unpivoted elimination. For symmetric and positive definite matrices like those from our model problem, there is no need to pivot. But not every matrix is symmetric and positive definite, and so we sometimes want to pivot for numerical stability. That said, pivoting is so painful for sparse Gaussian elimination — particularly in a distributed memory setting — that we might well decide to try to maintain our desired order as much as possible even if we have to give up a little numerical stability. For example, we might only pivot when things start to look really bad (known as threshold pivoting), or we might try to pick a good pivot order statically and just fudge things a bit if it turns out to be terrible. Either way, we’d typically try to recover the stability by using the factorization not on its own, but as a part of an iterative refinement scheme or a flexible Krylov method.
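As a concrete illustration, MATLAB’s sparse lu lets you set the pivot tolerance, so a threshold-pivoted solve plus a residual check looks roughly like this sketch (the test matrix is made up, and the 0.1 tolerance is just an example value):

A = sprand(500, 500, 0.01) + 20*speye(500);   % made-up nonsymmetric test matrix
b = randn(500, 1);
[L, U, P, Q] = lu(A, 0.1);                    % pivot tolerance 0.1 instead of strict 1.0
x = Q * (U \ (L \ (P * b)));
fprintf('relative residual = %.2e\n', norm(b - A*x) / norm(b));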
Direct to iterative
Can improve solution by iterative refinement: \[\begin{aligned}
PAQ &\approx LU \\
x_0 &\approx Q U^{-1} L^{-1} Pb \\
r_0 &= b-Ax_0 \\
x_1 &\approx x_0 + Q U^{-1} L^{-1} P r_0
\end{aligned}
\] Looks like approximate Newton on \(F(x) = Ax-b = 0\).
This is just a stationary iterative method!
Nonstationary methods work, too.
I actually talk about this idea of iterative refinement when I teach matrix computations, coming at it from the other direction: there, I analyze the refinement algorithm and hand-wave at why iterative refinement might be useful for HPC, and here I tell you why it’s useful for HPC without telling you the analysis!
The basic picture is the same either way. Start with an approximate factorization — this is the expensive part. Then solve the problem approximately using the triangular factors; compute a residual accurately; and do another approximate solve with the residual in order to get a correction. Wash, rinse, and repeat (though we don’t have to repeat too much if we have a good factorization).
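In code, one round of this recipe might look like the following sketch (the threshold-pivoted factorization and the made-up test matrix mirror the previous slide’s example; the three refinement steps are arbitrary):

A = sprand(500, 500, 0.01) + 20*speye(500);
b = randn(500, 1);
[L, U, P, Q] = lu(A, 0.1);                 % approximate factorization, PAQ ~ LU
solve = @(r) Q * (U \ (L \ (P * r)));      % apply Q U^{-1} L^{-1} P
x = solve(b);                              % x_0
for k = 1:3
    r = b - A*x;                           % residual with the true A
    x = x + solve(r);                      % refinement step
end
fprintf('final relative residual = %.2e\n', norm(b - A*x) / norm(b));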
Variations on a theme
If we’re willing to sacrifice some on factorization,
Single precision factor + double precision refinement?
Sloppy factorizations (marginal stability) + refinement?
Modify \(m\) small pivots as they’re encountered (low rank updates), fix with \(m\) steps of a Krylov solver?
Suffice it to say, though, this type of iterative refinement scheme is a great idea whenever you’ve got a sloppy solver of some sort. That might mean that you’re dealing with some sloppiness because you didn’t pivot properly, or it might mean that you’ve computed a factorization in single precision when you really wanted double precision accuracy.
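As a rough illustration of the single-precision idea (MATLAB sparse matrices are double-only, so this sketch uses a small, made-up dense matrix just to show the shape of the computation):

A = randn(500) + 500*eye(500);            % made-up, comfortably well-conditioned matrix
b = randn(500, 1);
[L, U, P] = lu(single(A));                % factor in single precision
x = double(U \ (L \ (P * single(b))));    % sloppy single-precision solve
for k = 1:3
    r = b - A*x;                          % residual in double precision
    x = x + double(U \ (L \ (P * single(r))));
end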
Parting advice
Sparse direct for 2D problems
Gets more expensive for 3D problems
Approximate direct solves make good preconditioners
Let me wrap up the discussion of sparse linear algebra with a bit of advice. Unless you want to be an expert in the area, you probably don’t want to build your own sparse direct solvers. But there are good sparse direct codes out there that other people have written. For 2D problems, you might be able to find iterations that are asymptotically more efficient than these direct solvers, but asymptotics aren’t everything — the constants for the sparse direct solvers are pretty good, because of their reliance on level 3 BLAS.
For 3D, the picture is more complicated. Direct solvers are hard to beat for robustness, and that’s nothing to sneer at — commercial finite element vendors still default to sparse direct solvers, because they basically just work even on hard problems, and without a user having to learn how to massage preconditioned iterative methods into working well. But if you can figure out a good preconditioning strategy, the advantage generally goes to preconditioned iterative methods. Even there, though, it’s worth having a sparse direct solver in your toolkit in order to compute approximate factorizations as a preconditioner, or to compute submatrix factorizations for a block Jacobi or Schwarz scheme, or to do the coarse-grid solves in a multigrid scheme.
In the words of that immortal numerical analyst, Shrek: solvers are like onions! They have layers!
OK, and with that final joke — which is perhaps funny only to me — I think it’s time to wrap this up.