CS 6241: Numerics for Data Science
Sparse least squares and iterations
Direct methods for large least squares
Our brief discussion of methods for least squares has so far focused on dense factorization methods (normal equations with Cholesky, QR, and SVD). For an $m \times n$ matrix $A$, these dense factorizations cost $O(mn^2)$ time, which becomes prohibitive as the problem dimensions grow.

If $m$ is large but $n$ is not too large (say on the order of a few hundreds, or even 1000-2000), we might still use a standard factorization, but arranged to be efficient on parallel machines. The tall skinny QR (TSQR) approach involves breaking the observations into groups, computing a factorization for each group, and then recursively combining the factorizations.

When $n$ is also large, the matrix is often sparse: that is, most of the entries of $A$ are zero. Sometimes $A$ is instead data-sparse: that is, it has special structure that can be described with far fewer than $mn$ parameters.
In the sparse case, we are sometimes able to use sparse direct methods. That is, if $A$ is sparse, the factors in a QR factorization of $A$ (or a Cholesky factorization of $A^T A$) may themselves remain mostly sparse, particularly after a fill-reducing reordering of the columns, so we can afford to compute and store them.
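As a concrete illustration, here is a minimal sketch in Python using SciPy's sparse machinery. The problem sizes and density are made up, and a sparse LU factorization of $A^T A$ stands in for the sparse Cholesky factorization a production code would more likely use.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Made-up sparse problem: m >> n, about 1% of entries nonzero.
rng = np.random.default_rng(0)
m, n = 10000, 200
A = sp.random(m, n, density=1e-2, format="csc", random_state=rng)
b = rng.standard_normal(m)

# Form and factor the sparse normal equations A^T A x = A^T b.  splu
# stands in for a sparse Cholesky here; it applies a fill-reducing
# ordering internally so the factors stay sparse.
AtA = (A.T @ A).tocsc()
x = spla.splu(AtA).solve(A.T @ b)
```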
Frequently, though, sparse direct methods are simply impractical. In this case, we turn to iterative methods.
Iterative methods
We started with a discussion of direct methods for solving least squares problems based on matrix factorizations. These methods have a well-understood running time, and they produce a solution that is accurate except for roundoff effects. For larger or more complicated problems, though, we turn to iterative methods that produce a series of approximate solutions.
In particular, we will consider gradient and stochastic gradient approaches, Newton and Gauss-Newton, and (block) coordinate descent. We will see additional solver ideas as we move through the class, but these are nicely prototypical examples that illustrate two running themes in the design of numerical methods for optimization.
Fixed point iterations
All our nonlinear solvers (and some of our linear solvers) will be iterative. We can write most as fixed point iterations
$$x^{k+1} = G(x^k),$$
which we hope will converge to a fixed point $x^* = G(x^*)$. If $G$ is a contraction near $x^*$ (Lipschitz with constant $\lambda < 1$), the error $e^k = x^k - x^*$ shrinks by at least a factor of $\lambda$ at each step, and the iteration converges linearly.
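A minimal generic driver, assuming only that the map $G$ is given as a Python callable (the function name and tolerances here are our own choices):

```python
import numpy as np

def fixed_point(G, x0, tol=1e-8, maxit=1000):
    """Iterate x <- G(x) until successive iterates agree to within tol."""
    x = x0
    for k in range(maxit):
        x_new = G(x)
        if np.linalg.norm(x_new - x) < tol:
            return x_new, k + 1
        x = x_new
    return x, maxit

# Example: G(x) = cos(x) is a contraction near its fixed point x ~ 0.739.
x_star, iters = fixed_point(np.cos, np.array([1.0]))
```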
Model-based methods
Most nonquadratic problems are too hard to solve directly. On the other hand, we can model hard nonquadratic problems by simpler (possibly linear) problems as a way of building iterative solvers. The most common tactic — but not the only one! — is to approximate the nonlinear function by a linear or quadratic function and apply all the things we know about linear algebra. We will return to this idea when we discuss Newton-type methods for optimization.
Gradient descent
One very simple iteration is steepest descent or gradient descent:
$$x^{k+1} = x^k - \alpha_k \nabla \phi(x^k),$$
where $\alpha_k$ is a step size.
To understand the convergence of this method, consider gradient descent with a fixed step size $\alpha$ applied to the quadratic model problem $\phi(x) = \frac{1}{2} x^T A x - x^T b$, where $A$ is symmetric and positive definite. The iteration is then $x^{k+1} = x^k - \alpha (A x^k - b)$, and subtracting off the fixed point $x^* = A^{-1} b$ gives the error iteration $e^{k+1} = (I - \alpha A) e^k$. The iteration converges from any starting point iff the spectral radius of $I - \alpha A$ is less than one, i.e. iff $0 < \alpha < 2/\lambda_{\max}(A)$; the optimal fixed step $\alpha = 2/(\lambda_{\min} + \lambda_{\max})$ gives linear convergence with rate $(\kappa - 1)/(\kappa + 1)$, where $\kappa = \lambda_{\max}/\lambda_{\min}$ is the condition number of $A$.
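The following short experiment (a sketch with a made-up SPD test matrix and spectrum) checks this behavior numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((50, 50)))
lam = np.linspace(1.0, 100.0, 50)        # eigenvalues; kappa = 100
A = Q @ np.diag(lam) @ Q.T               # SPD with known spectrum
b = rng.standard_normal(50)
x_star = np.linalg.solve(A, b)

alpha = 2.0 / (lam.min() + lam.max())    # optimal fixed step size
x = np.zeros(50)
for k in range(500):
    x = x - alpha * (A @ x - b)          # gradient step on the quadratic

# Linear convergence with rate (kappa-1)/(kappa+1) ~ 0.98 per step
print(np.linalg.norm(x - x_star))
```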
The benefits of slow convergence
How steepest descent behaves on a quadratic model is how it behaves generally: if the Hessian at a strong local minimizer has condition number $\kappa$, gradient descent with a well-chosen fixed step converges linearly at a rate of roughly $1 - O(1/\kappa)$, so ill-conditioned problems converge slowly.
Somewhat surprisingly, sometimes we want this slow convergence. To illustrate why, consider the Landweber iteration, which is steepest descent applied to the linear least squares objective $\frac{1}{2} \|Ax - b\|^2$:
$$x^{k+1} = x^k + \alpha_k A^T (b - A x^k).$$
If we start from $x^0 = 0$ with a fixed step size, the components of the solution associated with large singular values of $A$ converge quickly, while those associated with small singular values converge slowly. Stopping the iteration early therefore damps the directions associated with small singular values (the ones most sensitive to noise in $b$) and acts as a regularizer.
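A small sketch of this effect, with a made-up ill-conditioned problem (the singular value decay and noise level are arbitrary choices):

```python
import numpy as np

def landweber(A, b, alpha, nsteps):
    """Run nsteps of Landweber iteration starting from x = 0."""
    x = np.zeros(A.shape[1])
    for _ in range(nsteps):
        x = x + alpha * (A.T @ (b - A @ x))
    return x

# Ill-posed test problem: rapidly decaying singular values plus noise.
rng = np.random.default_rng(2)
U, _ = np.linalg.qr(rng.standard_normal((100, 100)))
V, _ = np.linalg.qr(rng.standard_normal((20, 20)))
s = 2.0 ** -np.arange(20)                # singular values decay rapidly
A = U[:, :20] @ np.diag(s) @ V.T
x_true = rng.standard_normal(20)
b = A @ x_true + 1e-6 * rng.standard_normal(100)

alpha = 1.0 / s[0] ** 2                  # stable: alpha < 2 / sigma_max^2
x_early = landweber(A, b, alpha, 50)     # early stopping: regularized
x_late = landweber(A, b, alpha, 50000)   # closer to the noisy LS solution
```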
Preconditioning stationary iterations
While the slow convergence of iterations like Landweber has some surprising advantages, sometimes it is just a pain. However, we can speed up these iterations by preconditioning the problem. That is, rather than applying the Landweber iteration to the problem $\min_x \|Ax - b\|^2$ directly, we apply it to the transformed problem $\min_y \|A M^{-1} y - b\|^2$ with $x = M^{-1} y$, where the preconditioner $M$ is chosen so that $A M^{-1}$ is much better conditioned than $A$ while systems with $M$ remain cheap to solve.
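A minimal sketch, assuming a Jacobi-style diagonal preconditioner built from the column norms of $A$ (one simple choice among many; the test matrix is made up):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((200, 30)) * (10.0 ** (-np.arange(30) / 4.0))
b = rng.standard_normal(200)             # A has badly scaled columns

d = np.linalg.norm(A, axis=0)            # M = diag(d), so M^{-1} v = v / d
AM = A / d                               # A M^{-1} has unit-norm columns
alpha = 1.0 / np.linalg.norm(AM, 2) ** 2 # safe step: alpha < 2 / sigma_max^2

y = np.zeros(30)
for k in range(1000):
    y = y + alpha * (AM.T @ (b - AM @ y))  # Landweber on the scaled problem
x = y / d                                # recover x = M^{-1} y
```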
Krylov subspace iterations
We now consider a more general class of iterative methods that accelerate the convergence of methods like Landweber. There are many ways to derive these accelerated methods, known as Krylov subspace methods. We deliberately choose a somewhat unorthodox description that highlights the connections to other accelerated solvers we will encounter later in the class, as well as to our final unit on learning dynamical systems from data.
Let’s momentarily consider the case of solving a linear system $Ax = b$ by a stationary iteration
$$x^{k+1} = x^k + M^{-1}(b - A x^k);$$
the Landweber iteration applied to the normal equations is one example.
Stationary iterations are an example of a linear time-invariant (LTI) dynamical system in discrete time. The dynamics can be described entirely by the eigenvalue decomposition of the iteration matrix $R = I - M^{-1} A$: the error $e^k = x^k - x^*$ satisfies $e^{k+1} = R e^k$, so $e^k = R^k e^0$, and the asymptotic convergence rate is governed by the spectral radius of $R$.
An alternative approach is to “learn” the filter from the data by considering all possible filtered sequences, i.e. we consider approximations of the form $\tilde{x}^k = \sum_{j=0}^k c_j x^j$ with $\sum_{j=0}^k c_j = 1$, so that the filtered error is $\tilde{e}^k = p(R) e^0$ for a degree-$k$ polynomial $p$ with $p(1) = 1$. Because $x^{j+1} - x^j = R^j (x^1 - x^0)$, these filtered approximations sweep out the affine space $x^0 + \mathcal{K}_k(R, z)$ with $z = x^1 - x^0$, where $\mathcal{K}_k(R, z) = \operatorname{span}\{z, Rz, \ldots, R^{k-1} z\}$ is a Krylov subspace.
It turns out that Krylov subspaces often contain very good approximations to the solution to a linear system. Different Krylov subspace methods choose the “best” approximate solution to $Ax = b$ from the subspace according to different criteria: conjugate gradients (CG) minimizes the $A$-norm of the error when $A$ is symmetric positive definite, while MINRES minimizes the residual norm when $A$ is symmetric.
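To make the “learn the filter” picture concrete, the following sketch runs a Richardson iteration (i.e. $M = I/\alpha$, with a made-up SPD test matrix), saves the iterates, and solves a small least squares problem for the affine combination that minimizes the residual norm:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
B = np.eye(n) + 0.1 * rng.standard_normal((n, n))
A = B @ B.T                              # SPD test matrix (made up)
b = rng.standard_normal(n)

alpha = 1.0 / np.linalg.norm(A, 2)       # Richardson step, M = I/alpha
X = [np.zeros(n)]
for k in range(20):
    X.append(X[-1] + alpha * (b - A @ X[-1]))
X = np.stack(X, axis=1)                  # iterates span x^0 + Krylov space

# Find the affine combination (coefficients summing to one) of the
# iterates that minimizes ||A x - b||: write c = e_last + N z, where the
# columns of N sum to zero, and solve an unconstrained least squares.
m = X.shape[1] - 1
N = np.vstack([np.eye(m), -np.ones((1, m))])
z, *_ = np.linalg.lstsq(A @ X @ N, b - A @ X[:, -1], rcond=None)
c = np.concatenate([np.zeros(m), [1.0]]) + N @ z
x_best = X @ c

print(np.linalg.norm(A @ X[:, -1] - b))  # plain stationary iterate
print(np.linalg.norm(A @ x_best - b))    # filtered (Krylov) approximation
```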
Applying CG and MINRES to the normal equations gives algorithms that, while effective in principle, are not as numerically stable as one might like. One can rearrange these algorithms to get more stable versions specifically for least squares problems; CG in this setting is the basis for LSQR, and MINRES is the basis for LSMR.
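Both solvers are available in SciPy and require only matrix-vector products with $A$ and $A^T$; a minimal usage sketch (problem sizes and tolerances are arbitrary):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Made-up sparse least squares problem.
rng = np.random.default_rng(5)
A = sp.random(5000, 300, density=1e-2, format="csr", random_state=rng)
b = rng.standard_normal(5000)

# LSQR (CG-like) and LSMR (MINRES-like) on min ||Ax - b||.
x_lsqr = spla.lsqr(A, b, atol=1e-10, btol=1e-10)[0]
x_lsmr = spla.lsmr(A, b, atol=1e-10, btol=1e-10)[0]
```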
Gradient descent with errors
Before we turn to stochastic gradient descent, let us instead look at how to analyze gradient descent with errors. In particular, consider the iteration
$$x^{k+1} = x^k - \alpha_k (\nabla \phi(x^k) + u^k),$$
where $u^k$ is an error in the computed gradient. For the quadratic model problem with fixed step size $\alpha$, the error $e^k = x^k - x^*$ satisfies $e^{k+1} = (I - \alpha A) e^k - \alpha u^k$, and unrolling the recurrence gives
$$e^k = (I - \alpha A)^k e^0 - \alpha \sum_{j=0}^{k-1} (I - \alpha A)^{k-1-j} u^j.$$
The first term is the usual error iteration, which converges to zero for $0 < \alpha < 2/\lambda_{\max}(A)$. In order to analyze the second term, we need some additional sort of control on the errors $u^k$. In the simplest case, that control might be deterministic. For example, if we can guarantee that $\|u^k\| \leq \delta$ for all $k$, then with $\rho = \|I - \alpha A\|_2 < 1$ we have
$$\limsup_{k \to \infty} \|e^k\| \leq \alpha \delta \sum_{j=0}^{\infty} \rho^j = \frac{\alpha \delta}{1 - \rho};$$
that is, the iteration converges not to the solution, but to a ball around the solution whose radius is proportional to the error bound $\delta$.
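A small experiment along these lines, with a made-up diagonal quadratic and an arbitrary error bound $\delta$ (the perturbation directions here happen to be random, but only the norm bound is used in the analysis):

```python
import numpy as np

rng = np.random.default_rng(6)
lam = np.linspace(1.0, 10.0, 20)         # eigenvalues of a diagonal test A
A = np.diag(lam)
b = rng.standard_normal(20)
x_star = b / lam

alpha = 2.0 / (lam.min() + lam.max())    # rho = ||I - alpha A|| = 9/11
delta = 1e-3                             # gradient error bound (made up)
x = np.zeros(20)
for k in range(2000):
    u = rng.standard_normal(20)
    u *= delta / np.linalg.norm(u)       # perturbation with ||u|| = delta
    x = x - alpha * (A @ x - b + u)

# The error stalls near alpha * delta / (1 - rho) = delta in this setup.
print(np.linalg.norm(x - x_star))
```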
We will pick up this iteration again next time under the assumption that the errors are random, which is what happens in the stochastic gradient method.