CS 6241: Numerics for Data Science
SVD and other low-rank decompositions
Some factorization tools
We generally seek to approximately factor a data matrix $A \in \mathbb{R}^{m \times n}$ as a product of matrices with simpler structure. Among the factorization tools we will use are the following.
The symmetric eigendecomposition
$$A = Q \Lambda Q^T,$$
where $Q$ is an orthogonal matrix ($Q^T Q = I$) of eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues. We will also consider the decomposition for the generalized problem, in which $AQ = MQ\Lambda$ and $Q^T M Q = I$, where $M$ is a symmetric and positive definite matrix.
The singular value decomposition
$$A = U \Sigma V^T,$$
where $U$ and $V$ are orthogonal matrices (in the full version) or matrices with orthonormal columns (in the economy version), and $\Sigma$ is the diagonal matrix of singular values. The SVD is a veritable Swiss Army knife of matrix factorizations; we will use it on its own and as a building block for other methods.
The pivoted QR factorization
$$A\Pi = QR,$$
where $\Pi$ is a permutation of the columns of $A$, $Q$ is orthogonal, and $R$ is upper triangular. The permutation is chosen to guarantee that the magnitudes of the diagonal entries of $R$ appear in descending order. The pivoted QR factorization also has a geometric interpretation that is useful in several settings.
The interpolative decomposition (ID)
$$A \approx CT,$$
where $C$ is drawn from the columns of $A$ and the entries of $T$ are bounded in size (no more than two). The factorization is exact when $A$ is low rank. In many cases, the columns in $C$ are the columns that would be chosen by pivoted QR factorization.
The CUR factorization
$$A \approx CUR,$$
where $C$ and $R$ are drawn from the columns and rows of $A$.
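For concreteness, here is a minimal Python sketch of the first three of these factorizations (an illustration using NumPy/SciPy on a small synthetic matrix; MATLAB's eig, svd, and qr play the same roles):

```python
import numpy as np
from scipy.linalg import eigh, svd, qr

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20)) @ rng.standard_normal((20, 30))  # a rank-20 data matrix

# Symmetric eigendecomposition of the (PSD) Gram matrix: A^T A = Q diag(lam) Q^T
lam, Q = eigh(A.T @ A)                      # eigenvalues in ascending order

# Economy singular value decomposition: A = U diag(s) V^T
U, s, Vt = svd(A, full_matrices=False)      # singular values in descending order

# Column-pivoted QR: A[:, piv] = Qf R with |R[0,0]| >= |R[1,1]| >= ...
Qf, R, piv = qr(A, mode="economic", pivoting=True)

assert np.allclose(A, U @ np.diag(s) @ Vt)
assert np.allclose(A[:, piv], Qf @ R)
```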
The symmetric eigendecomposition and the SVD are both closely tied to “nice” continuous optimization problems, and this connection allows us to make very strong statements about them. In contrast, pivoted QR, ID, and CUR involve discrete choices involving column and row selection, and it is much more difficult to analyze the properties that arise from these discrete choices. At the same time, these latter factorizations are often interpretable in a way that the eigendecomposition and SVD are not.
The symmetric eigenvalue problem
Among their many other uses, in data analysis tasks a real symmetric matrix typically encodes pairwise relationships between objects, such as similarities, covariances, or the weights on the edges of a graph.
From the linear algebra perspective, a symmetric matrix also represents a quadratic form, i.e. a purely quadratic function of many variables. For a concrete vector space, we write this as $\phi(v) = v^T A v$.
A real symmetric matrix is always diagonalizable with real eigenvalues, and has an orthonormal basis of eigenvectors $q_1, \ldots, q_n$; collecting these as the columns of an orthogonal matrix $Q$ gives the eigendecomposition $A = Q \Lambda Q^T$.
If we order the eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$, then the extreme eigenvalues are the extreme values of the quadratic form on the unit sphere:
$$\lambda_1 = \max_{\|v\|_2 = 1} v^T A v, \qquad \lambda_n = \min_{\|v\|_2 = 1} v^T A v,$$
with the maximum and minimum attained at the corresponding eigenvectors.
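As a quick numerical check of this characterization, here is a small NumPy sketch with a random symmetric test matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((200, 200))
A = (B + B.T) / 2                       # a random symmetric matrix

lam, Q = np.linalg.eigh(A)              # eigenvalues in ascending order
rho = lambda v: (v @ A @ v) / (v @ v)   # Rayleigh quotient

print(np.isclose(rho(Q[:, -1]), lam[-1]))   # the top eigenvector attains lambda_1
print(max(rho(rng.standard_normal(200)) for _ in range(1000)) <= lam[-1])  # random vectors never exceed it
```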
The optimization problem associated with the symmetric eigenvalue problem is not convex – far from it! But it is a nice optimization problem nonetheless. It has saddle points, but the only local minima and maxima are also global minima and maxima, so even standard optimizers are not prone to getting stuck in a local minimum. But there are also specialized iterations for solving the symmetric eigenvalue problem that are very efficient. While the details of these methods are a topic for a different class, it is worth sketching just a few ideas in order to understand the different types of solvers available and their complexities:
Most of the direct solvers¹ for the symmetric eigenvalue problem go through two stages. First, we compute
$$A = Q T Q^T,$$
where $Q$ is an orthogonal matrix and $T$ is tridiagonal. Then we compute the eigenvalues and eigenvectors of the resulting tridiagonal matrix. The asymptotically fastest of the tridiagonal eigenvalue solvers (the “grail” code) takes $O(n^2)$ time to find all the eigenvalues and eigenvectors of the tridiagonal; but the cost of reducing $A$ to tridiagonal form is $O(n^3)$. When we call eig in MATLAB, or related solvers in other languages, this is the algorithm we use.
There are also several iterative methods for computing a few eigenvalues and eigenvectors of a symmetric matrix. The simplest methods are the power iteration and subspace iteration, which you may have seen in a previous class; unfortunately, these methods do not converge very quickly. The Lanczos method is the main workhorse of sparse eigenvalue solvers, and the method you will use if you call eigs in MATLAB, or related subroutines in other languages. Like power iteration and subspace iteration, the Lanczos method requires only matrix-vector multiplications with $A$, and so it can be very efficient whenever matrix-vector products are cheap (for example, when $A$ is sparse).
The Lanczos method is good at computing a few of the largest and smallest eigenvalues of a symmetric matrix (the extremal eigenvalues), together with the corresponding eigenvectors. It is not as good at computing interior eigenvalues without some help, usually in the form of a spectral transformation. But for many applications in data science, we are content to look at the extremal eigenvalues and vectors, so we will not discuss the more difficult interior case any further.
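For example, the following sketch uses SciPy's eigsh (a Lanczos-type ARPACK solver, the analogue of MATLAB's eigs) to find the dominant eigenpairs of a Gram matrix that is never formed explicitly, only applied to vectors. For smallest or interior eigenvalues, one would typically pass a shift sigma= to eigsh, which invokes the shift-invert spectral transformation mentioned above.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

rng = np.random.default_rng(0)
m, n, r = 5000, 300, 10
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n)) + 1e-2 * rng.standard_normal((m, n))

# Lanczos needs only matrix-vector products: we apply G = A^T A to a vector
# without ever forming the n-by-n Gram matrix.
G = LinearOperator((n, n), matvec=lambda x: A.T @ (A @ x), dtype=A.dtype)

vals, vecs = eigsh(G, k=r, which="LM")          # r dominant eigenpairs of A^T A
print(np.sqrt(np.sort(vals)[::-1])[:3])         # ~ the top singular values of A
print(np.linalg.svd(A, compute_uv=False)[:3])   # dense check
```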
The singular value decomposition
The singular value decomposition of $A \in \mathbb{R}^{m \times n}$ is the factorization $A = U \Sigma V^T$ introduced above; we write $u_j$ and $v_j$ for the columns of $U$ and $V$ (the left and right singular vectors) and $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$ for the diagonal entries of $\Sigma$ (the singular values).
Like the symmetric eigenvalue decomposition, the singular value decomposition can be viewed in terms of an associated optimization problem: the largest singular value satisfies
$$\sigma_1 = \max_{\|u\|_2 = \|v\|_2 = 1} u^T A v,$$
with the maximum attained at the leading singular vectors $u_1$ and $v_1$, and the subsequent singular triples solve the same problem restricted to the orthogonal complements of the previously computed singular vectors.
We say a function on matrices is a unitarily invariant² norm if $\|UAV\| = \|A\|$ whenever $U$ and $V$ are orthogonal; such a norm depends only on the singular values. A particularly useful family is the Ky-Fan norms $\|A\| = \left( \sum_j \sigma_j^p \right)^{1/p}$.
The operator 2-norm (or spectral norm) is $\|A\|_2 = \sigma_1$; this is the Ky-Fan norm for $p = \infty$.
The Frobenius norm satisfies $\|A\|_F = \left( \sum_j \sigma_j^2 \right)^{1/2}$; that is, $\|\cdot\|_F$ is the Ky-Fan norm for $p = 2$.
The nuclear norm satisfies $\|A\|_* = \sum_j \sigma_j$; that is, $\|\cdot\|_*$ is the Ky-Fan norm for $p = 1$. We will see more of this norm when we talk about matrix completion on Friday.
The Eckart-Young theorem tells us that the best rank-$k$ approximation to $A$ in the spectral norm or the Frobenius norm (indeed, in any unitarily invariant norm) is the truncated SVD
$$A_k = \sum_{j=1}^{k} \sigma_j u_j v_j^T,$$
for which $\|A - A_k\|_2 = \sigma_{k+1}$ and $\|A - A_k\|_F = \left( \sum_{j > k} \sigma_j^2 \right)^{1/2}$.
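A short numerical illustration of the norm identities and the Eckart-Young errors (a NumPy sketch on a random matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 40))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Unitarily invariant (Ky-Fan) norms in terms of the singular values
assert np.isclose(np.linalg.norm(A, 2), s[0])                       # spectral norm
assert np.isclose(np.linalg.norm(A, "fro"), np.sqrt(np.sum(s**2)))  # Frobenius norm
assert np.isclose(np.linalg.norm(A, "nuc"), np.sum(s))              # nuclear norm

# Eckart-Young: the truncated SVD is the best rank-k approximation
k = 10
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
assert np.isclose(np.linalg.norm(A - Ak, 2), s[k])
assert np.isclose(np.linalg.norm(A - Ak, "fro"), np.sqrt(np.sum(s[k:] ** 2)))
```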
The singular value decomposition goes by many names and has many close relatives. It is sometimes called the proper orthogonal decomposition of a data matrix; in statistics, it is the basis for principal component analysis; and in the study of stochastic processes, it is the Karhunen-Loève decomposition. But in some contexts, the SVD is not applied directly to our data matrix, but to a shifted or scaled version of it; for example, principal component analysis works with a matrix whose column means have been subtracted.
Pivoted QR and pivoted Cholesky
The SVD provides optimal low-rank approximations to data matrices, but those approximations are not particularly easy to interpret. We therefore turn now to low-rank factorizations in which the factors are formed from subsets of the columns or rows of the data matrix.
We described the pivoted QR decomposition briefly in our discussion of least squares. The idea in pivoted QR is to permute the columns of the data matrix so that the diagonal entries of $R$ in the QR factorization $A\Pi = QR$ of the pivoted matrix appear in descending order of magnitude, i.e.
$$|r_{11}| \geq |r_{22}| \geq \cdots \geq |r_{nn}|.$$
Though we have described this decomposition in terms of explicit orthogonalizations at each step (as in the Gram-Schmidt procedure), in practice we would usually use an alternate algorithm based on orthogonal transformations. However, the algorithm we have described is useful for reasoning about a special property of the algorithm: the columns it selects are all on the convex hull of the set of columns of $A$.
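A small SciPy sketch of these properties (the descending diagonal, and the use of the leading pivot columns as an approximate basis; the comparison against $\sigma_{k+1}$ is only a sanity check, not a guarantee):

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(4)
A = rng.standard_normal((100, 30))

Q, R, piv = qr(A, mode="economic", pivoting=True)
d = np.abs(np.diag(R))
print(bool(np.all(d[:-1] >= d[1:])))          # diagonal magnitudes are non-increasing

# Use the first k pivot columns as a basis and project A onto their span.
k = 10
C = A[:, piv[:k]]
coeffs = np.linalg.lstsq(C, A, rcond=None)[0]
err = np.linalg.norm(A - C @ coeffs, 2)
sigma = np.linalg.svd(A, compute_uv=False)
print(err, sigma[k])                          # err >= sigma_{k+1}, and usually not much larger
```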
Closely related to the pivoted QR algorithm is the pivoted Cholesky algorithm: for a positive semi-definite matrix such as $A^T A$, pivoted Cholesky computes $\Pi^T (A^T A) \Pi = R^T R$, where $\Pi$ and $R$ are the same permutation and (up to the signs of its rows) the same triangular factor that pivoted QR applied to $A$ would produce, but without ever forming the orthogonal factor $Q$. At each step, we choose as the pivot the largest remaining diagonal entry³ of the matrix.
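The following is a minimal sketch of the greedy algorithm (an illustrative implementation, not a library routine): at each step we bring the largest diagonal entry of the current Schur complement to the pivot position, perform one Cholesky step, and update. Applied to $A^T A$, it typically reproduces the pivots and (up to row signs) the triangular factor of a column-pivoted QR of $A$:

```python
import numpy as np
from scipy.linalg import qr

def pivoted_cholesky(M, tol=1e-12):
    """Greedy pivoted Cholesky of a symmetric positive semidefinite M.

    Returns an upper-triangular (or trapezoidal) R and a pivot order piv
    such that M[np.ix_(piv, piv)] ~= R.T @ R.
    """
    S = np.array(M, dtype=float)        # working copy; trailing block = Schur complement
    n = S.shape[0]
    piv = np.arange(n)
    R = np.zeros((n, n))
    for k in range(n):
        # Pivot: the largest remaining diagonal entry of the Schur complement.
        j = k + int(np.argmax(np.diag(S)[k:]))
        if S[j, j] <= tol:              # numerically rank-deficient: stop early
            return R[:k], piv
        # Symmetric swap of rows/columns k and j (and of the computed part of R).
        S[[k, j], :] = S[[j, k], :]
        S[:, [k, j]] = S[:, [j, k]]
        R[:, [k, j]] = R[:, [j, k]]
        piv[[k, j]] = piv[[j, k]]
        # One step of (outer-product) Cholesky, then update the Schur complement.
        R[k, k] = np.sqrt(S[k, k])
        R[k, k+1:] = S[k, k+1:] / R[k, k]
        S[k+1:, k+1:] -= np.outer(R[k, k+1:], R[k, k+1:])
    return R, piv

# If A Pi = Q R, then Pi^T (A^T A) Pi = R^T R, so pivoted Cholesky of A^T A
# recovers the same pivots and (up to row signs) the same R, without forming Q.
rng = np.random.default_rng(2)
A = rng.standard_normal((60, 25))
R_c, piv_c = pivoted_cholesky(A.T @ A)
_, R_q, piv_q = qr(A, mode="economic", pivoting=True)

print(np.allclose((A.T @ A)[np.ix_(piv_c, piv_c)], R_c.T @ R_c))            # factorization holds
print(np.array_equal(piv_c, piv_q), np.allclose(np.abs(R_c), np.abs(R_q)))  # usually both True
```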
Interpolative decomposition and CUR
An interpolative decomposition is a decomposition of the form
$$A \approx CT,$$
where $C$ consists of $k$ selected columns of $A$ and $T \in \mathbb{R}^{k \times n}$ has entries of modest magnitude; because the selected columns are reproduced exactly, $T$ contains a $k \times k$ identity submatrix.
In the interpolative decomposition, we choose a subset of the columns of $A$ to serve as an approximate basis for the column space; as noted above, these columns are often the ones a pivoted QR factorization would select. The CUR factorization goes a step further: we choose a subset $C$ of the columns and a subset $R$ of the rows of $A$, together with a small coupling matrix $U$, so that $A \approx CUR$.
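One common construction builds the ID directly from a pivoted QR factorization. The sketch below assumes $A$ is exactly rank $k$ and that $k$ is known in advance; with plain column-pivoted QR the entries of $T$ are usually modest, though the formal bound of two requires a strong rank-revealing QR.

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

rng = np.random.default_rng(3)
k = 12
A = rng.standard_normal((80, k)) @ rng.standard_normal((k, 40))   # (numerically) rank-k

# Column-pivoted QR selects k representative columns of A.
Q, R, piv = qr(A, mode="economic", pivoting=True)
C = A[:, piv[:k]]                       # the C in A ~= C T

# Interpolation coefficients: T = [I, R11^{-1} R12] in the pivoted column order.
T_piv = np.hstack([np.eye(k), solve_triangular(R[:k, :k], R[:k, k:])])
T = np.empty((k, A.shape[1]))
T[:, piv] = T_piv                       # undo the column permutation

print(np.linalg.norm(A - C @ T) / np.linalg.norm(A))  # ~ machine precision since rank(A) = k
print(np.max(np.abs(T)))                               # usually modest in size
```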
Footnotes
1. This is a bit of a fib in that all eigenvalue solvers are iterative. We say these methods are “direct” because we only need a constant number of iterations per eigenvalue to find the eigenvalues to nearly machine precision.
2. Unitarily invariant covers the complex case.
3. This is really the diagonal of the Schur complement, for those of you who may have seen Schur complements in a discussion of Gaussian elimination in an earlier class.