Bias-variance tradeoffs and regularization
At the end of the last class, we talked about generalization error for least squares in the presence of noise. In particular, we saw that if there is a true linear model $Ax = b$ and we compute an estimate $\hat{x}$ involving a subset of the data with noisy measurements ($\hat{b} = \hat{A} x + e$), then
$$\|\hat{x} - x\| \le \|\hat{A}^\dagger\| \|e\|.$$
We cannot generally overcome the effects of the measurement error, but we can at least hope not to amplify them. The error could be large if the noise $\|e\|$ is large, but a more common problem is that the amplification factor $\|\hat{A}^\dagger\| = 1/\sigma_{\min}(\hat{A})$ is large. In numerical analysis, we might call this a problem with ill-conditioning; in statistics, we might refer to this as an issue with high variance in the estimator.
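As a quick illustration of the amplification factor at work, here is a minimal numpy sketch (not from the original discussion; the matrix, noise level, and coefficients are made up) comparing the least squares error to the bound $\|e\|/\sigma_{\min}(\hat{A})$ when two columns are nearly dependent:

    import numpy as np

    rng = np.random.default_rng(0)
    m = 100
    t = rng.standard_normal(m)
    # Factor matrix with two nearly linearly dependent columns.
    Ahat = np.column_stack([t, t + 1e-6 * rng.standard_normal(m),
                            rng.standard_normal(m)])
    x_true = np.array([1.0, 2.0, 3.0])
    e = 1e-4 * rng.standard_normal(m)              # small measurement noise
    bhat = Ahat @ x_true + e

    xhat = np.linalg.lstsq(Ahat, bhat, rcond=None)[0]
    sigma_min = np.linalg.svd(Ahat, compute_uv=False)[-1]
    print("error in xhat       :", np.linalg.norm(xhat - x_true))
    print("bound ||e||/sigma_min:", np.linalg.norm(e) / sigma_min)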
More generally, suppose we have $y = f(x) + \epsilon$, where $\epsilon$ is a noise term with mean zero and variance $\sigma^2$, and we use data $\{(x_i, y_i)\}$ to fit an estimator function $\hat{f}$. Then for an unseen point $x$,
$$E[(y - \hat{f}(x))^2] = \left(f(x) - E[\hat{f}(x)]\right)^2 + \operatorname{Var}[\hat{f}(x)] + \sigma^2;$$
that is, the squared error involves the squared bias (caused by inability of the model to fit the function even with perfect data); the variance (associated with stability of the estimation procedure under perturbations from noise); and a term associated with measurement error.
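To see where this decomposition comes from (a short derivation, assuming the noise $\epsilon$ at the new point is independent of the fitted $\hat{f}$), expand the square and note that cross terms involving a mean-zero independent quantity vanish in expectation:
$$E[(y - \hat{f}(x))^2] = E[(f(x) - \hat{f}(x) + \epsilon)^2] = E[(f(x) - \hat{f}(x))^2] + \sigma^2,$$
and then, writing $f(x) - \hat{f}(x) = \bigl(f(x) - E[\hat{f}(x)]\bigr) + \bigl(E[\hat{f}(x)] - \hat{f}(x)\bigr)$,
$$E[(f(x) - \hat{f}(x))^2] = \left(f(x) - E[\hat{f}(x)]\right)^2 + \operatorname{Var}[\hat{f}(x)].$$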
If we have an ill-conditioned fitting problem (or a high variance estimator, if you prefer), what should we do? There are three options:
Get more data.
Reduce the noise.
Change the fitting problem to reduce the ill-conditioning.
Getting more data or reducing the noise is not always practical or economical, and in any event is an issue unrelated to numerical methods. Thus, we will turn now to the last approach: regularizing the problem to reduce the ill-conditioning, possibly at the expense of a small bias.
Factor selection and pivoted QR
In ill-conditioned problems, the columns of $A$ are nearly linearly dependent; we can effectively predict some columns as linear combinations of other columns. The goal of the column pivoted QR algorithm is to find a set of columns that are “as linearly independent as possible.” This is not such a simple task, and so we settle for a greedy strategy: at each step, we select the column that is least well predicted (in the sense of residual norm) by the columns already selected. This leads to the pivoted QR factorization
$$A \Pi = Q R,$$
where $\Pi$ is a permutation and the diagonal entries of $R$ appear in descending magnitude order (i.e. $|r_{11}| \ge |r_{22}| \ge \cdots$). To decide how many factors to keep in the factorization, we either automatically take the first $k$, or we dynamically choose to take $k$ factors, where $|r_{kk}|$ is greater than some tolerance and $|r_{k+1,k+1}|$ is not.
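Here is a minimal sketch of this selection step using scipy's column-pivoted QR (the tolerance, the helper name `select_factors`, and the usage lines are illustrative, not part of the original notes):

    import numpy as np
    from scipy.linalg import qr

    def select_factors(A, tol=1e-8):
        """Greedy factor selection via column-pivoted QR.

        Returns indices of selected columns of A, ordered so that each new
        column is the one least well predicted by the columns before it.
        """
        Q, R, perm = qr(A, mode='economic', pivoting=True)
        rdiag = np.abs(np.diag(R))
        # Keep factors while the diagonal of R stays above tol * |r_11|.
        k = int(np.sum(rdiag > tol * rdiag[0]))
        return perm[:k]

    # Example usage: fit b using only the selected columns of A.
    # cols = select_factors(A, tol=1e-6)
    # coeffs = np.linalg.lstsq(A[:, cols], b, rcond=None)[0]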
The pivoted QR approach has a few advantages. It yields parsimonious models that predict from a subset of the columns of $A$ – that is, we need to measure fewer than $n$ factors to predict a new entry of $b$. It can also be computed relatively cheaply, even for large matrices that may be sparse.
Tikhonov
A second approach is to say that we want a model in which the coefficients are not too large. To accomplish this, we add a penalty term to the usual least squares problem:
$$\text{minimize } \|Ax - b\|^2 + \lambda^2 \|x\|^2.$$
Equivalently, we can write
$$\text{minimize } \left\| \begin{bmatrix} A \\ \lambda I \end{bmatrix} x - \begin{bmatrix} b \\ 0 \end{bmatrix} \right\|^2,$$
which leads to the regularized version of the normal equations
$$(A^T A + \lambda^2 I) x = A^T b.$$
In some cases, we may want to regularize with a more general norm $\|x\|_M^2 = x^T M x$, where $M$ is symmetric and positive definite, which leads to the regularized equations
$$(A^T A + \lambda^2 M) x = A^T b.$$
If we know of no particular problem structure in advance, the standard choice of $M = I$ is a good default.
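In code, the stacked least squares formulation is a one-liner; here is a minimal numpy sketch (the function name and the choice of $\lambda$ are my own, for illustration):

    import numpy as np

    def tikhonov_lstsq(A, b, lam):
        """Solve min ||Ax-b||^2 + lam^2 ||x||^2 via the stacked formulation."""
        m, n = A.shape
        A_aug = np.vstack([A, lam * np.eye(n)])
        b_aug = np.concatenate([b, np.zeros(n)])
        return np.linalg.lstsq(A_aug, b_aug, rcond=None)[0]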
It is useful to compare the usual least squares solution to the regularized solution via the SVD. If $A = U \Sigma V^T$ is the economy SVD, then
$$x_{\mathrm{LS}} = V \Sigma^{-1} U^T b \quad \text{and} \quad x_{\mathrm{Tik}} = V \tilde{\Sigma}^{-1} U^T b,$$
where
$$\tilde{\sigma}_j^{-1} = \frac{\sigma_j}{\sigma_j^2 + \lambda^2}.$$
This filter of the inverse singular values affects the larger singular values only slightly, but damps the effect of very small singular values.
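The same filter is easy to apply directly from the economy SVD; a brief numpy sketch under the same assumptions:

    import numpy as np

    def tikhonov_svd(A, b, lam):
        """Tikhonov solution x = V diag(s/(s^2+lam^2)) U^T b via the economy SVD."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        filt = s / (s**2 + lam**2)      # filtered inverse singular values
        return Vt.T @ (filt * (U.T @ b))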
Truncated SVD
The Tikhonov filter reduces the effect of small singular values on the solution, but it does not eliminate that effect. By contrast, the truncated SVD approach uses the filter
$$\tilde{\sigma}_j^{-1} = \begin{cases} \sigma_j^{-1}, & j \le k, \\ 0, & \text{otherwise}. \end{cases}$$
In other words, in the truncated SVD approach, we use
$$x_{\mathrm{TSVD}} = V_k \Sigma_k^{-1} U_k^T b,$$
where $U_k$ and $V_k$ represent the leading $k$ columns of $U$ and $V$, respectively, while $\Sigma_k$ is the diagonal matrix consisting of the $k$ largest singular values.
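A corresponding numpy sketch (how one chooses $k$, whether by a fixed count or a relative tolerance on the singular values, is left to the user):

    import numpy as np

    def truncated_svd_solve(A, b, k):
        """Least squares solution using only the k largest singular triplets."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        return Vt[:k].T @ ((U[:, :k].T @ b) / s[:k])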
$\ell^1$ and the lasso
An alternative to Tikhonov regularization (based on a Euclidean norm of the coefficient vector) is an $\ell^1$ regularized problem
$$\text{minimize } \|Ax - b\|^2 + \lambda \|x\|_1.$$
This is sometimes known as the “lasso” approach. The $\ell^1$ regularized problem has the property that the solutions tend to become sparse as $\lambda$ becomes larger. That is, the $\ell^1$ regularization effectively imposes a factor selection process like the one we saw in the pivoted QR approach. Unlike the pivoted QR approach, however, the $\ell^1$ regularized solution cannot be computed by one of the standard factorizations of numerical linear algebra.
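Although standard factorizations do not apply, simple iterations do. Here is a minimal proximal-gradient (ISTA) sketch, which minimizes $\frac{1}{2}\|Ax-b\|^2 + \lambda\|x\|_1$ (the same problem up to a rescaling of $\lambda$); the iteration count is illustrative, and in practice one would use a polished solver:

    import numpy as np

    def lasso_ista(A, b, lam, niter=500):
        """Minimize 0.5*||Ax-b||^2 + lam*||x||_1 by iterative soft thresholding."""
        step = 1.0 / np.linalg.norm(A, 2)**2       # 1/L for the smooth part
        x = np.zeros(A.shape[1])
        for _ in range(niter):
            g = A.T @ (A @ x - b)                  # gradient of the smooth term
            z = x - step * g
            x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
        return x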
Tradeoffs and tactics
All four of the regularization approaches we have described are used in practice, and each has something to recommend it. The pivoted QR approach is relatively inexpensive, and it results in a model that depends on only a few factors. If taking the measurements to compute a prediction costs money — or even costs storage or bandwidth for the factor data! — such a model may be to our advantage. The Tikhonov approach is likewise inexpensive, and has a nice Bayesian interpretation (though we didn’t talk about it). The truncated SVD approach involves the best rank-$k$ approximation to the original factor matrix, and can be interpreted as finding the $k$ best factors that are linear combinations of the original measurements. The $\ell^1$ approach again produces models with sparse coefficients; but unlike QR with column pivoting, the $\ell^1$ regularized solutions incorporate information about the vector $b$ along with the matrix $A$.
So which regularization approach should one use? In terms of prediction quality, all can provide a reasonable deterrent against ill-posedness and overfitting due to highly correlated factors. Also, all of the methods described have a parameter (the number of retained factors $k$, or a penalty parameter $\lambda$) that governs the tradeoff between how well-conditioned the fitting problem will be and the increase in bias that naturally comes from looking at a smaller class of models. Choosing this tradeoff intelligently may be rather more important than the specific choice of regularization strategy. A detailed discussion of how to make this tradeoff is beyond the scope of the class; but we will see some of the computational tricks involved in implementing specific strategies for choosing regularization parameters before we are done.