Introduction
Dimensionality reduction involves taking a data set $x_1, \ldots, x_n \in \mathbb{R}^d$ and finding corresponding vectors $y_1, \ldots, y_n$ in some lower-dimensional space $\mathbb{R}^k$ (with $k \ll d$) so as to preserve “important” properties of the data set. Natural desiderata for such a mapping might include:
Preserving pairwise distances (isometry)
Preserving angles locally (conformality)
Preserving neighborhood structure
Clustering points with similar class labels
A “nice to have” property is that there is a mapping that accomplishes this goal and is learned in a somewhat explicit form.
So far, we have implicitly studied several linear dimensionality reduction methods in which the mapping from $\mathbb{R}^d$ to $\mathbb{R}^k$ is a linear (or possibly affine) map. Examples include PCA and Fisher LDA embeddings, and we can also view some other matrix factorizations in a similar light. These methods are quite powerful, and are backed by standard linear algebra tools, at least for the standard loss functions we have used. There is much more to say; for instance, we have not discussed interesting topics like robust PCA, though we have seen many of the numerical methods that are ingredients for solving these problems.
However, sometimes the restriction to only using linear maps is quite severe. For this reason, in this lecture, we consider nonlinear dimensionality reduction techniques. We will return to the topic later in specific contexts (kernel PCA when we talk about kernel methods, Laplacian eigenmaps when we talk about graph-based dimensionality reduction and clustering). But for now, we introduce a handful of influential dimensionality reduction methods that we will not have as much time to discuss later. As in the rest of the class, our goal is partly to highlight the idea of the schemes, but just as much to highlight interesting numerical methods that go into their efficient implementation: Nystrom approximation, connections to the SVD, clever uses of sparsity, fast multipole approximations, and so forth.
PCA and multi-dimensional scaling
We start with the idea of multi-dimensional scaling (MDS), which attempts to construct coordinates $y_1, \ldots, y_n \in \mathbb{R}^k$ such that the distances $\|y_i - y_j\|$ approximate some known “dissimilarity measure” $d_{ij}$ between points $i$ and $j$. We observe that for $d_{ij} = \|x_i - x_j\|$ we can write the squared distance matrix as $D^{(2)} = c e^T + e c^T - 2 X^T X$, where $e$ is a vector of all ones, $X$ is the matrix with columns $x_j$, and $c_j = \|x_j\|^2$. Let $H = I - \frac{1}{n} e e^T$ denote a centering transformation, i.e. $H$ subtracts the mean of a vector from each component. Note that $He = 0$, and $\tilde{X} = XH$ is the matrix with columns $x_j - \bar{x}$, where $\bar{x}$ is the mean of the columns. Therefore, applying $H$ to $D^{(2)}$ on the left and right eliminates the rank-one terms; and scaling by $-1/2$ gives us $B = -\frac{1}{2} H D^{(2)} H = \tilde{X}^T \tilde{X}$.
Thus, $B$ is a Gram matrix for the centered data matrix, and we can recover $\tilde{X}$ (up to an orthogonal transformation) via the truncated eigendecomposition of $B$: if $B = Q \Lambda Q^T$, then $\tilde{X} = U \Lambda^{1/2} Q^T$ for some matrix $U$ with orthonormal columns, and $Y = \Lambda^{1/2} Q^T$ gives embedding coordinates that reproduce the distances. If we want a lower-dimensional embedding that optimally approximates the Gram matrix, we need only take fewer eigenvectors. Note that the eigenvectors and eigenvalues of $B$ are the right singular vectors and (squared) singular values of $\tilde{X}$. Thus, MDS when Euclidean distance is used for the dissimilarity is equivalent to PCA.
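To make the construction concrete, here is a minimal sketch of classical MDS from a matrix of pairwise Euclidean distances (Python/NumPy is an assumption of convenience; the function name and the random test data are illustrative).

```python
# A minimal sketch of classical MDS: center the squared distance matrix and
# take a truncated eigendecomposition of the resulting Gram matrix.
import numpy as np

def classical_mds(D, k):
    """Embed in R^k from an n-by-n matrix D of pairwise distances."""
    n = D.shape[0]
    D2 = D**2                                # squared distances
    H = np.eye(n) - np.ones((n, n)) / n      # centering transformation
    B = -0.5 * H @ D2 @ H                    # Gram matrix of centered data
    lam, Q = np.linalg.eigh(B)               # eigenvalues in ascending order
    lam, Q = lam[::-1], Q[:, ::-1]           # reorder to descending
    lam = np.maximum(lam[:k], 0.0)           # clip tiny negative eigenvalues
    return Q[:, :k] * np.sqrt(lam)           # row i is the embedding of point i

# Usage: recover 2D coordinates (up to rotation/reflection) from distances.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Y = classical_mds(D, 2)
```

The clipping of small negative eigenvalues guards against roundoff, and matters more when the dissimilarities are not exactly Euclidean.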
There are other forms of multi-dimensional scaling that use other objectives in fitting to a dissimilarity matrix. For example, metric MDS minimizes the “stress” $\sum_{i < j} (d_{ij} - \|y_i - y_j\|)^2$, and other forms of MDS (Sammon mapping, Sammon with Bregman divergences, ordinal MDS) use yet other measures.
Nystrom and landmarks
We have seen before that there are other ways to represent a low-rank matrix than with an SVD. In particular, we can construct a representation with a subset of rows and columns of the matrix. We have referred to this as the CUR decomposition in general; in the symmetric case, it is often known as the Nystrom approximation. That is, if $A$ is a square symmetric matrix with rank $r$, then there is a symmetric permutation of the rows and columns such that the leading $r \times r$ block $A_{11}$ is invertible and $A = C A_{11}^{-1} C^T$, where $C = \begin{bmatrix} A_{11} \\ A_{21} \end{bmatrix}$ consists of the first $r$ columns. Moreover, if we take the economy QR decomposition $C = QR$ and the small eigendecomposition $R A_{11}^{-1} R^T = V \Lambda V^T$, then we can decompose $A = Q (R A_{11}^{-1} R^T) Q^T$ and get the (truncated) eigendecomposition $A = (QV) \Lambda (QV)^T$.
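As a concrete illustration, the following sketch carries out the Nystrom-plus-QR construction above, assuming the rows and columns have already been permuted so that the leading block is invertible; variable names are illustrative.

```python
# A minimal sketch of the Nystrom eigendecomposition: A = C A11^{-1} C^T with
# C = A[:, :r] and A11 = A[:r, :r]; the QR factor of C reduces the problem to
# an r-by-r eigendecomposition.
import numpy as np

def nystrom_eig(C, A11):
    """Return (lam, U) with A ~= U diag(lam) U^T and U having orthonormal columns."""
    Q, R = np.linalg.qr(C)                   # economy QR of the column block
    S = R @ np.linalg.solve(A11, R.T)        # small r-by-r matrix R A11^{-1} R^T
    S = (S + S.T) / 2                        # symmetrize against roundoff
    lam, V = np.linalg.eigh(S)
    return lam, Q @ V                        # truncated eigendecomposition of A
```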
What does this have to do with MDS? Consider the case now where we use the Nystrom approximation of the squared distance matrix $D^{(2)}$, with the $m$ selected columns (permuted to the front) forming $C \in \mathbb{R}^{n \times m}$ and the leading $m \times m$ block denoted $A_{11}$: $D^{(2)} \approx C A_{11}^{-1} C^T$. Centering gives $B \approx \bar{C} M \bar{C}^T$, where $\bar{C} = HC$ and $M = -\frac{1}{2} A_{11}^{-1}$. Taking the QR decomposition $\bar{C} = QR$ allows us to approximate the largest eigenvectors as before, via the eigendecomposition of the small matrix $R M R^T$.
The selected columns in the Nystrom scheme correspond to “landmarks” in the multi-dimensional scaling setting. The idea is that we never need to know the distance between arbitrary points; we only need the distances between those points and a set of landmarks. If we choose $m \ll n$ landmarks, the time required for this method is $O(nm^2)$ rather than $O(n^3)$, and we only need to evaluate (and store) $nm$ distances.
The process also gives us an affine map that we can apply to embed new points in the lower-dimensional space after training; that is
Compute the vector $\delta \in \mathbb{R}^m$ of squared distances from the new point to each landmark, and let $\tilde{\delta}$ be the “centered” version that comes from subtracting the mean squared distances to the landmarks among the original training data.
The $k$-dimensional embedding vector is now given by $y = -\frac{1}{2} \Lambda_k^{-1/2} V_k^T \tilde{\delta}$, where $V_k \Lambda_k V_k^T$ is the truncated eigendecomposition used to embed the training data (see the sketch below).
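Here is a minimal sketch of this out-of-sample map in its landmark-only form (in the style of de Silva and Tenenbaum's landmark MDS): the landmarks are embedded by classical MDS on their own squared-distance matrix, and a new point is then placed using only its squared distances to the landmarks. Function names are illustrative, and details such as which mean is subtracted vary slightly between presentations.

```python
# A minimal sketch of landmark MDS with an out-of-sample embedding map.
import numpy as np

def landmark_mds_fit(D2_land, k):
    """D2_land: m-by-m squared distances among the landmarks."""
    m = D2_land.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    B = -0.5 * H @ D2_land @ H               # centered landmark Gram matrix
    lam, V = np.linalg.eigh(B)
    lam, V = lam[::-1][:k], V[:, ::-1][:, :k]
    lam = np.maximum(lam, 1e-12)             # guard against roundoff
    Y_land = V * np.sqrt(lam)                # k-dim landmark coordinates (rows)
    V_scaled = V / np.sqrt(lam)              # V Lambda^{-1/2}, used for new points
    mean_d2 = D2_land.mean(axis=0)           # mean squared distance to each landmark
    return Y_land, V_scaled, mean_d2

def landmark_mds_embed(d2_new, V_scaled, mean_d2):
    """d2_new: squared distances from a new point to the m landmarks."""
    return -0.5 * V_scaled.T @ (d2_new - mean_d2)
```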
Isomap
Preserving geometry does not always mean preserving distances in Euclidean space. The Isomap algorithm is intended for the case when the data points lie close to some lower-dimensional manifold embedded in a high-dimensional Euclidean space. Rather than looking at Euclidean distances between points, Isomap looks at a matrix of (approximate) geodesic distances for the manifold. Unfortunately, because the manifold is implicit, the geodesic distances cannot be computed directly, but are approximated by shortest paths through a network connecting nearest neighbors, either taking a fixed number of nearest neighbors per node ($k$-Isomap) or taking all neighbors within some radius $\epsilon$ of each node ($\epsilon$-Isomap).
There are many clever ideas for finding nearest neighbors, which we will not go into here. The brute force approach requires $O(n^2 d)$ time. Once the nearest neighbor graph is computed, however, we must compute all shortest paths through the graph, a task that generally takes $O(n^3)$ time using the Floyd-Warshall algorithm. Computing a full eigendecomposition of the resulting (centered and squared) distance matrix takes another $O(n^3)$ time. Fortunately, the landmark approach for MDS applies equally well here, and requires only that we find the distances from $m$ landmarks to all other nodes ($O(m(|E| + n \log n))$ time via Dijkstra's algorithm), after which our linear algebra costs are $O(nm^2)$.
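As a sketch of how the pieces fit together computationally, the following builds a k-nearest-neighbor graph and computes shortest-path (approximate geodesic) distances from a random set of landmarks; the use of scikit-learn for the neighbor graph, and the data and parameter choices, are illustrative assumptions.

```python
# A minimal sketch of the landmark Isomap distance computation.
import numpy as np
from sklearn.neighbors import kneighbors_graph   # assumption: sklearn available
from scipy.sparse.csgraph import dijkstra

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 10))              # placeholder data
G = kneighbors_graph(X, n_neighbors=10, mode='distance')  # sparse k-NN graph

m = 50                                           # number of landmarks
landmarks = rng.choice(X.shape[0], size=m, replace=False)
# Shortest-path distances from each landmark to every node (m-by-n matrix).
D_land = dijkstra(G, directed=False, indices=landmarks)
# From here, square and center these distances and proceed with the landmark
# MDS machinery sketched above.
```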
The Isomap algorithm is a beautiful and useful idea, but with some important limitations. Inherent in the setup is the idea that there is a good embedding of the manifold in a low-dimensional Euclidean space; but the existence of such an embedding depends on the topology of the manifold. The sphere surface is a two-dimensional manifold that is easily represented in $\mathbb{R}^3$, for example, but cannot be continuously embedded into $\mathbb{R}^2$. Isomap may also do poorly on manifolds with high Gaussian curvature, or on manifolds that are not very densely sampled.
LLE
A popular alternative to Isomap is the locally linear embedding (LLE). Similar to Isomap, the LLE algorithm starts by building a graph between points in the data set and nearby neighbors. For each node $i$, one seeks to find weights $w_{ij}$ to approximately reconstruct $x_i$ from the neighbors by minimizing $\|x_i - \sum_{j \in \mathcal{N}(i)} w_{ij} x_j\|^2$ subject to the constraint $\sum_j w_{ij} = 1$ (necessary for translational invariance). Computationally, this step involves solving a large number of independent small equality-constrained least squares problems, one for each node. But it can equivalently be seen as finding a weight matrix $W$ with a given sparsity pattern that minimizes $\|X - XW^T\|_F^2$ subject to $We = e$.
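For concreteness, here is a minimal sketch of the per-node weight solve via the local Gram matrix, which is one standard way to handle the sum-to-one constraint; the regularization parameter and function name are illustrative.

```python
# A minimal sketch of the LLE weight solve for a single node.
import numpy as np

def lle_weights(xi, neighbors, reg=1e-3):
    """Weights reconstructing xi from its neighbors (rows of `neighbors`),
    constrained to sum to one."""
    Z = xi - neighbors                           # shift so xi sits at the origin
    G = Z @ Z.T                                  # local Gram matrix
    G += reg * np.trace(G) * np.eye(G.shape[0])  # regularize for conditioning
    w = np.linalg.solve(G, np.ones(G.shape[0]))  # solve G w = 1 (Lagrange condition)
    return w / w.sum()                           # rescale to satisfy the constraint
```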
Once the weights have been computed, one seeks a matrix $Y \in \mathbb{R}^{k \times n}$ of embedding coordinates to minimize $\|Y - YW^T\|_F^2$ subject to the constraints that $\frac{1}{n} YY^T = I$ and $Ye = 0$ (i.e. the coordinate system is centered). This gives that $Y^T$ is $\sqrt{n}$ times the matrix of right singular vectors associated with the smallest singular values of $I - W$, apart from the null vector $e/\sqrt{n}$. Because the weight matrix is sparse, these smallest singular values can be computed using standard sparse eigensolver iterations such as the Lanczos method.
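The embedding step can then be sketched as follows, using a sparse shift-invert eigensolver to find the eigenvectors of $(I - W)^T (I - W)$ with smallest eigenvalues; the use of scipy and the shift-invert strategy (rather than a plain Lanczos run) are implementation assumptions.

```python
# A minimal sketch of the LLE embedding step. W is assumed to be a sparse
# (scipy.sparse) matrix whose rows hold the reconstruction weights.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def lle_embed(W, k):
    n = W.shape[0]
    A = sp.eye(n) - W                            # I - W, still sparse
    M = (A.T @ A).tocsc()                        # (I - W)^T (I - W)
    # Shift-invert about zero targets the smallest eigenvalues; if the
    # factorization complains about singularity, a tiny positive shift helps.
    vals, vecs = spla.eigsh(M, k=k + 1, sigma=0, which='LM')
    order = np.argsort(vals)
    Y = vecs[:, order[1:k + 1]]                  # drop the constant null vector
    return Y * np.sqrt(n)                        # scale so (1/n) Y^T Y = I
```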
t-SNE
The $t$-distributed Stochastic Neighbor Embedding ($t$-SNE) is a popular method for mapping high-dimensional data to very low (usually 2) dimensional representations for visualization. Given a data point $x_i$, one defines a conditional probability $p_{j|i}$ that $x_j$ would appear as its “neighbor,” proportional to $\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)$, with the diagonal probabilities $p_{i|i}$ set to zero. To simplify things somewhat, we work with the symmetrized probabilities $p_{ij} = (p_{j|i} + p_{i|j})/(2n)$. In the lower-dimensional space, one makes a similar construction, but with a heavier-tailed $t$-distribution: $q_{ij} \propto (1 + \|y_i - y_j\|^2)^{-1}$, normalized so that $\sum_{i \neq j} q_{ij} = 1$. The embedding is then chosen so that the KL-divergence $\mathrm{KL}(P \| Q) = \sum_{i \neq j} p_{ij} \log(p_{ij}/q_{ij})$, which measures the distance between the joint distribution $P$ in the high-dimensional space and the joint distribution $Q$ in the low-dimensional space, is as small as possible.
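As a concrete (if deliberately naive) rendering of these definitions, the following sketch computes the symmetrized $p_{ij}$, the Student-$t$ $q_{ij}$, and the KL objective, with the bandwidths $\sigma_i$ fixed to a single hand-chosen value rather than calibrated to a target perplexity as a real implementation would do.

```python
# A minimal sketch of the t-SNE objective (no perplexity calibration).
import numpy as np

def tsne_objective(X, Y, sigma=1.0):
    n = X.shape[0]
    D2x = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    P_cond = np.exp(-D2x / (2 * sigma**2))
    np.fill_diagonal(P_cond, 0.0)                # p_{i|i} = 0
    P_cond /= P_cond.sum(axis=1, keepdims=True)  # conditional p_{j|i}
    P = (P_cond + P_cond.T) / (2 * n)            # symmetrized p_{ij}

    D2y = np.sum((Y[:, None, :] - Y[None, :, :])**2, axis=-1)
    Q = 1.0 / (1.0 + D2y)                        # Student-t kernel
    np.fill_diagonal(Q, 0.0)
    Q /= Q.sum()                                 # joint q_{ij}

    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))   # KL(P || Q)
```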
Though the t-SNE paper starts from the stochastic motivation associated with the earlier SNE method (like t-SNE, but using a Gaussian distribution in the low-dimensional space as well), the authors describe some of the motivation in very mechanical terms. The original SNE method was subject to the points crowding too close together, and they note that t-SNE has a “repulsive force” term that pushes them apart. The force interpretation comes from treating the KL-divergence as a potential energy, in which case the gradients are naturally seen as forces: $\frac{\partial \mathrm{KL}}{\partial y_i} = 4 \sum_{j \neq i} (p_{ij} - q_{ij}) q_{ij} Z (y_i - y_j)$ (where here $Z$ is the normalization term $\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}$). We can split this into a short-range attractive piece $4 \sum_{j \neq i} p_{ij} q_{ij} Z (y_i - y_j)$, which only involves a local neighborhood because of the rapid decay of the Gaussian in $p_{ij}$; and a long-range repulsive piece $-4 \sum_{j \neq i} q_{ij}^2 Z (y_i - y_j)$. Though we cannot take advantage of sparsity in this long-range repulsion term, we can take advantage of smoothness; that is, the term $q_{ij}^2 Z (y_i - y_j)$ varies slowly as a function of $y_j$ when $\|y_i - y_j\|$ is large. Thus, we can “clump together” all the terms that are sufficiently far from a region in space. This is the same as the idea that we can treat the gravity of the sun as a point mass from the perspective of far-away bodies like the planets, and we can similarly approximate the gravitational pull of distant galaxies as a single pull from a (tremendously large but tremendously distant) center of mass. This is the idea underlying tree codes like the famous Barnes-Hut algorithm. With more careful control of the error, one gets the fast multipole method, which can also be used in this setting. The same idea appears in fast algorithms for kernel interpolation and Gaussian process regression in low-dimensional spaces, as we will discuss in a few weeks.
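The following sketch evaluates the exact gradient with the attractive/repulsive split made explicit and no tree-code acceleration; it is meant only to show where the $O(n^2)$ long-range sum that Barnes-Hut-style methods approximate comes from. `P` is the symmetrized probability matrix from the previous sketch.

```python
# A minimal sketch of the exact t-SNE gradient, split into attractive and
# repulsive pieces (quadratic cost; a tree code would approximate the latter).
import numpy as np

def tsne_grad(P, Y):
    D2y = np.sum((Y[:, None, :] - Y[None, :, :])**2, axis=-1)
    K = 1.0 / (1.0 + D2y)                        # equals q_{ij} * Z
    np.fill_diagonal(K, 0.0)
    Z = K.sum()
    Q = K / Z
    diff = Y[:, None, :] - Y[None, :, :]         # pairwise y_i - y_j
    attract = 4 * np.sum((P * K)[:, :, None] * diff, axis=1)
    repulse = -4 * np.sum((Q * Q * Z)[:, :, None] * diff, axis=1)
    return attract + repulse
```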
Autoencoders
We will say relatively little about neural networks in this class, but it is worth saying at least a few words about the idea of an autoencoder. As with many things involving neural networks, autoencoders seem to work better in practice than we know how to argue that they ought to.
The idea behind an autoencoder is that one seeks to train a neural network with a “bottleneck” to approximate the identity. That is, we seek weight vectors $\theta_\phi$ and $\theta_\psi$ for two halves $\phi$ and $\psi$ of a feed-forward neural network so that $\psi(\phi(x)) \approx x$, where $\phi$ maps from the high-dimensional input space to the outputs of a low-dimensional “bottleneck” layer, and $\psi$ maps back up to the high-dimensional space. The weights are trained by stochastic gradient descent on a loss function such as $\sum_i \|x_i - \psi(\phi(x_i))\|^2$. Variational autoencoders output not just an intermediate feature vector, but the parameters for an intermediate feature distribution (e.g. means and variances on each parameter). The gain for this additional complexity is a tendency to favor smoother mappings.
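As a small illustration, here is a sketch of a plain (non-variational) autoencoder with a $k$-dimensional bottleneck written with PyTorch; the choice of framework, layer sizes, and training hyperparameters are all assumptions of convenience.

```python
# A minimal sketch of an autoencoder trained to approximate the identity.
import torch
import torch.nn as nn

d, k = 784, 2                                    # input and bottleneck dimensions
encoder = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, k))
decoder = nn.Sequential(nn.Linear(k, 128), nn.ReLU(), nn.Linear(128, d))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()),
                       lr=1e-3)
loss_fn = nn.MSELoss()                           # squared reconstruction error

X = torch.randn(1024, d)                         # placeholder data
for epoch in range(10):
    for batch in torch.split(X, 64):
        opt.zero_grad()
        loss = loss_fn(decoder(encoder(batch)), batch)   # ||x - psi(phi(x))||^2
        loss.backward()
        opt.step()
```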