CS 6241: Numerics for Data Science
Graph clustering and partitioning
From semi-supervised to unsupervised
In the last lecture, we discussed the use of graphs for semi-supervised learning tasks in which we are given labels on some nodes and want to infer labels for the remaining nodes. In this lecture, we turn to the unsupervised task of interpreting the graph structure without the benefit of auxiliary data such as class labels. Specifically, we consider the closely related tasks of graph partitioning, graph clustering, graph embedding, and nonlinear manifold learning.
Graph bisection
We begin with a discussion of spectral graph partitioning. In the simplest setting, we are given an unweighted graph, and we want to bisect the graph: that is, we seek a partition of the nodes into two equal-size sets such that as few edges as possible go between the two sets.
For data analysis, we might want to assign one of two class labels to every graph node based on a similarity measure encoded through the graph. However, we want to rule out the trivial solution where all nodes are assigned to the same class. In the semi-supervised problem from last class, we avoided this case with the help of labeled training examples. In the unsupervised setting, we use class size constraints toward the same end.
As we saw at the end of last lecture, in nested dissection ordering methods for solving linear systems, we want to recursively find small vertex separators that partition the graph into roughly equal size pieces. Of course, we have posed the problem in terms of finding good edge separators for a graph, but some of the same ideas apply to both problems. Also, we can construct vertex separators from edge separators by taking an endpoint for each cut edge.
In static load balancing problems in parallel computing, we sometimes want to distribute work (represented by nodes) to two processors in an equitable way while ensuring that the interprocessor communication (represented by cut edges) is as small as possible.
In each of these cases, we may ultimately want more than two pieces, or we may want a more refined notion of the partition quality than the one we have considered. But this is a good starting point nonetheless.
We can encode the basic problem in matrix terms via the graph Laplacian. Let $L = D - A$, where $A$ is the adjacency matrix and $D$ is the diagonal matrix of node degrees, and let $x \in \{+1,-1\}^n$ indicate which side of the partition each node belongs to. Then the number of cut edges is $\frac{1}{4} x^T L x$, and the balance condition is $e^T x = 0$, where $e$ is the vector of all ones. Minimizing the cut over exactly balanced $\pm 1$ vectors is combinatorially hard, but if we relax $x$ to a real vector with $\|x\|^2 = n$ and $e^T x = 0$, the minimizer is the eigenvector of $L$ associated with the second-smallest eigenvalue, known as the Fiedler vector.
The basic strategy for spectral graph partitioning then is:
1. Approximately compute the Fiedler vector $v$ using the Lanczos iteration, which requires only matrix-vector products with $L$.
2. Partition based on the signs of the elements of $v$.
3. (Optionally) refine with a local algorithm such as Kernighan-Lin.
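To make steps 1 and 2 concrete, here is a minimal sketch in Python (the function name and the assumption that A is a symmetric SciPy sparse adjacency matrix are mine); scipy.sparse.linalg.eigsh wraps the Lanczos-based ARPACK solver.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def spectral_bisect(A):
    """Bisect a graph by the sign of the Fiedler vector.

    A: symmetric sparse adjacency matrix with a single connected component.
    Returns a boolean mask marking one side of the partition.
    """
    d = np.asarray(A.sum(axis=1)).ravel()   # node degrees
    L = sp.diags(d) - A                     # graph Laplacian L = D - A
    # Lanczos for the two smallest eigenpairs; only needs products with L.
    # (Shift-invert with a small negative shift is faster for large graphs.)
    vals, vecs = spla.eigsh(L, k=2, which="SM")
    fiedler = vecs[:, np.argmax(vals)]      # eigenvector of the second-smallest eigenvalue
    return fiedler > 0                      # sign split; threshold at the median for an exact bisection
```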
Spectral methods are not the only way; methods like multi-level Kernighan-Lin are also popular. But spectral partitioning is frequently used, it works well, and it fits well with the overall narrative of these lectures!
Weights and normalization
So far, we have covered the case of spectral methods for bisection of unweighted graphs. But what if there are node and edge weights? That is, suppose we want a partition that minimizes a weighted edge cut subject to the constraint that the node weights sum to the same total in each half. Let $L$ now denote the weighted graph Laplacian, with off-diagonal entries $l_{ij} = -w_{ij}$ and diagonal entries chosen so that each row sums to zero, and let $c$ be the vector of node weights; with $x \in \{+1,-1\}^n$ indicating the partition, we want to minimize $\frac{1}{4} x^T L x$ subject to $c^T x = 0$, and we relax to real $x$ as before.
Normalized cuts
Another variant on the same idea uses the quadratic form $x^T L x$ together with the degree-weighted constraints $d^T x = 0$ and $x^T D x = \text{const}$, where $D$ is the diagonal matrix of weighted node degrees. The relaxation of this problem is the generalized eigenvalue problem $Lx = \lambda Dx$, and splitting on the sign of the second generalized eigenvector approximately minimizes the normalized cut, which measures the weight of the cut edges relative to the volume (total degree) of each side.
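As a sketch of how this can be computed (function name and sparse-matrix input are my assumptions): substituting $x = D^{-1/2} y$ turns $Lx = \lambda Dx$ into a standard symmetric eigenproblem with the normalized Laplacian $D^{-1/2} L D^{-1/2}$, which we can again hand to a Lanczos solver.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def normalized_cut_bisect(A):
    """Bisect a weighted graph using the normalized-cuts relaxation."""
    d = np.asarray(A.sum(axis=1)).ravel()                  # weighted node degrees
    Dinv_half = sp.diags(1.0 / np.sqrt(d))
    Lhat = sp.eye(A.shape[0]) - Dinv_half @ A @ Dinv_half  # normalized Laplacian D^{-1/2} L D^{-1/2}
    vals, vecs = spla.eigsh(Lhat, k=2, which="SM")
    y = vecs[:, np.argmax(vals)]                           # second-smallest eigenvector of Lhat
    x = Dinv_half @ y                                      # generalized eigenvector of (L, D)
    return x > 0
```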
Modularity maximization
As another example, we consider the problem of finding communities with high “modularity,” defined to be the number of internal edges minus the number that we would expect in a “configuration model” that reflects the node degrees but otherwise assigns edges randomly. In matrix terms, if $A$ is the adjacency matrix, $d$ is the vector of node degrees, and $2m = \sum_i d_i$ is twice the number of edges, then the modularity matrix is $B = A - dd^T/(2m)$, and the modularity of a partition indicated by $x \in \{+1,-1\}^n$ is (up to scaling) $x^T B x$. The relaxed maximization problem is solved by the eigenvector of $B$ associated with the largest eigenvalue, and we can again partition by sign.
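A hedged sketch of this spectral modularity bisection (dense arithmetic for brevity; in practice one keeps $B$ implicit, since it is a sparse matrix plus a rank-one correction):

```python
import numpy as np

def modularity_bisect(A):
    """Split nodes by the sign of the leading eigenvector of the modularity matrix.

    A: dense symmetric adjacency matrix (NumPy array).
    """
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)                   # node degrees
    two_m = d.sum()                     # 2m = twice the edge count
    B = A - np.outer(d, d) / two_m      # modularity matrix B = A - d d^T / (2m)
    vals, vecs = np.linalg.eigh(B)      # B is symmetric
    v = vecs[:, np.argmax(vals)]        # eigenvector for the largest eigenvalue
    return v > 0
```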
Mixing in random walks
So far, we have focused on spectral methods motivated by optimization, but this is not the only way to arrive at these approaches. Eigenvalues and eigenvectors are also good for analyzing dynamics on networks, e.g. when thinking about the convergence of random walks.
A (discrete time) Markov chain is a time-indexed set of random variables $X_0, X_1, X_2, \ldots$ with the property that the distribution of $X_{t+1}$ depends only on the value of $X_t$ (the Markov property). For a finite state space, we collect the transition probabilities into a matrix $P$ with entries $p_{ij} = \Pr\{X_{t+1} = i \mid X_t = j\}$, so that if $\pi_t$ denotes the distribution of $X_t$ as a column vector, then $\pi_{t+1} = P \pi_t$.
We often think of discrete time Markov chains over a finite state space in terms of random walks on an associated graph. If $P$ is the transition matrix, the associated directed graph has the states as nodes and an edge from $j$ to $i$ whenever $p_{ij} > 0$; one step of the chain corresponds to a random step along an outgoing edge. Several properties of the chain can be read off from this graph:
- We say $i$ accesses $j$ if there is a path from $i$ to $j$ through the graph.
- We say $i$ and $j$ communicate if $i$ accesses $j$ and vice-versa.
- We call the Markov chain irreducible if all nodes communicate.
- The period of node $i$ is the GCD of the lengths of all closed paths beginning and ending at $i$. If there are no such paths, we say that the period of $i$ is infinite. We note that if $i$ and $j$ communicate with each other, then they must have the same period.
- The Markov chain is aperiodic if all nodes have period one.
Every Markov chain over a finite state space has a stationary distribution $\pi^*$ satisfying $P\pi^* = \pi^*$. If the chain is irreducible and aperiodic, the stationary distribution is unique, and $\pi_t \to \pi^*$ as $t \to \infty$ from any initial distribution.
A Markov chain is reversible if it has a stationary distribution $\pi^*$ for which it satisfies detailed balance, i.e. $p_{ij} \pi^*_j = p_{ji} \pi^*_i$ for all $i$ and $j$.
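As a small illustration (my own example, not from the notes): the random walk on an undirected graph, which moves to a uniformly chosen neighbor at each step, is reversible with stationary distribution proportional to the node degrees. The snippet below builds the column-stochastic transition matrix for a small graph and checks both stationarity and detailed balance numerically.

```python
import numpy as np

# Adjacency matrix of a small undirected graph (a 4-cycle plus one chord).
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
d = A.sum(axis=0)        # node degrees
P = A / d                # column-stochastic: P[i, j] = Pr(X_{t+1} = i | X_t = j)
pi = d / d.sum()         # candidate stationary distribution, proportional to degree

assert np.allclose(P @ pi, pi)           # stationarity: P pi = pi
assert np.allclose(P * pi, (P * pi).T)   # detailed balance: p_ij pi_j = p_ji pi_i
```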
The structure of the irreducible components of a Markov chain is reflected in the nonzero structure of the stationary vectors. There is a basis of row eigenvectors that are indicators for maximal accessing sets, and a basis of stationary distributions that are supported on minimal accessible sets. For example, suppose a Markov chain is reducible with two irreducible components. If neither set can access the other, the transition matrix is block diagonal:
$$P = \begin{bmatrix} P_{11} & 0 \\ 0 & P_{22} \end{bmatrix}.$$
If the second irreducible set can access the first irreducible set, the transition matrix is block upper triangular with a nonzero off-diagonal block:
$$P = \begin{bmatrix} P_{11} & P_{12} \\ 0 & P_{22} \end{bmatrix}.$$
Thus far, we have described properties of the Markov chain related to the stationary state or states. For an aperiodic chain with a unique stationary state, the rate of convergence to stationarity can be expressed in terms of the second largest eigenvalue, i.e. $\|\pi_t - \pi^*\| = O(|\lambda_2|^t)$, where $\lambda_2$ is the eigenvalue of $P$ with second-largest modulus. When $|\lambda_2|$ is close to one, the chain mixes slowly.
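For instance, under the convention above we can estimate the asymptotic mixing rate directly from the spectrum of the transition matrix (a minimal sketch; the function name is mine):

```python
import numpy as np

def mixing_rate(P):
    """Return |lambda_2|, the second-largest eigenvalue modulus of a transition matrix.

    The distance to stationarity decays like O(|lambda_2|^t),
    so values near 1 indicate slow mixing.
    """
    moduli = sorted(np.abs(np.linalg.eigvals(P)), reverse=True)
    return moduli[1]
```

Values of $|\lambda_2|$ very close to one signal a chain that is nearly reducible.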
For example, Simon-Ando theory deals with almost-reducible Markov chains, e.g. chains with transition matrices of the form
$$P = \begin{bmatrix} P_{11} & E_{12} \\ E_{21} & P_{22} \end{bmatrix},$$
where the off-diagonal blocks $E_{12}$ and $E_{21}$ are small in norm. Such a chain mixes quickly within each block but only slowly between blocks, and it has eigenvalues close to (but not equal to) one.
In the case of a Markov chain with disjoint connected components, when there is an eigenvalue at one with geometric multiplicity greater than one, the eigenvectors at one are not uniquely determined. However, the invariant subspaces (left and right) spanned by the eigenvectors are uniquely determined. In the weakly coupled cases, the eigenvectors associated with eigenvalues close to one generally are uniquely determined, but they are sensitive to small changes in the chain, and they may individually be hard to compute using standard iterations. However, for many applications, we don’t care about the eigenvectors, but about the invariant subspace that they span. It is this subspace that is useful in clustering, for example, and it is equally useful whether we represent it in terms of the eigenvector basis or in terms of another basis.
This perspective on mixing of Markov chains gives us another way of thinking about graph clustering and partitioning via normalized cuts. The eigenvalue problem that arises from normalized cuts is the generalized problem $Lx = \lambda Dx$, which we can rewrite as $(I - D^{-1}A)x = \lambda x$; under our column-vector convention, $D^{-1}A$ is the transpose of the transition matrix for the random walk on the graph. The generalized eigenvectors for small normalized-cut eigenvalues are therefore exactly the slowly-mixing modes of the walk: a good normalized cut separates regions within which the walk mixes quickly but between which probability mass moves only slowly.
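We can check this correspondence numerically (a throwaway sketch on a random weighted graph; all names are mine):

```python
import numpy as np

# The generalized eigenvalues of (L, D) are exactly 1 minus the eigenvalues
# of D^{-1} A, the (transposed) random-walk transition matrix.
rng = np.random.default_rng(0)
W = rng.random((6, 6))
W = np.triu(W, 1); W = W + W.T            # random symmetric edge weights
d = W.sum(axis=1); D = np.diag(d); L = D - W

lam = np.sort(np.linalg.eigvals(np.linalg.solve(D, L)).real)       # eigenvalues of D^{-1} L
mu = np.sort(1.0 - np.linalg.eigvals(np.linalg.solve(D, W)).real)  # 1 - eigenvalues of D^{-1} W
assert np.allclose(lam, mu)
```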
Geometric embedding
While we started the lecture with graph bisection problems in which all we looked at was a single eigenvector, we have now seen several different excuses to try to extract information about a graph from low-dimensional invariant subspaces of associated matrices:

- Our discussion of Markov chains suggests that we may want to look at a dominant invariant subspace of $P$ or $P^T$, where $P$ is the transition matrix for a Markov chain. We can think of this subspace as interesting because of what it tells us about natural clusters from the dynamics perspective (fast-mixing subchains within a slowly-mixing chain).
- We can also think of this subspace as interesting because it consists of “smooth” functions that form a basis for good approximate solutions to an optimization problem.
Another way to think about things is through the lens of kernel approximation. Recall from last lecture that we defined a kernel function associated with the pseudoinverse of the Laplacian, and associated “Laplace features” with that kernel.
These Laplace features can also be seen in terms of a low-dimensional embedding of the graph nodes in a Euclidean space in order to optimally approximate the resistance distance. More generally, if we have a squared distance matrix $D$ between $n$ objects (i.e. $d_{ij} = \|x_i - x_j\|^2$ for some configuration of points $x_1, \ldots, x_n$), then the centered matrix $B = -\frac{1}{2} H D H$ with $H = I - \frac{1}{n} ee^T$ is a positive semi-definite Gram matrix, and the eigenvectors of $B$ associated with the $k$ largest eigenvalues (scaled by the square roots of those eigenvalues) give us a mapping of objects to points in $\mathbb{R}^k$ that approximates the distance matrix as well as possible in a least squares sense.
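A minimal sketch of this classical multidimensional scaling construction (the function name is mine; D2 is an n-by-n matrix of squared pairwise distances):

```python
import numpy as np

def classical_mds(D2, k):
    """Embed n objects in R^k from a squared-distance matrix (classical MDS)."""
    n = D2.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    B = -0.5 * H @ D2 @ H                  # centered Gram matrix
    evals, evecs = np.linalg.eigh(B)       # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:k]      # indices of the k largest eigenvalues
    lam = np.clip(evals[idx], 0.0, None)   # clip tiny negatives from rounding
    return evecs[:, idx] * np.sqrt(lam)    # coordinates V_k Lambda_k^{1/2}
```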
Finding an “optimal” geometric embedding frequently reduces to an eigenvalue problem, but we can also compute such coordinate systems via other factorizations, such as pivoted QR or Cholesky factorization. This corresponds to an embedding that may not be optimal, but is focused on the relation between general data objects and a handful of reference examples.
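One concrete way to realize this (a sketch under my own naming; G is any positive semi-definite Gram matrix, such as the Laplacian pseudoinverse kernel) is to select reference columns by pivoted QR and form a Nyström-style factorization:

```python
import numpy as np
from scipy.linalg import qr

def landmark_embedding(G, k):
    """Embed objects via their relation to k reference examples chosen by pivoted QR."""
    _, _, piv = qr(G, pivoting=True)       # greedy column (landmark) selection
    idx = piv[:k]                          # indices of the k reference objects
    C = G[:, idx]                          # relation of every object to the landmarks
    W = G[np.ix_(idx, idx)]                # landmark-landmark block
    evals, evecs = np.linalg.eigh(W)
    evals = np.clip(evals, 1e-12, None)    # guard against tiny negative eigenvalues
    Winv_half = evecs @ np.diag(evals ** -0.5) @ evecs.T
    return C @ Winv_half                   # X with X X^T ~= C W^{-1} C^T ~= G
```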
Once we have a low-dimensional embedding that we think captures the relevant geometry, we can apply geometric methods such as k-means clustering or nearest-neighbor search.
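For example, a k-way clustering might run k-means on the rows of the embedding matrix (a brief sketch using scikit-learn; X is the n-by-d coordinate matrix from any of the embeddings above):

```python
from sklearn.cluster import KMeans

def cluster_embedding(X, k):
    """Cluster embedded graph nodes into k groups with k-means."""
    return KMeans(n_clusters=k, n_init=10).fit_predict(X)
```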
Footnotes
1. We will assume throughout this lecture that our graphs have a single connected component, so that the zero eigenvalue of $L$ has multiplicity 1.
2. We are taking the numerical linear algebra convention that probability distributions are represented by column vectors, and $p_{ij}$ denotes the probability mass of $X_{t+1} = i$ given $X_t = j$. Statisticians often denote probability mass functions by rows, and reverse the roles of $i$ and $j$.