CS 6241: Numerics for Data Science

Function approximation on graphs

Author

David Bindel

Published

April 10, 2025

Semi-supervised learning

Suppose we have a collection of objects that we want to classify one of two ways. Given some labeled examples, how should we label the remaining objects? This is a standard semi-supervised learning task. Of course, labels alone do not help us unless we have some idea how the objects are related to each other. In this lecture, we will assume that this information comes in the form of a weighted graph, where the objects to be classified are vertices and the edge weights represent the degree of similarity or connectedness. Our problem, then, is to label the remaining objects so that — as much as possible — similar objects will share the same label.

Before writing methods, we will first introduce some notation. We will start with the two-class case, and turn to the multi-class problem later. Let $x$ be the vector of class labels; ideally, we would like $x \in \{0,1\}^n$. We order the vertices so that the labeled examples appear last, and partition $x$ into unlabeled and labeled subvectors: $x = \begin{bmatrix} u \\ y \end{bmatrix}$, where $u \in \{0,1\}^{n_u}$ is unknown and $y \in \{0,1\}^{n_y}$ is known. We let the weighted adjacency matrix $A$ encode the similarity, and let $L = D - A$ be the weighted Laplacian, where $D$ is the diagonal matrix of weighted node degrees. To measure the quality of a class assignment, we look at the quadratic $$x^T L x = \sum_{(i,j) \in E} a_{ij} (x_i - x_j)^2,$$ which, for 0-1 vectors, gives the total weight of all between-class edges. We partition $L$ and $A$ conformally with the partitioning of $x$: $$L = \begin{bmatrix} L_{uu} & L_{uy} \\ L_{yu} & L_{yy} \end{bmatrix}.$$ Then we have $$x^T L x = u^T L_{uu} u + 2 u^T L_{uy} y + y^T L_{yy} y.$$ Alas, optimizing this function with respect to the class assignments $u$ is a challenging discrete optimization problem.
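
To make the quadratic concrete, here is a small NumPy check (a toy example of my own, not from the notes) that $x^T L x$ recovers the total weight of the between-class edges.

```python
import numpy as np

# Toy weighted adjacency matrix for a 4-node graph (hypothetical example).
A = np.array([[0., 2., 1., 0.],
              [2., 0., 0., 1.],
              [1., 0., 0., 3.],
              [0., 1., 3., 0.]])
L = np.diag(A.sum(axis=1)) - A       # weighted Laplacian L = D - A

x = np.array([1., 1., 0., 0.])       # nodes 0,1 in one class; nodes 2,3 in the other
print(x @ L @ x)                     # 2.0 = a_02 + a_13, the between-class weight
```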

Soft labels

The optimization is easier if we relax the problem, replacing binary class labels with real-valued soft labels. Then we have a continuous quadratic optimization for which the critical point equation is $$[Lx]_u = L_{uu} u + L_{uy} y = 0.$$ The matrix $L$ is positive semi-definite, with null vectors that are constant on each connected component; but we will assume that we have at least one labeled example in each connected component, so that $L_{uu}$ is nonsingular.
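
As a concrete illustration, the following sketch (with a made-up 6-node similarity graph, where the last two nodes are the labeled examples) solves $L_{uu} u = -L_{uy} y$ directly with NumPy.

```python
import numpy as np

# Hypothetical symmetric weighted adjacency; nodes 4 and 5 are the labeled ones.
A = np.array([[0., 2., 1., 0., 1., 0.],
              [2., 0., 1., 1., 0., 0.],
              [1., 1., 0., 2., 0., 1.],
              [0., 1., 2., 0., 0., 1.],
              [1., 0., 0., 0., 0., 0.],
              [0., 0., 1., 1., 0., 0.]])
L = np.diag(A.sum(axis=1)) - A           # weighted Laplacian

nu = 4                                   # number of unlabeled nodes (listed first)
y = np.array([1., 0.])                   # known labels on the last two nodes
Luu, Luy = L[:nu, :nu], L[:nu, nu:]

u = np.linalg.solve(Luu, -Luy @ y)       # soft labels from [Lx]_u = 0
print(u)                                 # every entry lies in [0, 1]
```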

It is worth looking at the scalar equations in order to understand this system in more detail. Let us write row $i$ of the critical point equations as $$d_i x_i - \sum_{j=1}^n a_{ij} x_j = 0,$$ and rearrange to find $$x_i = \sum_{j=1}^n \frac{a_{ij}}{d_i} x_j.$$ The weights $a_{ij}/d_i$ are non-negative and sum to one, so this tells us that for each $i$ where the label is unknown, we are choosing $x_i$ to be the weighted average of the neighbor labels. In particular, all of the computed soft labels will lie in the interval $[0,1]$.

The averaging interpretation of the equilibrium equations suggests an algorithm for computing the soft labels by interpreting the averaging operation as an update equation: $$x_i^{\mathrm{new}} = \sum_{j=1}^n \frac{a_{ij}}{d_i} x_j.$$ Several classical point relaxation methods of numerical linear algebra follow this approach, differing in the order in which the updates are computed and applied. Jacobi iteration updates the entire $u$ vector based on the old guesses; Gauss-Seidel sweeps through the labels and updates them in a fixed order, using the most recent guesses in each update; and Gauss-Southwell chooses the next label to update adaptively, based on the size of a corresponding residual element. In machine learning, these are known as label propagation methods, though label propagation methods generally include an additional rounding operation to turn soft labels into hard labels at each step. The convergence of such iterations depends strongly on the nature of the similarity graph: if it is “tightly connected,” the iterations converge quickly, while the iterations may converge much more slowly if the connectivity is relatively sparse. We will return to this point later.
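
Here is a minimal sketch of the Jacobi flavor of this iteration, reusing the A, y, and nu from the previous snippet (the helper name and iteration count are my own choices):

```python
import numpy as np

def jacobi_label_prop(A, y, nu, niter=200):
    """Jacobi-style label propagation: repeatedly replace each unlabeled soft
    label by the degree-weighted average of its neighbors' current labels."""
    d = A.sum(axis=1)                         # weighted degrees (assumed nonzero)
    x = np.concatenate([np.full(nu, 0.5), y]) # initial guess for unlabeled nodes
    for _ in range(niter):
        xnew = (A @ x) / d                    # averages computed from old values
        x[:nu] = xnew[:nu]                    # update only the unlabeled entries
    return x[:nu]

# print(jacobi_label_prop(A, y, nu))  # should approach the direct solve above
```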

From Laplacians to kernels

We have seen an expression of the form $u = -L_{uu}^{-1} L_{uy} y$ once before, in our initial discussion of Gaussian processes. There, instead of the graph Laplacian, we saw the precision matrix (the inverse of the covariance). We would therefore like to say that $L^{-1}$ is a kernel. Of course, we have to worry about a slight caveat: $L$ is not invertible! Hence, while we can still define a kernel associated with $L$, we will have to use a conditionally positive definite kernel associated with the pseudo-inverse $L^\dagger$.

The pseudoinverse as a kernel

The Laplacian pseudo-inverse is $L^\dagger$, corresponding to the minimal-norm least-squares solution to linear systems with $L$. In terms of the eigendecomposition $L = Q \Lambda Q^T$, the pseudo-inverse is $L^\dagger = Q \Lambda^\dagger Q^T$, where $\lambda_i^\dagger = \lambda_i^{-1}$ for nonzero $\lambda_i$, and $\lambda_i^\dagger = 0$ otherwise. Note that $L L^\dagger = L^\dagger L = J$, where $J$ is the centering matrix $J = I - ee^T/n$.
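
A short sketch of the construction, assuming a connected graph so that exactly one eigenvalue is zero (one could also just call np.linalg.pinv):

```python
import numpy as np

def laplacian_pinv(L, tol=1e-12):
    """Pseudo-inverse L^+ = Q diag(lam^+) Q^T, zeroing the null eigenvalue."""
    lam, Q = np.linalg.eigh(L)
    lam_dag = np.zeros_like(lam)
    nz = lam > tol
    lam_dag[nz] = 1.0 / lam[nz]
    return Q @ np.diag(lam_dag) @ Q.T

# For a connected graph, L @ laplacian_pinv(L) is the centering matrix J:
# n = L.shape[0]; J = np.eye(n) - np.ones((n, n)) / n
# assert np.allclose(L @ laplacian_pinv(L), J)
```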

Indeed, we can think of the soft label problem as a kernel method involving the (conditionally positive definite) kernel matrix $L^\dagger$; that is, $$u = [L^\dagger]_{uy} c + \mu e,$$ where the weight vector $c$ and offset $\mu$ are given by $$\begin{bmatrix} [L^\dagger]_{yy} & e \\ e^T & 0 \end{bmatrix} \begin{bmatrix} c \\ \mu \end{bmatrix} = \begin{bmatrix} y \\ 0 \end{bmatrix}.$$ To see this is equivalent to what we wrote before, we observe that $$L_{uu} [L^\dagger]_{uy} + L_{uy} [L^\dagger]_{yy} = J_{uy} = -ee^T/n \quad \text{and} \quad L_{uu} e + L_{uy} e = 0.$$ Because $e^T c = 0$ by construction and $y = [L^\dagger]_{yy} c + \mu e$ from the first block row, we therefore have $$L_{uu} u + L_{uy} y = L_{uu}\left([L^\dagger]_{uy} c + \mu e\right) + L_{uy}\left([L^\dagger]_{yy} c + \mu e\right) = 0,$$ which is indeed the equation that we used to define $u$ previously.
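
The bordered system above translates into a few lines of NumPy. The sketch below (my own wrapper, reusing L, y, and nu from the earlier snippets) forms the kernel weights $c$ and the offset $\mu$, and should reproduce the harmonic soft labels.

```python
import numpy as np

def kernel_soft_labels(L, y, nu):
    """Soft labels via the L^+ kernel: u = [L^+]_{uy} c + mu e, where c and mu
    solve the bordered system built from [L^+]_{yy} and the all-ones vector."""
    Ldag = np.linalg.pinv(L)                 # pseudo-inverse kernel matrix
    ny = len(y)
    M = np.block([[Ldag[nu:, nu:], np.ones((ny, 1))],
                  [np.ones((1, ny)), np.zeros((1, 1))]])
    sol = np.linalg.solve(M, np.concatenate([y, [0.0]]))
    c, mu = sol[:ny], sol[ny]
    return Ldag[:nu, nu:] @ c + mu

# Should agree with the direct harmonic solve:
# np.allclose(kernel_soft_labels(L, y, nu),
#             np.linalg.solve(L[:nu, :nu], -L[:nu, nu:] @ y))
```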

Laplacian features

It is also helpful to think about this kernel in terms of feature vectors. Let $\Psi^T = Q \Lambda^{-1/2}$, where $Q$ and $\Lambda$ are the parts of the eigendecomposition corresponding to the nonzero eigenvalues, so that $L^\dagger = \Psi^T \Psi$. The columns of $\Psi$ are the feature vectors in the graph associated with the kernel, and the soft label function is equivalent to $$x_i = \psi_i^T d + \mu,$$ where $d$ is the minimal norm vector such that $\Psi_y^T d + \mu e = y$ (here $\Psi_y$ and $\Psi_u$ collect the columns of $\Psi$ at the labeled and unlabeled nodes). To see that this is equivalent, consider the constrained optimization $$\mbox{minimize } \frac{1}{2} \|d\|^2 \mbox{ s.t. } \Psi_y^T d + \mu e = y,$$ and note that the KKT equations are $$\begin{bmatrix} I & -\Psi_y & 0 \\ \Psi_y^T & 0 & e \\ 0 & e^T & 0 \end{bmatrix} \begin{bmatrix} d \\ \lambda \\ \mu \end{bmatrix} = \begin{bmatrix} 0 \\ y \\ 0 \end{bmatrix}.$$ Eliminating $d = \Psi_y \lambda$ via the first block equation gives us $$\begin{bmatrix} \Psi_y^T \Psi_y & e \\ e^T & 0 \end{bmatrix} \begin{bmatrix} \lambda \\ \mu \end{bmatrix} = \begin{bmatrix} y \\ 0 \end{bmatrix},$$ which we can rewrite as $$\begin{bmatrix} [L^\dagger]_{yy} & e \\ e^T & 0 \end{bmatrix} \begin{bmatrix} \lambda \\ \mu \end{bmatrix} = \begin{bmatrix} y \\ 0 \end{bmatrix}.$$ This is the same system that we saw a moment ago, but with $c = \lambda$ reinterpreted as a vector of Lagrange multipliers. Therefore, the minimal norm coefficient vector in the feature space is $d = \Psi_y c$, which gives us the prediction $$u = \Psi_u^T d + \mu e = \Psi_u^T \Psi_y c + \mu e = [L^\dagger]_{uy} c + \mu e.$$
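
A sketch of the feature construction, again assuming a connected graph; the function name is mine, and the commented check that $\Psi^T \Psi = L^\dagger$ uses NumPy's pinv.

```python
import numpy as np

def laplacian_features(L, tol=1e-12):
    """Feature matrix Psi (one column per node) with Psi^T Psi = L^+."""
    lam, Q = np.linalg.eigh(L)
    nz = lam > tol                               # keep the nonzero eigenvalues
    return np.diag(lam[nz] ** -0.5) @ Q[:, nz].T

# Psi = laplacian_features(L)
# assert np.allclose(Psi.T @ Psi, np.linalg.pinv(L))
```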

We will see the eigenvector features associated with the $L^\dagger$ kernel again next time when we address unsupervised learning with graphs.

Electrical analogies

So far, we have focused on a purely mathematical intuition for the soft labeling problem. But we can also consider a more physical picture. We will consider the flow of current through a resistor network, which is a common choice in this business.¹ We suppose there are $n$ nodes connected by resistors. At each node, we have a voltage $v_i$, and on each resistor edge we have a resistance $r_{ij}$. There are two basic ingredients to the equations:

  • A constitutive law: For a linear resistor, the current from $i$ to $j$ is $I_{ij} = r_{ij}^{-1}(v_i - v_j)$.

  • A balance law: The total current leaving a node is zero: $\sum_j I_{ij} = 0$.

Putting these two ingredients together gives us the system $$\sum_{j \in N_i} r_{ij}^{-1} (v_i - v_j) = 0$$ at each node $i$ for which we do not explicitly control the voltage (by attaching the node to ground or a voltage supply) or inject a current. This gives us a weighted Laplacian linear system, where the Laplacian is known as the conductance matrix in circuit theory, and the edge weights $a_{ij}$ are the element conductances (inverse resistances).² Hence, the soft labeling problem is equivalent to drawing a resistive circuit network, attaching some nodes to a unit voltage supply (the examples labeled 1) and others to ground (the examples labeled 0). The intuition is that nodes that are connected by low-resistance edges or paths tend to have similar voltages. The Laplacian quadratic form is associated with resistive power loss.
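
A tiny sketch of the analogy (resistor values and node indices are made up): build the conductance matrix from edge resistances, pin one node at 1 V and one at ground, and solve for the interior voltages exactly as in the soft-label problem.

```python
import numpy as np

edges = {(0, 1): 0.5, (1, 2): 1.0, (0, 2): 2.0, (2, 3): 1.0}   # (i, j): r_ij
n = 4
A = np.zeros((n, n))
for (i, j), r in edges.items():
    A[i, j] = A[j, i] = 1.0 / r          # edge weights a_ij are conductances
L = np.diag(A.sum(axis=1)) - A           # conductance (weighted Laplacian) matrix

fixed, free = [0, 3], [1, 2]             # node 0 at the supply, node 3 grounded
vfixed = np.array([1.0, 0.0])
v = np.linalg.solve(L[np.ix_(free, free)], -L[np.ix_(free, fixed)] @ vfixed)
print(v)                                 # voltages at the free nodes, in [0, 1]
```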

Whether the analogy to circuit theory provides insight or not probably depends on your background. But the analogy is sufficiently widely used that it is worth knowing about, whether or not you find it provides you with any personal intuition.

Kernels and distances

Positive definite kernels define inner products in a feature space, and inner products define a Euclidean distance structure. That is, if $\psi$ is a feature map for a kernel on a space $X$, then $$\|\psi(x) - \psi(y)\|^2 = \psi(x)^T \psi(x) - 2 \psi(x)^T \psi(y) + \psi(y)^T \psi(y) = k(x,x) - 2k(x,y) + k(y,y).$$ In the positive definite case, we can therefore use the kernel to define a squared distance on $X$: $$d(x,y)^2 = k(x,x) - 2k(x,y) + k(y,y),$$ and this distance satisfies all the properties that a distance is supposed to satisfy (positivity, symmetry, and the triangle inequality).

Of course, the kernel associated with the graph Laplacian is only positive semi-definite because of the null vector. The usual hazard for semi-definite kernel functions is that we might have distinct points in $X$ with the same feature vector, and a distance between two points is supposed to be nonzero if the points are distinct. We do not have to worry about this problem with the Laplacian kernel, though, as the construction in this case looks like $$d_{ij}^2 = (e_i - e_j)^T L^\dagger (e_i - e_j);$$ and since the vectors $e_i - e_j$ are orthogonal to the null vector of all ones, this quantity will be positive for all $i \neq j$.
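
The squared kernel distances for all node pairs come directly from the diagonal and off-diagonal entries of $L^\dagger$; here is a short sketch (the helper name is mine):

```python
import numpy as np

def resistance_distance(L):
    """Matrix of d_ij^2 = (e_i - e_j)^T L^+ (e_i - e_j) over all node pairs."""
    Ldag = np.linalg.pinv(L)
    diag = np.diag(Ldag)
    return diag[:, None] + diag[None, :] - 2.0 * Ldag   # L+_ii + L+_jj - 2 L+_ij
```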

We sometimes call $d_{ij}^2$ the resistance distance, since in the electrical analogy it corresponds to the effective resistance between nodes $i$ and $j$ summarized over all possible network paths. In the physical analogy, the current balance law holds in the following generalized sense: if $S$ is the set of nodes for which we have specified voltages (label information), then for any $i \notin S$, $$\sum_{j \in S} d_{ij}^{-2} (v_i - v_j) = 0;$$ we can rewrite this as $$v_i = \frac{\sum_{j \in S} d_{ij}^{-2} v_j}{\sum_{j \in S} d_{ij}^{-2}};$$ that is, the computed value at node $i$ is a weighted average of the known values, where the weights are proportional to the inverse-square distances. This formula for the soft labeling function works even with other kernel functions — though, of course, we lose the circuit analogy!

The heat kernel

So far, we have focused on the inverse Laplacian graph kernel. However, this is not the only choice! Another kernel that we can use for many of the same purposes is the heat kernel, which is given by $\exp(-tL)$. The parameter $t$ is associated with time, and the entries of $\exp(-tL)$ can be interpreted in terms of the diffusion of heat from a source at $i$ to a target at $j$ within time $t$. Alternately, the entry $[\exp(-tL)]_{ij}$ can be interpreted as the probability that a continuous-time random walk starting from $i$ will be at $j$ at time $t$.
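
A sketch using SciPy's dense matrix exponential; the commented row-sum check reflects the fact that $e$ is a null vector of $L$, so $\exp(-tL)\, e = e$.

```python
import numpy as np
from scipy.linalg import expm

def heat_kernel(L, t):
    """Dense heat kernel exp(-tL); fine for small graphs."""
    return expm(-t * L)

# H = heat_kernel(L, 1.0)
# assert np.allclose(H @ np.ones(L.shape[0]), 1.0)   # rows sum to one
```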

Extending to multi-class learning

So far, we have focused on the two-class case with 0-1 labels. For the more general case where we want $k$ different classes, we use the same technique applied to $k$ indicator vectors, one for each class. That is, we replace the vector $x \in \mathbb{R}^n$ with the matrix $X \in \mathbb{R}^{n \times k}$. In the hard label case, we let $x_{ik}$ be one if $i$ belongs to class $k$ and zero otherwise. In the soft label case, we assign node $i$ to the class $k$ for which $x_{ik}$ is maximal. We also have that $\sum_k x_{ik} = 1$, and so sometimes $x_{ik}$ is interpreted as the probability that node $i$ belongs to class $k$.
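
A sketch of the multi-class solve (assuming the same ordering with unlabeled nodes first; Y is an $n_y \times k$ indicator matrix for the labeled nodes):

```python
import numpy as np

def multiclass_soft_labels(L, Y, nu):
    """Solve L_uu U = -L_uy Y for one soft-label column per class, then
    assign each unlabeled node to the class with the largest soft label."""
    U = np.linalg.solve(L[:nu, :nu], -L[:nu, nu:] @ Y)
    return U, U.argmax(axis=1)
```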

The Laplace solver building block

We conclude this lecture with a brief discussion of the landscape of methods for solving Laplacian linear systems.

For small systems — up to a few thousand nodes — there is not much to discuss. In these cases, forming and factoring the Laplacian matrix as a dense matrix is usually fine, and requires little thought or care. Past a few thousand nodes, though, the $O(n^3)$ cost of a dense matrix factorization becomes prohibitive. In this case, we can either

  • Use a sparse direct method that computes a factorization in less than $O(n^3)$ time, or

  • Use an iterative solver.

Of course, the two methods are not mutually exclusive, and we often use approximate factorizations as preconditioners for iterative methods. But it is important to recognize that many graphs are either well suited to iterative methods or well suited to sparse direct solvers. The key distinction is whether the graph can be separated by relatively small cuts (a problem we will consider in the next lecture).

When a graph can be partitioned with a small cut, we can try to solve it by a divide and conquer approach. Suppose that there is a small vertex separator that partitions the graph into two roughly-equal size pieces. If we label the two separate pieces first and then put the separator at the end, then we can write the Laplacian system in block form as $$L = \begin{bmatrix} L_{11} & 0 & L_{13} \\ 0 & L_{22} & L_{23} \\ L_{31} & L_{32} & L_{33} \end{bmatrix}.$$ The structure comes from the observation that the degrees of freedom in the two pieces (block 1 and block 2) are not directly connected. Block Gaussian elimination on the system $Lx = b$ gives us $$\begin{aligned} S &= L_{33} - L_{31} L_{11}^{-1} L_{13} - L_{32} L_{22}^{-1} L_{23} \\ S x_3 &= b_3 - L_{31} L_{11}^{-1} b_1 - L_{32} L_{22}^{-1} b_2 \\ L_{22} x_2 &= b_2 - L_{23} x_3 \\ L_{11} x_1 &= b_1 - L_{13} x_3 \end{aligned}$$ Hence, if we can quickly solve systems with $L_{11}$ and $L_{22}$, then we can form and solve a much smaller Schur complement system to couple them together. The nested dissection approach applies this idea recursively, and gives us a very fast solver if we can find small separators.
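
The block elimination above transcribes directly into code. This sketch assumes a nonsingular Laplacian-like system (such as the $L_{uu}$ block of a labeled problem) and takes the index lists p1, p2, ps for the two pieces and the separator as given.

```python
import numpy as np

def separator_solve(L, b, p1, p2, ps):
    """Solve L x = b by eliminating the two pieces onto the separator."""
    L11, L22, L33 = L[np.ix_(p1, p1)], L[np.ix_(p2, p2)], L[np.ix_(ps, ps)]
    L13, L23 = L[np.ix_(p1, ps)], L[np.ix_(p2, ps)]
    L31, L32 = L[np.ix_(ps, p1)], L[np.ix_(ps, p2)]
    # Schur complement and reduced right-hand side on the separator
    S = L33 - L31 @ np.linalg.solve(L11, L13) - L32 @ np.linalg.solve(L22, L23)
    bs = b[ps] - L31 @ np.linalg.solve(L11, b[p1]) - L32 @ np.linalg.solve(L22, b[p2])
    # Back-substitute: separator first, then the two independent pieces
    x = np.zeros_like(b)
    x[ps] = np.linalg.solve(S, bs)
    x[p1] = np.linalg.solve(L11, b[p1] - L13 @ x[ps])
    x[p2] = np.linalg.solve(L22, b[p2] - L23 @ x[ps])
    return x
```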

Of course, the extreme case of small separators is when we have a tree. In this case we can produce very fast solvers that run in time linear in the matrix size. One way to see this is in the electrical network analogy: we can compute the resistance between any pair of nodes quickly because it is just the sum of the resistances along the unique path between those nodes! More generally, graphs that are associated with nearest neighbor connectivity in 2D (or sometimes 3D) tend to have small tree width, and are good for sparse solvers. There are good sparse solvers in the world, and I do not recommend writing your own. But it is important to know which graphs are well suited to sparse solvers.

The opposite extreme is when there are no small separators. In this case, though, the smallest nonzero eigenvalue of the Laplacian is usually far from zero, so that the condition number of the Laplacian system is not too large. This is exactly the situation in which standard iterative methods work well.
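
For such graphs, a Krylov method applied to the sparse system is the natural tool; here is a sketch using SciPy's conjugate gradient solver on the soft-label system (sparse adjacency A, unlabeled nodes first, as in the earlier setup).

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def soft_labels_cg(A, y, nu):
    """Soft labels by conjugate gradients on L_uu u = -L_uy y."""
    d = np.asarray(A.sum(axis=1)).ravel()
    L = (sp.diags(d) - A).tocsr()            # sparse weighted Laplacian
    u, info = cg(L[:nu, :nu], -(L[:nu, nu:] @ y))
    return u                                 # info == 0 signals convergence
```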

Footnotes

  1. Other analogies involve pressure-driven flow through a pipe network or motion of a spring network.

  2. In a circuit theory class, I would write the conductances as $g_{ij} = r_{ij}^{-1}$. But to maintain notational consistency with the rest of the lecture, we will use $a_{ij}$ here.