CS 6241: Numerics for Data Science

Low dimensional structure in function approximation

Author

David Bindel

Published

March 27, 2025

Introduction

Last time, we discussed how to seek low-dimensional structure in the input space for functions $f : \Omega \subseteq \mathbb{R}^n \rightarrow \mathbb{R}$ for large $n$. In this lecture, we consider how to find low-dimensional structure in the output space for $f : \Omega \subseteq \mathbb{R}^n \rightarrow \mathbb{R}^m$ where $m$ is large and $n$ is modest.

Of course, it is also possible to combine the techniques to deal with high-dimensional input and output spaces, and this is frequently done in practice.

Interpolation and vector-valued functions

If $n$ is not too large and $f$ is smooth, a natural approach to approximating $f$ is interpolation. That is, let $\mu_1, \ldots, \mu_k \in \mathbb{R}^n$ be sample points and let $L_1(\mu), \ldots, L_k(\mu)$ be the Lagrange functions (or cardinal functions) forming a basis for a $k$-dimensional approximation space $V$.

The Lagrange functions, we might recall, have the property that $L_i(\mu_j) = \delta_{ij}$; we remind ourselves of this now, as it will be important again later. If we start off with an arbitrary basis $v_1, \ldots, v_k$ for $V$, we can compute the Lagrange functions as
$$
L_j(\mu) =
\begin{bmatrix} v_1(\mu) & \cdots & v_k(\mu) \end{bmatrix}
\begin{bmatrix}
v_1(\mu_1) & v_2(\mu_1) & \cdots & v_k(\mu_1) \\
v_1(\mu_2) & v_2(\mu_2) & \cdots & v_k(\mu_2) \\
\vdots & \vdots & & \vdots \\
v_1(\mu_k) & v_2(\mu_k) & \cdots & v_k(\mu_k)
\end{bmatrix}^{-1} e_j,
$$
where $e_j$ is the $j$th column of the identity. Then we can compute our approximation as
$$
f(\mu) \approx \hat{f}(\mu) = \sum_{j=1}^k f(\mu_j) L_j(\mu).
$$
To bound the error in such approximations, we use the same techniques described last time. The error is within a factor of $\max_\mu \sum_j |L_j(\mu)|$ of the best possible in the space, and we can bound the latter in terms of Taylor expansions (for example).
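To make the bookkeeping concrete, here is a minimal NumPy sketch (not from the lecture) of interpolation through an arbitrary basis: we form the matrix of basis function values at the sample points, and each evaluation of the interpolant solves one small linear system. The monomial basis and the test function used below are illustrative assumptions.

```python
import numpy as np

def lagrange_interpolant(basis, mus, F):
    """Return mu -> fhat(mu) = sum_j F[:, j] * L_j(mu).

    basis : callable, basis(mu) returns the length-k vector [v_1(mu), ..., v_k(mu)]
    mus   : length-k array of sample points mu_1, ..., mu_k
    F     : m-by-k array of snapshots, F[:, j] = f(mu_j)
    """
    M = np.array([basis(mu) for mu in mus])      # M[i, j] = v_j(mu_i)
    def fhat(mu):
        # L(mu) solves M^T L = v(mu), which gives L_i(mu_j) = delta_ij
        L = np.linalg.solve(M.T, basis(mu))
        return F @ L
    return fhat

# Example: interpolate a vector-valued f(mu) = [cos(mu), sin(mu), mu^2]
# with a cubic polynomial basis on four sample points.
basis = lambda mu: np.array([1.0, mu, mu**2, mu**3])
mus = np.array([0.0, 0.4, 0.7, 1.0])
f = lambda mu: np.array([np.cos(mu), np.sin(mu), mu**2])
F = np.column_stack([f(mu) for mu in mus])

fhat = lagrange_interpolant(basis, mus, F)
print(np.linalg.norm(fhat(0.55) - f(0.55)))      # small interpolation error
print(np.linalg.norm(fhat(0.4) - f(0.4)))        # essentially zero at a sample point
```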

The Lagrange functions $L_j(\mu)$ depend on the space $V$ and the points $\{\mu_j\}_{j=1}^k$; they do not depend on the particular basis we choose for $V$. However, the interpolant does depend on the space $V$! With the same set of points, we can get many different interpolants associated with different approximation spaces (polynomial spaces, spline spaces, piecewise linear spaces, or others). Different interpolation schemes are appropriate under different regularity assumptions: high-degree polynomial interpolation might make sense for interpolating a very smooth $f$, but not for a function that is nondifferentiable because of an absolute value (for example).

Independent of the exact space $V$, any method based on interpolation through points $\{\mu_1, \ldots, \mu_k\}$ will yield a result in the space $\operatorname{span}\{f(\mu_1), \ldots, f(\mu_k)\}$. But we can extract approximations to $f(\mu)$ from this subspace in ways other than interpolation. For example, suppose $f(\mu)$ is defined implicitly by a set of equations, e.g. $A(\mu) f(\mu) = b$; then we can choose an approximation $\hat{f}(\mu) = \sum_{j=1}^k c_j f(\mu_j)$ by solving a proxy system with $k \ll m$ unknowns, which we might be able to do much more cheaply than solving a new system from scratch. This suggests that if we have some way to evaluate the quality of a solution without a full evaluation (e.g. by looking at the residual norm for some defining equation), we might be able to use the subspace of approximations accessible by interpolation (which has dimension at most $k$), but with the coefficients determined by a reduced system of equations rather than by interpolation.

Learning from snapshots

If we are going to build a good approximating subspace from evaluation of f at some sample points (sometimes called snapshots), we need a good method for choosing those sample points. Some possible methods include:

  • Choose k points on some type of regular grid.

  • Choose k points at random according to some distribution (e.g. uniform over the domain Ω).

  • Choose k points according to a low-discrepancy sequence.

  • Choose k points according to an experimental design method (e.g. using a Latin hypercube).

Sampling on regular grids must be done with some care to avoid aliasing issues when f has some periodicity to it. Random sampling, on the other hand, tends to produce “clumps” in the domain, and so we may get more evenly spread points from a low-discrepancy sequence generator.

Frequently, when we look at a collection of snapshots, we will discover that they have some redundancy to them. In fact, we hope they do! The POD (proper orthogonal decomposition) approach to reducing a snapshot collection basically involves an SVD:
$$
\begin{bmatrix} f(\mu_1) & \cdots & f(\mu_k) \end{bmatrix} = U \Sigma V^T.
$$
If there is an $r$-dimensional subspace of the $k$-dimensional snapshot space that does a good job of capturing all the snapshots, we will have a small value for $\sigma_{r+1}$, and the truncated factor $U_r$ (the first $r$ columns of $U$) provides an orthonormal basis for the relevant space.

Taking the SVD of a snapshot matrix can be seen as a discrete approximation to computing the eigendecomposition of
$$
C = \int_\Omega f(\mu) f(\mu)^T \rho(\mu) \, d\mu,
$$
for which the dominant $r$ eigenvectors span the “optimal” $r$-dimensional invariant subspace for approximating general $f(\mu)$ over the domain. In principle, we could try to approximate the difference between the integral and the finite sum (based on the smoothness of $f$), but we usually don’t bother in practice. If we have enough samples to reasonably cover the parameter space, we typically assume a small value of $\sigma_{r+1}$ is good evidence that the whole image $f(\Omega)$ is well approximated by the range of $U_r$.
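As a small illustration, here is a NumPy sketch of the POD construction just described: stack the snapshots as columns, take an SVD, and truncate where the singular values fall below a relative tolerance. The synthetic snapshot matrix (built so that it lies in a 3-dimensional subspace) is an assumption made for the example.

```python
import numpy as np

def pod_basis(F, tol=1e-8):
    """Return U_r, an orthonormal basis capturing the columns of F to
    relative tolerance tol, along with the singular values."""
    U, sigma, _ = np.linalg.svd(F, full_matrices=False)
    r = int(np.sum(sigma > tol * sigma[0]))
    return U[:, :r], sigma

# Synthetic snapshots in R^1000 that lie (by construction) in a
# 3-dimensional subspace, so sigma_4 is tiny and r = 3.
rng = np.random.default_rng(0)
G = rng.standard_normal((1000, 3))
F = np.column_stack([G @ np.array([1.0, mu, mu**2])
                     for mu in np.linspace(0, 1, 40)])

Ur, sigma = pod_basis(F, tol=1e-6)
print(Ur.shape)                                  # (1000, 3)
print(np.linalg.norm(F - Ur @ (Ur.T @ F)))       # near machine precision
```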

Empirical interpolation method (EIM)

For constructing a low-dimensional space from which to draw approximations, an alternative to proper orthogonal decomposition is the empirical interpolation method (EIM). The EIM algorithm is a greedy method that simultaneously constructs a set of interpolation points (the so-called “magic points”) and a basis for a subspace spanned by associated samples of the function.

For this case, consider $f : X \times \Omega \rightarrow \mathbb{R}$, where $X$ is essentially an index set. For fixed $\mu \in \Omega$, we will write $f(x, \mu) = f_\mu(x)$. At each step, the EIM algorithm solves the problem
$$
(x_{j+1}, \mu_{j+1}) = \operatorname{argmax}_{x, \mu} \, |f(x, \mu) - \hat{f}_j(x, \mu)|,
$$
where $\hat{f}_j(x, \mu)$ is the interpolant for $f$ based on $\{(x_i, \mu_i)\}_{i=1}^j$. Along the way, one builds up a basis of functions for the nested interpolation spaces based on scaled error functions, i.e.
$$
h_{j+1}(x) = \frac{f(x, \mu_{j+1}) - \hat{f}_j(x, \mu_{j+1})}{f(x_{j+1}, \mu_{j+1}) - \hat{f}_j(x_{j+1}, \mu_{j+1})}.
$$
Readers familiar with univariate interpolation may recognize this as similar to the construction of the Newton basis for polynomial interpolation. The algorithm terminates when the maximum error falls below some tolerance.
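Here is a minimal sketch of the EIM greedy loop on discrete grids for x and μ, so that f is just a matrix F with F[i, j] = f(x_i, μ_j); the particular f used to fill F is an illustrative choice. Each pass selects the largest-magnitude entry of the current error, scales the corresponding error column to get the next basis function h, and updates the error.

```python
import numpy as np

def eim(F, tol=1e-10, max_iter=50):
    """Greedily select magic points (rows) and parameters (columns) of F,
    building the scaled-error basis h_1, h_2, ... as columns of H."""
    m, n = F.shape
    R = F.copy()                        # residual f - fhat_j on the grid
    x_idx, mu_idx, H = [], [], np.zeros((m, 0))
    for _ in range(max_iter):
        i, j = np.unravel_index(np.argmax(np.abs(R)), R.shape)
        if abs(R[i, j]) < tol:
            break
        h = R[:, j] / R[i, j]           # scaled error, equal to 1 at the new magic point
        x_idx.append(i)
        mu_idx.append(j)
        H = np.column_stack([H, h])
        # update the residual: the new interpolant matches every column at row i
        R = R - np.outer(h, R[i, :])
    return x_idx, mu_idx, H

# Example on a smooth f(x, mu) = 1 / (1 + x * mu) sampled on grids.
xs = np.linspace(0, 1, 200)
mus = np.linspace(0, 1, 100)
F = 1.0 / (1.0 + np.outer(xs, mus))
x_idx, mu_idx, H = eim(F, tol=1e-8)
print(len(x_idx))                                # a handful of terms suffice
Fhat = H @ np.linalg.solve(H[x_idx, :], F[x_idx, :])
print(np.max(np.abs(F - Fhat)))                  # comparable to the tolerance
```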

Those readers who are aficionados of numerical linear algebra may recognize another algorithm lurking in the shadows. To make this more explicit, note that the interpolant at step $j$ is
$$
\hat{f}_j(x, \mu) =
\begin{bmatrix} h_1(x) & h_2(x) & \cdots & h_j(x) \end{bmatrix}
\begin{bmatrix} c_1(\mu) \\ c_2(\mu) \\ \vdots \\ c_j(\mu) \end{bmatrix},
$$
where
$$
\begin{bmatrix}
1 & & & & \\
h_1(x_2) & 1 & & & \\
h_1(x_3) & h_2(x_3) & 1 & & \\
\vdots & \vdots & & \ddots & \\
h_1(x_j) & h_2(x_j) & \cdots & h_{j-1}(x_j) & 1
\end{bmatrix}
\begin{bmatrix} c_1(\mu) \\ c_2(\mu) \\ \vdots \\ c_j(\mu) \end{bmatrix}
=
\begin{bmatrix} f(x_1, \mu) \\ f(x_2, \mu) \\ \vdots \\ f(x_j, \mu) \end{bmatrix},
$$
and we are using the fact (which holds by construction) that $h_i(x_i) = 1$ and $h_i(x_\ell) = 0$ for $\ell < i$. We also observe that $c_i(\mu_j) = 0$ for $i > j$, again by construction.

If the sets $X$ and $\Omega$ are discrete, what we have done is to write
$$
P F Q \approx \begin{bmatrix} H_{11} \\ H_{21} \end{bmatrix} \begin{bmatrix} C_{11} & C_{12} \end{bmatrix},
$$
where $H_{11}$ is unit lower triangular and $C_{11}$ is upper triangular, and the permutation matrices $P$ and $Q$ reorder $X$ and $\Omega$ so that $x_1, \ldots, x_j$ and $\mu_1, \ldots, \mu_j$ appear first in the ordering of variables. Then we seek the largest magnitude element of
$$
P F Q - \begin{bmatrix} H_{11} \\ H_{21} \end{bmatrix} \begin{bmatrix} C_{11} & C_{12} \end{bmatrix}
= \begin{bmatrix} 0 & 0 \\ 0 & (PFQ)_{22} - H_{21} C_{12} \end{bmatrix}
$$
in order to obtain $x_{j+1}$ and $\mu_{j+1}$ and extend the factorization. This is exactly the procedure for Gaussian elimination with complete pivoting.
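The following short check (on a synthetic matrix chosen only for illustration) runs a few steps of elimination with complete pivoting and verifies the structure claimed above: the selected rows of the H factor form a unit lower triangular block, the selected columns of the C factor form an upper triangular block, and the product HC approximates F.

```python
import numpy as np

def gecp_lowrank(F, k):
    """k steps of Gaussian elimination with complete pivoting on a copy of F,
    returning the rank-k factors H, C and the pivot rows/columns."""
    R = F.copy()
    H, C, rows, cols = [], [], [], []
    for _ in range(k):
        i, j = np.unravel_index(np.argmax(np.abs(R)), R.shape)
        h, c = R[:, j] / R[i, j], R[i, :].copy()
        R = R - np.outer(h, c)          # eliminate the pivot row and column
        H.append(h); C.append(c); rows.append(i); cols.append(j)
    return np.column_stack(H), np.vstack(C), rows, cols

xs, mus = np.linspace(0, 1, 200), np.linspace(0, 1, 100)
F = np.exp(-np.outer(xs, mus))          # smooth, numerically low rank
H, C, rows, cols = gecp_lowrank(F, 10)

H11, C11 = H[rows, :], C[:, cols]
print(np.allclose(H11, np.tril(H11)), np.allclose(np.diag(H11), 1.0))  # unit lower triangular
print(np.allclose(C11, np.triu(C11)))                                  # upper triangular
print(np.max(np.abs(F - H @ C)))                                       # small residual
```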

For reasons that I don’t entirely understand, the interpretation of the EIM basis construction as Gaussian elimination with complete pivoting (GECP) doesn’t seem to appear in the literature, though a number of authors have come close. The closest I have found to making this explicit is the “continuous matrix” version of GECP described in a 2014 SIREV paper by Townsend and Trefethen, who comment that “we are not aware of explicit previous discussions of LU factorization of a cmatrix.” A 2016 SISC paper by Drmač and Gugercin suggests the use of GECP in a similar context, but applied to the POD basis, and with an eye only to selecting interpolation points (rather than selecting a subspace basis); for this purpose, the pivoted QR approach that they analyze in detail (leading to the Q-DEIM method) seems preferable.

Extracting solutions

The POD and EIM methods give us a subspace from which to draw approximations to $f(\mu)$. However, they do not tell us how to actually choose an approximation from that subspace. Of course, we can use interpolation; but, as discussed earlier, we might prefer a method that consults some defining set of equations.

For the sake of concreteness, let’s again consider the problem of approximating the function $f(\mu)$ implicitly defined by the equation $A(\mu) f(\mu) = b$. We will seek $\hat{f}(\mu) \in V$ to approximately satisfy this equation. But what does it mean to “approximately” satisfy the equation? There are several possible ways we might approach this:

  • Least squares: We might choose $\hat{f}(\mu) = \operatorname{argmin}_{y \in V} \|A(\mu) y - b\|_2$. Provided we have a basis $V$ for our space, this requires forming $A(\mu) V \in \mathbb{R}^{m \times k}$ (with $k$ matrix-vector products), solving a least squares problem (in $O(mk^2)$ time), and then performing a matrix-vector product to reconstruct the solution (in $O(mk)$ time). If $A$ is dense, the dominant cost is the time to compute $A(\mu) V$; this may or may not be true if $A$ is sparse.

  • Galerkin: We might choose a space $W$ and enforce the condition $A(\mu) y - b \perp W$. This is known as a Bubnov-Galerkin condition when $W = V$; otherwise, it is a Petrov-Galerkin condition. Unless $A$ has some special structure, the costs are similar to the least squares approach, with the greatest cost being the formation of $W^T A(\mu) V$ (the “projected system matrix”), involving $k$ matrix-vector products and an $O(mk^2)$ cost for a matrix-matrix multiplication. Note that if $A$ is a linear combination of a few (possibly nonlinear) functions of $\mu$, we might be able to do some pre-computation to get the online cost down to $O(k^3)$.

  • Interpolation: We might choose $k$ equations to enforce; the indices of these equations are our “interpolation points” in the discrete setting. In this case, we might be able to avoid the cost of computing $A(\mu) V$, since we only need a few rows. Note that this is a special case of Galerkin where the test basis $W$ consists of columns of the identity associated with the indices of the enforced equations. The challenge for this approach is choosing a good set of interpolation points (in the EIM setting, these are the points selected during the subspace construction). A sketch comparing all three options appears after this list.
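Here is a minimal sketch comparing the three extraction strategies on a synthetic parameterized system $A(\mu) y = b$; the matrices, right-hand side, and DEIM-style pivoted-QR point selection are illustrative assumptions rather than prescriptions from the lecture.

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(0)
m, k = 400, 8
A0 = np.diag(2.0 + np.arange(m) / m)                 # well-conditioned base matrix
B = rng.standard_normal((m, m))
A1 = 0.5 * (B + B.T) / np.sqrt(m)                    # smooth symmetric perturbation
A = lambda mu: A0 + mu * A1
b = rng.standard_normal(m)

# POD basis from snapshots at a few training parameters.
F = np.column_stack([np.linalg.solve(A(mu), b) for mu in np.linspace(0, 1, k)])
V = np.linalg.svd(F, full_matrices=False)[0]

mu = 0.37
AV = A(mu) @ V                                       # the m-by-k matrix A(mu) V

# 1. Least squares: minimize ||A(mu) V c - b||_2.
c_ls = np.linalg.lstsq(AV, b, rcond=None)[0]

# 2. Bubnov-Galerkin: enforce V^T (A(mu) V c - b) = 0.
c_g = np.linalg.solve(V.T @ AV, V.T @ b)

# 3. Interpolation: enforce k equations chosen by pivoted QR on V^T
#    (a DEIM-style point selection, used here only for illustration).
idx = qr(V.T, pivoting=True)[2][:k]
c_i = np.linalg.solve(AV[idx, :], b[idx])

y = np.linalg.solve(A(mu), b)                        # reference full solve
for name, c in [("lstsq", c_ls), ("Galerkin", c_g), ("interp", c_i)]:
    print(name, np.linalg.norm(V @ c - y) / np.linalg.norm(y))
```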

The error analysis of each of these methods follows the same general quasi-optimality recipe: we establish a bound on the minimum distance between $f(\mu)$ and $V$, then show that $\hat{f}(\mu)$ is within some constant factor of that best distance. The quasi-optimality factor depends on the stability of the projected problem, and can often be computed explicitly.

Addendum: Low-discrepancy sequences

Consider the problem of sampling on the unit hypercube, whether for quadrature or interpolation. One approach to distributing sample points is to take independent draws from the uniform distribution; however, random draws tend to include “clumps” (and leave some parts of the domain empty). What we would really like is a sampling scheme that ensures that points are reasonably spread out (like what happens with grids) and yet is “random-seeming” and not prone to aliasing issues (which is an advantage of random draws).

Discrepancy is a measure of the “clumpiness” of a distribution of sample points. A number of low-discrepancy sequences are available; these include Halton sequences, Sobol sequences, and van der Corput sequences, for example. These sequences can be generated rapidly (they mostly involve simple recurrences), and there is good software available for them. Sampling via low-discrepancy sequences (vs independent random draws) is the key distinction between quasi-Monte Carlo (QMC) methods and Monte Carlo (MC) methods, and the reason that QMC tends to converge faster than standard MC.
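As a small illustration (assuming SciPy ≥ 1.7 is available), the scipy.stats.qmc module provides Sobol and Halton generators along with a discrepancy measure, so we can compare independent uniform draws with a low-discrepancy sample directly:

```python
import numpy as np
from scipy.stats import qmc

rng = np.random.default_rng(0)
n, d = 256, 2                                    # n a power of 2 keeps Sobol balanced

X_mc = rng.random((n, d))                        # independent uniform draws
X_qmc = qmc.Sobol(d=d, scramble=True, seed=0).random(n)   # scrambled Sobol points

print(qmc.discrepancy(X_mc))                     # typically noticeably larger
print(qmc.discrepancy(X_qmc))                    # than the Sobol value
```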

Footnotes

  1. At least, two days of searching didn’t turn it up.

  2. Actually, there are several measures that go under the common heading of “discrepancy,” but all are getting at the same thing.