CS 6241: Numerics for Data Science

Function approximation fundamentals

Author

David Bindel

Published

March 20, 2025

Introduction

Our goal in the coming three weeks largely involves approximating functions $f : \Omega \subseteq \mathbb{R}^n \to \mathbb{R}^m$ by some “simple” family of functions (particularly polynomials and combinations of translates of a radial basis function). If we want to prove something, we will usually assume $\Omega$ is compact and $f$ has some degree of smoothness or regularity.

Approximation of continuous functions $f : [a,b] \to \mathbb{R}$ plays a key role in most introductions to analysis, whether classical or numerical. On the classical side, we have the Weierstrass theorem, stating that every such function can be uniformly approximated by a polynomial; equivalently, $\forall \epsilon > 0, \, \exists p \in \mathcal{P} : \|f - p\|_\infty \leq \epsilon$, where in this setting $\|u\|_\infty = \max_{x \in [a,b]} |u(x)|$. Work by Jackson and Bernstein made Weierstrass’s results more precise, connecting the degree of smoothness to the polynomial degree required to achieve a given accuracy. These types of results are generally referred to as constructive function theory, and are well described in now-classic books by Akhiezer and by Rivlin.
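
As a small aside (not from the original notes), the constructive side of this story can be seen numerically: the Bernstein polynomial $B_n(f)(x) = \sum_k f(k/n) \binom{n}{k} x^k (1-x)^{n-k}$, which underlies Bernstein’s proof of the Weierstrass theorem, converges uniformly to $f$ on $[0,1]$, if slowly. The test function and degrees below are arbitrary illustrative choices.

```python
import numpy as np
from math import comb

def bernstein_approx(f, n, x):
    """Evaluate the degree-n Bernstein polynomial of f on [0, 1]:
    B_n(f)(x) = sum_k f(k/n) * C(n, k) * x^k * (1-x)^(n-k)."""
    k = np.arange(n + 1)
    binom = np.array([comb(n, int(kk)) for kk in k], dtype=float)
    basis = binom * np.power.outer(x, k) * np.power.outer(1.0 - x, n - k)
    return basis @ f(k / n)

# The uniform error shrinks (slowly) as the degree grows.
f = lambda x: np.abs(x - 0.5)          # continuous, but not differentiable at 0.5
xs = np.linspace(0.0, 1.0, 1001)
for n in (10, 40, 160):
    print(n, np.max(np.abs(f(xs) - bernstein_approx(f, n, xs))))
```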

In numerical analysis, most students are first exposed to function approximation in discussions of polynomial interpolation: given function values $y_0, \ldots, y_d$ at nodes $x_0, \ldots, x_d$, find a polynomial $p \in \mathcal{P}_d$ (i.e. a polynomial with degree at most $d$) such that $p(x_j) = y_j$ for $j = 0, \ldots, d$. The study of polynomial interpolation stands on two legs: on the computational side, one needs a representation of the interpolating polynomial that can be computed (and evaluated at new points) quickly and without numerical instability; and on the theoretical side, one needs an understanding of how well polynomial interpolation approximates a target function $y = f(x)$ when $f$ has some known smoothness. The theory and computational practice come together in the design of adaptive approximation algorithms; the Chebfun system is a modern instance of tying these threads thoroughly together.
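
As a concrete computational example (a sketch, not part of the original notes), NumPy’s Chebyshev module implements one numerically stable representation of this kind: interpolation at Chebyshev points, with coefficients stored in the well-conditioned Chebyshev basis. The test function and degree are arbitrary choices.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

# Interpolate f at Chebyshev points of the first kind on [-1, 1]; the
# result is a coefficient vector in the Chebyshev basis.
f = lambda x: np.exp(x) * np.sin(5 * x)
coeffs = C.chebinterpolate(f, deg=30)

# Evaluate the interpolant on a fine grid and check the uniform error.
xs = np.linspace(-1, 1, 2000)
print(np.max(np.abs(f(xs) - C.chebval(xs, coeffs))))  # tiny for this smooth f
```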

Alas, approximating univariate functions by polynomials is not enough for our purposes, and things immediately become more complicated when we move to higher-dimensional spaces. When we go beyond one space dimension, we can no longer always uniquely define an interpolant from an $(n+1)$-dimensional linear space of functions with samples at $n+1$ data points; we need an additional condition in order to ensure the problem is well-posed (the famous Mairhuber-Curtis theorem). We can get around this issue by constraining the location of the sample points or by choosing a method that goes beyond taking approximants from a fixed linear space. At a computational level, high-dimensional approximation suffers from the “curse of dimensionality”: if all we have is a fixed amount of smoothness, the number of data points required to reach a target accuracy tends to grow exponentially in the dimension of the space. Therefore, we tend to seek methods that effectively lower the problem dimension, as we discuss in the coming week.

A concrete error bound

Some of this lecture will be rather abstract. In order to ground ourselves, let’s start with a very concrete problem. Consider $n+1$ points $x_0, \ldots, x_n \in \mathbb{R}^n$ with associated function values $y_j = f(x_j) \in \mathbb{R}$. Assuming $f$ has bounded second derivatives, how can we bound $|f(x) - p(x)|$, where $p(x) = c^T x + d$ is the linear interpolant through the data points?

Even in this simple case, we need an extra hypothesis to make sense of the problem; the system of equations
$$\begin{bmatrix} x_0^T & 1 \\ x_1^T & 1 \\ \vdots & \vdots \\ x_n^T & 1 \end{bmatrix} \begin{bmatrix} c \\ d \end{bmatrix} = y$$
should not be singular. Partial Gaussian elimination gives that this is equivalent to the condition that $x_1 - x_0, \ldots, x_n - x_0$ be a basis for $\mathbb{R}^n$.

Assuming that this system is nonsingular, we can evaluate $p(x)$ at a given target point $x$ by either solving the linear system for the coefficients $c$ and $d$, or solving the system
$$\begin{bmatrix} x_0 & x_1 & \cdots & x_n \\ 1 & 1 & \cdots & 1 \end{bmatrix} w = \begin{bmatrix} x \\ 1 \end{bmatrix}$$
and then evaluating $p(x) = w^T y$. If $x$ is in the convex hull of the $x_i$, then the weights $w$ are all non-negative.
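
Here is a minimal NumPy sketch (with made-up points and an arbitrary test function) of the two formulations above: solving for the affine coefficients $(c, d)$, and solving for the weights $w$ at a particular evaluation point.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
X = rng.standard_normal((n + 1, n))              # rows are the points x_0, ..., x_n
f = lambda x: np.sin(x[0]) + x[1] ** 2           # arbitrary test function
y = np.array([f(xj) for xj in X])
x = X.mean(axis=0)                               # target point (in the convex hull)

# Formulation 1: solve for the coefficients of p(x) = c^T x + d.
A = np.hstack([X, np.ones((n + 1, 1))])
cd = np.linalg.solve(A, y)
p1 = cd[:n] @ x + cd[n]

# Formulation 2: solve for weights with sum_j w_j x_j = x and sum_j w_j = 1,
# then evaluate p(x) = w^T y.
B = np.vstack([X.T, np.ones((1, n + 1))])
w = np.linalg.solve(B, np.concatenate([x, [1.0]]))
p2 = w @ y

print(p1, p2)            # the two evaluations agree
print(np.min(w) >= 0)    # non-negative weights, since x is in the convex hull
```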

This latter form is convenient for error analysis, together with the trick of writing each of the function evaluations in terms of a Taylor expansion about $x$; if we let $z_j = x_j - x$, then
$$f(x_j) = f(x) + f'(x) z_j + \frac{1}{2} z_j^T f''(\xi_j) z_j,$$
where $\xi_j$ is some point on the line segment connecting $x$ and $x_j$. Substituting in, we have
$$p(x) = \sum_j w_j f(x) + \sum_j w_j f'(x) z_j + \frac{1}{2} \sum_j w_j z_j^T f''(\xi_j) z_j;$$
and from the defining equations for $w$, we have that
$$\sum_j w_j = 1, \qquad \sum_j w_j z_j = 0,$$
so that we are left with
$$p(x) - f(x) = \frac{1}{2} \sum_j w_j z_j^T f''(\xi_j) z_j.$$
Therefore, if $\|f''\| \leq M$ uniformly, we have
$$|p(x) - f(x)| \leq \frac{1}{2} \left( \sum_j |w_j| \right) \left( M \max_j \|z_j\|^2 \right).$$
If $x$ is in the convex hull of the data locations, we have that the weights are all positive and sum to one, and the distances $\|z_j\|$ are bounded by the maximum edge length $d = \max_{k,\ell} \|x_k - x_\ell\|$, so that
$$|p(x) - f(x)| \leq \frac{1}{2} M d^2,$$
where $d$ is the maximum edge length $\|x_i - x_j\|$. In fact, we can tighten the constant somewhat, though not enormously.
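
A quick numerical sanity check of this bound, again with made-up data: for a quadratic $f$, the Hessian is constant, so we can take $M$ to be its spectral norm and compare the observed interpolation error against $\frac{1}{2} M d^2$.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n = 3
X = rng.standard_normal((n + 1, n))              # sample points (rows)
H = rng.standard_normal((n, n)); H = H + H.T     # a fixed symmetric Hessian
f = lambda x: 0.5 * x @ H @ x                    # quadratic test function
y = np.array([f(xj) for xj in X])

# Linear interpolant evaluated at a point in the convex hull via the weights w.
x = X.mean(axis=0)
B = np.vstack([X.T, np.ones((1, n + 1))])
w = np.linalg.solve(B, np.concatenate([x, [1.0]]))
err = abs(w @ y - f(x))

M = np.linalg.norm(H, 2)                                       # bound on ||f''||
d = max(np.linalg.norm(a - b) for a, b in combinations(X, 2))  # max edge length
print(err, 0.5 * M * d ** 2)                                   # err <= M d^2 / 2
```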

There are a few things to take away from this concrete example (apart from the error bound, which is useful in its own right):

  • We need a hypothesis on the points at the outset to ensure that we can solve the interpolation system (this property of the points is sometimes called well-poisedness for interpolation).

  • There are multiple ways of formulating the interpolation problem (solve for the coefficients in an affine function, solve for weights associated with a given point).

  • The fact that the method exactly reconstructs linear functions played an important (if implicit) role in our error analysis.

Let’s now consider how we would tackle something like this in a more abstract setting.

Approximation from a fixed space

Consider a Banach space $F$ (a complete, normed vector space) containing a target function $f$, and suppose we seek to approximate $f$ by some function $v \in V$, where $V \subset F$ is a finite-dimensional subspace. We now have two distinct questions:

  • Does V contain a good approximation to the target function?

  • If a good approximation exists, how can it be found?

There are several different ways that we might choose a good approximation, including:

  • minimizing a squared error (in the case that $F$ is a Hilbert space);

  • interpolating at a set of points;

  • forcing certain linear functionals of the error to be zero (a Petrov-Galerkin approach);

  • minimizing a (weighted) squared error at a larger set of sample points; or

  • minimizing the maximum error over a large set of points (leading to a so-called Chebyshev optimization problem).

For the moment, we consider approximation schemes that are linear in $f$ and are exact on $V$. This includes all of the schemes mentioned above except the last (Chebyshev optimization leads to a nonlinear scheme). In these methods, we approximate $f$ by $Pf$, where $P$ is a linear operator with range $V$ such that $Pv = v$ for all $v \in V$. These two conditions imply that the approximation operator $P$ is a projection, i.e. $P^2 = P$. We also have an associated error projection $I - P$. Now, note that for any $v \in V$, we have
$$\|f - Pf\| = \|(I-P)(f-v)\| \leq \|I-P\| \, \|f-v\|,$$
and therefore
$$\|f - Pf\| \leq \|I-P\| \inf_{v \in V} \|f - v\|.$$
Hence, approximation via $P$ is quasi-optimal, yielding an approximation within a factor of $\|I-P\|$ of the best possible approximation within the space.

Making this a little more concrete, suppose we use the max norm on $\Omega$, i.e. $\|f\|_\infty = \sup_{x \in \Omega} |f(x)|$. Then $\|I-P\| \leq 1 + \|P\|$, and for an interpolation scheme $\|P\|$ is the maximum over $\Omega$ of the so-called Lebesgue function
$$\Lambda(x) = \sum_{j=1}^n |v_j(x)|,$$
where the $v_j$ are the so-called cardinal functions (or Lagrange functions) for which $v_j(x_i) = \delta_{ij}$.
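
To make $\|P\|$ tangible, the sketch below (illustrative, not from the notes) estimates the Lebesgue constant $\max_\Omega \Lambda(x)$ for polynomial interpolation on $[-1,1]$ by summing the cardinal functions evaluated in barycentric form. Equispaced nodes give an exponentially large constant, while Chebyshev nodes keep the quasi-optimality factor $1 + \|P\|$ modest.

```python
import numpy as np
from scipy.interpolate import BarycentricInterpolator

def lebesgue_constant(nodes, ngrid=5000):
    """Estimate max_x sum_j |v_j(x)| by sampling on a fine grid."""
    xs = np.linspace(-1, 1, ngrid)
    Lam = np.zeros(ngrid)
    for j in range(len(nodes)):
        e = np.zeros(len(nodes))
        e[j] = 1.0                                # data for the j-th cardinal function
        Lam += np.abs(BarycentricInterpolator(nodes, e)(xs))
    return Lam.max()

deg = 20
equi = np.linspace(-1, 1, deg + 1)                # equispaced nodes
cheb = np.cos(np.pi * np.arange(deg + 1) / deg)   # Chebyshev (extreme) points
print(lebesgue_constant(equi))   # grows exponentially with the degree
print(lebesgue_constant(cheb))   # grows only logarithmically with the degree
```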

Kolmogorov n-widths and convergence rates

The quasi-optimality framework described above only helps with half of our problem. We want an approximation within a constant factor of the best thing possible; we also want the best thing possible to be good! This is an instance of the consistency-stability paradigm common in numerical analysis: stability gives a small quasi-optimality constant, while consistency means that the best possible approximation has a small error. Having briefly described the stability piece, let’s now talk about accuracy.

Let $K$ be a (usually compact) subset of $F$. When $F$ has a norm (or quasi-norm) that measures smoothness, the set $K$ may be associated with a unit ball of functions up to a prescribed smoothness. In our concrete case described before, for instance, we might consider the set of all functions whose second derivative is pointwise bounded in norm by some constant $M$. The Kolmogorov $n$-width describes the worst-case approximation error for functions from $K$ under a best-case choice of $n$-dimensional approximation spaces of functions; that is,
$$d_n(K) := \inf_{\dim(V_n) = n} \sup_{f \in K} E(f, V_n),$$
where $E(f, V_n)$ is the minimum error norm for approximating $f$ by elements of $V_n$. In general, these $n$-widths decay as $O(n^{-\alpha})$ for some $\alpha$ associated with the smoothness of the functions in $K$.
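
The $n$-width itself is rarely available in closed form, but the flavor of the $O(n^{-\alpha})$ decay can be seen with a computable proxy: the uniform error of Chebyshev interpolation (a near-best polynomial approximation) for functions of different smoothness. This is an illustrative sketch, not a computation of $d_n(K)$.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

xs = np.linspace(-1, 1, 2000)
tests = [(lambda x: np.abs(x) ** 3, "|x|^3 (finitely smooth)"),
         (lambda x: 1.0 / (1.0 + 25 * x ** 2), "1/(1+25x^2) (analytic)")]
for f, label in tests:
    errs = []
    for n in (8, 16, 32, 64, 128):
        c = C.chebinterpolate(f, deg=n)
        errs.append(np.max(np.abs(f(xs) - C.chebval(xs, c))))
    print(label, ["%.1e" % e for e in errs])
# The first error decays algebraically (roughly like n^{-3});
# the second decays geometrically in n.
```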