Notebook for 2023-02-03
Representations
Normalized numbers
In the notes, we said 1/3 as a normalized binary number looks like
$$(-1)^{0} \times (1.0101\ldots)_2 \times 2^{-2}$$
The bit string for this number involves:

- A leading sign bit (which is 0 for a positive number).
- An 11-bit exponent field, `01111111101`, that represents the true exponent plus 1023. The all-zeros and all-ones exponent patterns are reserved for subnormals and for infinity/NaN representations, respectively.
- The remaining 52 bits, `0101010101010101010101010101010101010101010101010101`, which represent the significand after the binary point. We don't need to store the leading digit, because it is always one for a normalized binary number.
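As a sanity check, we can pull these three fields out of the stored bits. Here is a minimal sketch (in Python, using the standard `struct` module; the field slicing is just for illustration):

```python
import struct

# Pack 1/3 as an IEEE double, then reinterpret the 8 bytes as a 64-bit integer.
(n,) = struct.unpack(">Q", struct.pack(">d", 1/3))
s = format(n, "064b")

print(s[0])      # sign bit: 0
print(s[1:12])   # exponent field: 01111111101 (stored 1021 = -2 + 1023)
print(s[12:])    # 52 significand bits after the implicit leading 1
```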
Subnormals
The subnormal numbers (64-bit format) look like
$$(-1)^s \times (0.b_1 b_2 \ldots b_{52})_2 \times 2^{-1022}$$
where $s$ is the sign bit, $b_1$ through $b_{52}$ are the significand bits, and the fixed exponent of $-1022$ is signaled by the all-zeros bit pattern in the exponent field. A good nonzero example is $2^{-1025}$.
The number 0 is encoded like a subnormal, so all the exponent bits (as well as all the other bits) are zero.
We do have the possibility of -0, which has a one for the sign bit (and everything else zero).
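The same bit-extraction trick makes all of this visible. A sketch (again Python; the `fields` helper is a hypothetical convenience, not a library routine):

```python
import struct

def fields(x):
    # Split the IEEE double encoding of x into (sign, exponent, significand).
    (n,) = struct.unpack(">Q", struct.pack(">d", x))
    s = format(n, "064b")
    return s[0], s[1:12], s[12:]

print(fields(2.0**-1025))  # exponent field all zeros; significand 001000...0
print(fields(0.0))         # everything zero
print(fields(-0.0))        # sign bit one, everything else zero
```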
Infinities
Infinities are represented by the all-ones exponent pattern and a significand field which is all zeros.
NaNs
Not-a-Number (NaN) representations share the same all-ones exponent pattern, but can encode additional data in the (nonzero) significand bits.
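Both special values are easy to spot in the bit fields; an illustrative Python sketch (reusing the same hypothetical `fields` helper as above):

```python
import struct

def fields(x):
    (n,) = struct.unpack(">Q", struct.pack(">d", x))
    s = format(n, "064b")
    return s[0], s[1:12], s[12:]

print(fields(float("inf")))   # exponent all ones, significand all zeros
print(fields(float("-inf")))  # same, but with the sign bit set
print(fields(float("nan")))   # exponent all ones, significand nonzero
```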
Illustrated floating point mishaps
Cancellation
The standard example here is the smaller root of a quadratic
$$z^2 - 2z + \epsilon = 0$$
Thinking of the smaller root $z_-$ as a function of $\epsilon$, we can use implicit differentiation to find
$$2(z-1) \frac{dz}{d\epsilon} + 1 = 0$$
which means that near zero, we expect
$$z_- = \frac{\epsilon}{2} + O(\epsilon^2)$$
Unfortunately, the naive formula $z_- = 1 - \sqrt{1-\epsilon}$ utterly fails us for small $\epsilon$.
The problem here is cancellation: the error committed in computing $1-\epsilon$ and its square root is small relative to those quantities (which are around 1), but is big relative to the size of $z_-$ (which is around $\epsilon/2$). An alternate formulation (the product divided by the larger root) does not involve subtracting things that are very close, and so does not suffer from amplification of error due to cancellation.
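To make this concrete, here is a rough sketch (in Python) comparing the naive formula with the rewrite $z_- = \epsilon/z_+ = \epsilon/(1+\sqrt{1-\epsilon})$, which follows because the product of the two roots is $\epsilon$:

```python
import math

for k in [4, 8, 12, 16]:
    eps = 10.0**(-k)
    naive  = 1.0 - math.sqrt(1.0 - eps)          # cancellation for small eps
    stable = eps/(1.0 + math.sqrt(1.0 - eps))    # product over the larger root
    print(f"eps=1e-{k:02d}  naive={naive:.10e}  stable={stable:.10e}")
```

By $\epsilon = 10^{-16}$, the naive version returns exactly zero, while the stable version still agrees with $\epsilon/2$ to full precision.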
Sensitive subproblems
In the notes, we described the case of taking many square roots followed by many squares. The problem is that the problem of taking many squares is extremely ill-conditioned, so that even a relative error of machine epsilon in the result of the first loop can lead to a very large error in the result of the second loop (even assuming no further rounding error – as is the case when the first loop computes exactly 1.0 in floating point!).
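A sketch of that experiment (Python; the iteration count of 60 is arbitrary, and anything past about 55 behaves the same way):

```python
import math

x = 100.0
for _ in range(60):
    x = math.sqrt(x)  # x creeps so close to 1 that it rounds to exactly 1.0
for _ in range(60):
    x = x*x           # squaring exactly 1.0 just returns 1.0
print(x)              # 1.0, not the 100.0 we started from
```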
Exploding recurrences
Consider the computation of
$$E_n = \int_0^1 x^n e^{-x} \, dx$$
By inspection, we know that $0 \leq E_n \leq E_0 = 1-1/e$ for $n \geq 0$. An asymptotic argument gives us that for large $n$, we should have $E_n \approx 1/(e(n+1))$. And integration by parts gives the recurrence
$$E_n = nE_{n-1} - \frac{1}{e}.$$
Unfortunately, if we run this recurrence forward, the error at each step gets multiplied by a factor of $n$, and things soon become hopelessly awful.
A better approach is to run the same recurrence backward (via $E_{n-1} = (E_n + 1/e)/n$, which divides the error by $n$ at each step) from our (crude) asymptotic estimate.
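A sketch of both directions (Python; starting the backward sweep 20 terms beyond where we want accuracy is an arbitrary choice):

```python
import math

def forward(n):
    # Run E_k = k*E_{k-1} - 1/e forward from E_0 = 1 - 1/e.
    E = 1.0 - 1.0/math.e
    for k in range(1, n+1):
        E = k*E - 1.0/math.e
    return E

def backward(n, extra=20):
    # Run E_{k-1} = (E_k + 1/e)/k down from the crude asymptotic
    # estimate E_m ~ 1/(e*(m+1)) at m = n+extra; the initial error
    # shrinks by a factor of k at every step.
    E = 1.0/(math.e*(n + extra + 1))
    for k in range(n + extra, n, -1):
        E = (E + 1.0/math.e)/k
    return E

print(forward(20))   # hopelessly wrong: errors amplified by up to 20! ~ 2.4e18
print(backward(20))  # ~0.018350, the true E_20 (the crude 1/(21e) is ~0.0175)
```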
Undetected underflow
The case we mentioned in the notes comes from Bayesian statistics. Suppose we want to compute the log determinant of a large matrix – we'll choose something simple.
If we use the obvious formula, the product will underflow to zero, leading us to take the log of zero.
Of course, this is not the right answer! If we recognize that the log of a product is the sum of the logs, we can easily figure out the true log determinant in this case.
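For instance, with the hypothetical "simple" choice of a diagonal matrix with 400 diagonal entries equal to $0.1$, the determinant is $10^{-400}$, far below the double-precision underflow threshold. A Python sketch:

```python
import math

d = [0.1]*400       # diagonal of a (diagonal) 400-by-400 matrix

det = 1.0
for di in d:
    det *= di       # underflows: 10**-400 is below even the subnormal range
print(det)          # 0.0
# math.log(det) would now blow up (log of zero)

# Log of a product = sum of the logs, computed without underflow:
logdet = sum(math.log(di) for di in d)
print(logdet)       # about -921.034 = 400*log(0.1)
```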
Bad branches
The key point here is that NaN fails all comparisons – it is not part of the usual ordering relations, and violates our expectations. Therefore, we can get into trouble with branches when NaNs are involved.
Of course, there are some other subtleties here, too! For example, the floating point standard contains both positive and negative zeros, but comparison-based code will not distinguish between them (they are the same according to all comparison operations).
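A short Python sketch of both points (the `sign` function is a made-up example, not a standard routine):

```python
import math

nan = float("nan")
print(nan == nan, nan < 1.0, nan > 1.0)   # False False False

def sign(x):
    # The branches look exhaustive, but NaN fails every comparison
    # and silently falls through to the final case.
    if x > 0:
        return 1
    elif x < 0:
        return -1
    else:
        return 0

print(sign(nan))                  # 0, even though NaN is not zero

print(0.0 == -0.0)                # True: comparisons cannot tell them apart
print(math.copysign(1.0, -0.0))   # -1.0: the sign bit is still there
```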