
md"""
# Notebook for 2023-02-03
"""

md"""
## Representations

### Normalized numbers

In the notes, we said 1/3 as a normalized binary number looks like

$$(-1)^{0} \times (1.010101...)_2 \times 2^{-2}$$

The bit string for this number involves:

- A leading sign bit (which is 0 for a positive number)
- An 11-bit exponent field (`01111111101`) that stores the true exponent + 1023. The all-zeros and all-ones exponent patterns are reserved for subnormals (and zero) and for infinities/NaNs, respectively.
- The remaining 52 bits `0101010101010101010101010101010101010101010101010101` represent the significand *after* the binary point. We don't need to store the leading digit, because it is always one for a normalized binary number.
"""
bitstring(1.0/3.0)
bitstring(1.0/3.0)[1] # Sign bit
---
bitstring(1.0/3.0)[2:12] # Exponent bits
---
bitstring(1.0/3.0)[13:end] # Significand bits after the binary point
---


md"""
### Subnormals

The subnormal numbers (64-bit format) look like

$$(-1)^s \times (0.b_1 b_2 \ldots b_{52})_2 \times 2^{-1022}$$

where $s$ is the sign bit, $b_1$ through $b_{52}$ are the significand bits, and the exponent of $-1022$ is represented by the all-zeros bit pattern in the exponent field. A good nonzero example is $2^{-1025}$.
"""
bitstring(2.0^-1025)
---
bitstring(2.0^-1025)[2:12]
---
bitstring(2.0^-1025)[13:end]
---
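
md"""
As a sanity check on this form, $2^{-1025}$ should equal $(0.001)_2 \times 2^{-1022}$, i.e. $2^{-3} \times 2^{-1022}$, which corresponds to only the third significand bit being set.
"""
2.0^-1025 == 2.0^-3 * 2.0^-1022  # the subnormal value reconstructed from its significand and exponent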


md"""
Zero is stored with the same all-zeros exponent pattern as the subnormals, and all the other bits are zero as well.
"""
bitstring(0.0)
---


md"""
We do have the possibility of -0, which has a one for the sign bit (and everything else zero).
"""
bitstring(-0.0)
---


md"""
### Infinities

Infinities are represented by the all-ones exponent pattern and a significand field which is all zeros.
"""
bitstring(Inf)
---
bitstring(Inf)[2:12] # Exponent bits
---
bitstring(Inf)[13:end] # Significand bits
---
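
md"""
Overflow in ordinary arithmetic produces these same infinities; for example, doubling the largest finite double rounds to `Inf`.
"""
2 * floatmax(Float64)  # overflows the double format, giving Inf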


md"""
### NaNs

Not-a-Number (NaN) representations share the same all-ones exponent pattern, but can encode additional data in the (nonzero) significand bits.
"""
bitstring(0.0/0.0)
---
bitstring(0.0/0.0)[13:end]
---
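
md"""
As a small sketch, we can set a few low significand bits in the standard `NaN` bit pattern (the name `payload_nan` below is just for illustration); the result is still a NaN, now carrying extra data in those bits.
"""
let payload_nan = reinterpret(Float64, reinterpret(UInt64, NaN) | 0x00000000000000ff)
    isnan(payload_nan), bitstring(payload_nan)[13:end]
end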


md"""
## Illustrated floating point mishaps
"""

md"""
### Cancellation

The standard example here is the smaller root of a quadratic

$$z^2 - 2z + \epsilon = 0$$

Thinking of the smaller root $z_-$ as a function of $\epsilon$, we can use implicit differentiation to find

$$2(z-1) \frac{dz}{d\epsilon} + 1 = 0$$

which means that near zero, we expect

$$z_- = \frac{\epsilon}{2} + O(\epsilon^2)$$

Unfortunately, the naive formula utterly fails us for small $\epsilon$.
"""
test_smaller_root(ϵ) = 1.0 - sqrt(1.0 - ϵ)
---
---
---
---
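
md"""
To see the failure concretely, we can evaluate the naive formula at a small value of $\epsilon$ (the choice $\epsilon = 10^{-12}$ below is arbitrary) and look at its relative deviation from the expected $\epsilon/2$.
"""
abs(test_smaller_root(1e-12) - 0.5e-12) / 0.5e-12  # cancellation typically inflates this far beyond machine epsilon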


md"""
The problem here is cancellation: the error committed in computing $1-\epsilon$ is small relative to that quantity (which is around 1), but is big relative to the size of $z_-$ (which is around $\epsilon/2$). An alternate formulation (the product divided by the larger root) does not involve subtracting things that are very close, and so does not suffer from amplification of error due to cancellation.
"""
test_smaller_root2(ϵ) = ϵ/(1.0 + sqrt(1.0 - ϵ))
---
---
---
---
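
md"""
The same comparison for the rewritten formula: here the deviation from $\epsilon/2$ is tiny, dominated by the $O(\epsilon^2)$ term in the true root rather than by rounding.
"""
abs(test_smaller_root2(1e-12) - 0.5e-12) / 0.5e-12  # small, and not contaminated by cancellation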


md"""
### Sensitive subproblems

In the notes, we described the case of taking many square roots followed by many squares. The problem is that the problem of taking many squares is extremely ill-conditioned, so that even a relative error of machine epsilon in the result of the first loop can lead to a very large error in the result of the second loop (even assuming no further rounding error -- as is the case when the first loop computes exactly 1.0 in floating point!).
"""
function silly_sqrt(n=100)
    x = 2.0
    # Take n square roots, driving x toward 1...
    for k = 1:n
        x = sqrt(x)
    end
    # ...then square n times, which in exact arithmetic would recover x = 2
    for k = 1:n
        x = x^2
    end
    x
end
---
---
---
---
---
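
md"""
Calling the function shows the effect: after 100 square roots the intermediate value has rounded to exactly 1.0, and squaring cannot recover the original 2.0.
"""
silly_sqrt(100)  # returns 1.0 rather than 2.0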


md"""
### Exploding recurrences

Consider the computation of

$$E_n = \int_0^1 x^n e^{x-1} \, dx$$

By inspection, we know that $0 \leq E_n \leq E_0 = 1-1/e$ for $n \geq 0$. An asymptotic argument gives us that for large $n$, we should have $E_n \approx 1/(n+1)$. And integration by parts gives the recurrence

$$E_n = 1-nE_{n-1}.$$

Unfortunately, if we run this recurrence forward, the error at each step gets multiplied by a factor of $n$, and things soon become hopelessly awful.
"""
function bad_recurrence(n)
    E = 1-1/exp(1)   # E_0 = 1 - 1/e
    for j = 1:n
        E = 1-j*E    # forward recurrence E_j = 1 - j*E_{j-1}
    end
    E
end
---
---
---
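
md"""
Running the forward recurrence even a modest distance illustrates the blow-up: by $n = 20$ the initial rounding error has been multiplied by roughly $20! \approx 2.4 \times 10^{18}$.
"""
bad_recurrence(20)  # expected to land far outside [0, 1-1/e] once the amplified error swamps the true value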


md"""
A better approach is to run the same recurrence backward from our (crude) estimate.
"""

function better_recurrence(n)
    E = 1.0/exp(1.0)/(n+100)   # crude starting guess; its error is rapidly damped below
    for j = n+100:-1:n
        E = (1-E)/(j+1)        # backward recurrence E_j = (1 - E_{j+1})/(j+1)
    end
    E
end
---
---
---
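
md"""
The backward recurrence, by contrast, damps the error in the starting guess at every step, so the same $n$ now gives a plausible value in $[0, 1-1/e]$, close to the asymptotic estimate $1/(n+1)$.
"""
better_recurrence(20)  # a sensible value near 1/21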


md"""
### Undetected underflow

The case we mentioned in the notes comes from Bayesian statistics. Suppose we want to compute the log determinant of a large matrix -- we'll choose something simple.
"""
using LinearAlgebra
D = Diagonal(0.05*ones(500))
---


md"""
If we use the obvious formula, the product will underflow, leading us to take the log of zero.
"""
log(det(D))
---


md"""
Of course, this is not the right answer! If we recognize that the log of a product is the sum of the logs, we can easily figure out the true log determinant in this case.
"""
500*log(0.05)
---
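
md"""
The `LinearAlgebra` standard library also provides `logdet`, which should reproduce the sum-of-logs answer here rather than the `-Inf` we got from the naive formula.
"""
logdet(D)  # should agree with 500*log(0.05)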


md"""
### Bad branches

The key point here is that NaN fails all comparisons -- it is not part of the usual ordering relations, and violates our expectations. Therefore, we can get into trouble with branches when NaNs are involved.
"""
function test_negative(x)
    if x < 0.0
        "$(x) is negative"
    elseif x >= 0.0
        "$(x) is non-negative"
    else
        "$(x) is ... uh..."
    end
end
---
---
---
test_negative(0.0/0.0)
---


md"""
Of course, there are some other subtleties here, too! For example, the floating point standard contains both positive and negative zeros, but this code will not distinguish between them (they are the same according to all comparison operations).
"""
---
---
copysign(1.0, -0.0)
---
copysign(1.0, 0.0)
---
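
md"""
As a final check, the two zeros really are indistinguishable to the comparison operators, even though `copysign` (above) and `signbit` can tell them apart.
"""
0.0 == -0.0, signbit(0.0), signbit(-0.0)  # true, false, true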