Still looking at our first principle from the intro lecture...
Principle #1: Write your learning task as an optimization problem, and solve it via fast algorithms that update the model iteratively with easy-to-compute steps using numerical linear algebra.
Many, if not most, machine learning training algorithms use gradients to optimize a function.
What is a gradient?
Suppose I have a function $f$ from $\R^d$ to $\R$. The gradient, $\nabla f$, is a function from $\R^d$ to $\R^d$ such that
$$\left(\nabla f(w) \right)_i = \frac{\partial}{\partial w_i} f(w) = \lim_{\delta \rightarrow 0} \frac{f(w + \delta e_i) - f(w)}{\delta},$$that is, it is the vector of partial derivatives of the function.
Another, perhaps cleaner (and basis-independent), definition is that $\nabla f(w)^T$ is the linear map such that for any $\Delta \in \R^d$
$$\nabla f(w)^T \Delta = \lim_{a \rightarrow 0} \frac{f(w + a \Delta) - f(w)}{a}.$$More informally, it is the unique vector $\nabla f(w_0)$ such that $f(w) \approx f(w_0) + (w - w_0)^T \nabla f(w_0)$ for $w$ nearby $w_0$.
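To see why these two definitions agree (assuming $f$ is differentiable, so the linear approximation above is valid), we can expand the directional derivative coordinate by coordinate:
$$\lim_{a \rightarrow 0} \frac{f(w + a \Delta) - f(w)}{a} = \sum_{i=1}^d \Delta_i \cdot \frac{\partial}{\partial w_i} f(w) = \nabla f(w)^T \Delta.$$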
Some examples to try working out by hand:
$f(x) = x^T A x$
$f(x) = \| x \|_2^2 = \sum_{i=1}^d x_i^2 = x^T x = x^T I x$
$f(x) = \| x \|_1 = \sum_{i=1}^d | x_i |$
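For reference, here is roughly what these gradients work out to (my summary, assuming $A$ is a constant matrix; the last gradient only exists where every coordinate of $x$ is nonzero):
$$\nabla \left( x^T A x \right) = (A + A^T) x, \qquad \nabla \| x \|_2^2 = 2 x, \qquad \left( \nabla \| x \|_1 \right)_i = \operatorname{sign}(x_i).$$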
What we just did is called symbolic differentiation.
The naive symbolic differentiation approach has a couple of problems.
Imagine that we have some functions $f$, $g$, and $h$, and we are interested in computing the derivative of the function $f(g(h(x)))$ with respect to $x$. By the chain rule, $$
\frac{d}{dx} f(g(h(x)))
=
f'(g(h(x))) \cdot g'(h(x)) \cdot h'(x).
$$ Imagine that we just copy-pasted this math into Python. We'd get:
def grad_fgh(x):
    return grad_f(g(h(x))) * grad_g(h(x)) * grad_h(x)
This code computes $h(x)$ twice!
To avoid this, we could imagine writing something like this instead:
def better_grad_fgh(x):
    hx = h(x)
    return grad_f(g(hx)) * grad_g(hx) * grad_h(x)
This code only computes $h(x)$ once. But we had to do some manual work to pull out common subexpressions, work that goes beyond what the plain chain rule gives us.
Now you might ask the very reasonable question: So what? We just recomputed one function. Could this really have a significant impact on performance?
To answer this question, suppose that we now have functions $f_1, f_2, \ldots, f_k$, and consider differentiating the more complicated function $f_k(f_{k-1}(\cdots f_2(f_1(x)) \cdots))$. The chain rule will give us \begin{align*} &\frac{d}{dx} f_k(f_{k-1}(\cdots f_2(f_1(x)) \cdots)) \\&\hspace{2em}= f_k'(f_{k-1}(\cdots f_2(f_1(x)) \cdots)) \cdot f_{k-1}'(f_{k-2}(\cdots f_2(f_1(x)) \cdots)) \\&\hspace{4em}\cdots f_2'(f_1(x)) \cdot f_1'(x). \end{align*} If each function $f_i$ or $f_i'$ takes $O(1)$ time to compute, how long can this derivative take to compute if we evaluate the expression naively, exactly as written? Each of the $k$ factors re-evaluates a nested composition of depth up to $k-1$ from scratch, so the naive evaluation takes $O(k^2)$ time, while sharing the intermediate values $f_1(x), f_2(f_1(x)), \ldots$ brings the cost down to $O(k)$.
Computing a derivative by just applying the chain rule and then naively copying the expression into code can be asymptotically slower than other methods.
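To make this concrete, here is a small sketch (my own illustration, not from the lecture, using hypothetical names like step and naive_chain_derivative) that differentiates a $k$-deep composition of a single function step(u) = u*u in both ways and counts how many times step is evaluated: a literal transcription of the chain rule costs $O(k^2)$ evaluations, while reusing the intermediate values costs only $O(k)$.
call_count = 0

def step(u):
    # hypothetical inner function f_i(u) = u*u, instrumented to count calls
    global call_count
    call_count += 1
    return u * u

def dstep(u):
    # its derivative f_i'(u) = 2*u
    return 2 * u

def naive_chain_derivative(x, k):
    # literal transcription of the chain rule: the i-th factor re-evaluates
    # the inner composition of depth i from scratch, so O(k^2) calls to step
    total = 1.0
    for i in range(k):
        u = x
        for _ in range(i):
            u = step(u)
        total *= dstep(u)
    return total

def shared_chain_derivative(x, k):
    # compute each intermediate value once and reuse it: O(k) calls to step
    total = 1.0
    u = x
    for _ in range(k):
        total *= dstep(u)
        u = step(u)
    return total

for deriv in [naive_chain_derivative, shared_chain_derivative]:
    call_count = 0
    print(deriv.__name__, deriv(1.0, 20), "evaluations of step:", call_count)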
A second approach is numerical differentiation: approximate the limit in the directional-derivative definition directly. For small $\epsilon > 0$,
$$\nabla f(w)^T \Delta \approx \frac{f(w + \epsilon \Delta) - f(w)}{\epsilon}.$$It's often better to use the symmetric difference quotient instead:
$$\nabla f(w)^T \Delta \approx \frac{f(w + \epsilon \Delta) - f(w - \epsilon \Delta)}{2 \epsilon}.$$
import math

def f(x):
    return 2*x*x*x - 1

def dfdx(x):
    # exact derivative of f, for comparison
    return 6*x*x

def numerical_derivative(f, x, eps=1e-4):
    # symmetric difference quotient
    return (f(x+eps) - f(x-eps)) / (2*eps)

print(dfdx(2.0))
print(numerical_derivative(f, 2.0))
24.0
24.00000002002578
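As the outputs above suggest, the numerical answer is close but not exact. Here is a quick sketch (my addition, reusing the functions defined above) of how the error of the symmetric difference quotient depends on the choice of eps: when eps is too large, the approximation error of the quotient dominates, and when it is too small, floating-point roundoff dominates.
for eps in [1e-1, 1e-4, 1e-7, 1e-10, 1e-13]:
    # absolute error of the symmetric difference quotient at x = 2.0
    print(eps, abs(numerical_derivative(f, 2.0, eps) - dfdx(2.0)))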
The goal of automatic differentiation (AD): compute derivatives automatically, without the overheads of symbolic differentiation or the loss of precision of numerical differentiation.
Two rough classes of methods: forward-mode AD and reverse-mode AD (backpropagation).
import numpy as np

def to_dualnumber(x):
    if isinstance(x, DualNumber):
        return x
    elif isinstance(x, float):
        return DualNumber(x)
    elif isinstance(x, int):
        return DualNumber(float(x))
    else:
        raise Exception("couldn't convert {} to a dual number".format(x))
class DualNumber(object):
    # A dual number stores a value y together with dydx, the derivative of y
    # with respect to the input we are differentiating against.
    def __init__(self, y, dydx=0.0):
        super().__init__()
        self.y = y
        self.dydx = dydx

    def __repr__(self):
        return "(y = {}, dydx = {})".format(self.y, self.dydx)

    # operator overloading
    def __add__(self, other):
        other = to_dualnumber(other)
        return DualNumber(self.y + other.y, self.dydx + other.dydx)

    def __sub__(self, other):
        other = to_dualnumber(other)
        return DualNumber(self.y - other.y, self.dydx - other.dydx)

    def __mul__(self, other):
        other = to_dualnumber(other)
        # product rule
        return DualNumber(self.y * other.y, self.dydx * other.y + self.y * other.dydx)

    def __pow__(self, other):
        other = to_dualnumber(other)
        # d(u^v) = v * u^(v-1) * du + u^v * log(u) * dv
        return DualNumber(self.y ** other.y,
                          self.dydx * other.y * self.y ** (other.y - 1)
                          + other.dydx * (self.y ** other.y) * np.log(self.y))

    def __radd__(self, other):
        return to_dualnumber(other).__add__(self)

    def __rsub__(self, other):
        return to_dualnumber(other).__sub__(self)

    def __rmul__(self, other):
        return to_dualnumber(other).__mul__(self)
def exp(x):
    if isinstance(x, DualNumber):
        return DualNumber(np.exp(x.y), np.exp(x.y) * x.dydx)
    else:
        return np.exp(x)

def sin(x):
    if isinstance(x, DualNumber):
        return DualNumber(np.sin(x.y), np.cos(x.y) * x.dydx)
    else:
        return np.sin(x)

def forward_mode_diff(f, x):
    # seed the input with dydx = 1.0, i.e. d(x)/dx = 1
    x_dual = DualNumber(x, 1.0)
    return f(x_dual).dydx
print(dfdx(3.0))
print(numerical_derivative(f, 3.0))
print(forward_mode_diff(f, 3.0))
54.0
54.00000002012462
54.0
# def g(x):
#     return 12 + 0*x
# print(numerical_derivative(g, 2.0))
# print(forward_mode_diff(g, 2.0))
def h(x):
    return x**5.0

def dhdx(x):
    return 5*x**4
print(dhdx(2.0))
print(forward_mode_diff(h, 2.0))
80.0
80.0
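The exp and sin helpers above are never exercised by f or h. Here is a quick check (my addition, with hypothetical names u and dudx) that forward-mode differentiation handles them too, using $u(x) = \sin(x) \, e^x$, whose derivative is $u'(x) = \cos(x) \, e^x + \sin(x) \, e^x$.
def u(x):
    return sin(x) * exp(x)

def dudx(x):
    # hand-computed derivative, for comparison
    return np.cos(x) * np.exp(x) + np.sin(x) * np.exp(x)

print(dudx(1.0))
print(numerical_derivative(u, 1.0))
print(forward_mode_diff(u, 1.0))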
Problem: we can't just compute derivatives as we go, like we did with forward-mode AD.