Advantage: It is simple, and your problem stays convex and well-behaved (i.e., you can still use your original gradient descent code, just with the higher-dimensional representation).
Disadvantage: \(\phi(\mathbf{x})\) might be very high dimensional.
Consider the following example: \(\mathbf{x}=\begin{pmatrix}x_1\\ x_2\\ \vdots \\ x_d \end{pmatrix}\), and define \(\phi(\mathbf{x})=\begin{pmatrix}1\\ x_1\\ \vdots \\x_d \\ x_1x_2 \\ \vdots \\ x_{d-1}x_d\\ \vdots \\x_1x_2\cdots x_d \end{pmatrix}\).
Quiz: What is the dimensionality of \(\phi(\mathbf{x})\)?
This new representation, \(\phi(\mathbf{x})\), is very expressive and allows for complicated non-linear decision boundaries - but the dimensionality is extremely high. This makes our algorithm unbearably (and quickly prohibitively) slow.
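To see how quickly this representation blows up, here is a minimal sketch (the function name `phi_explicit` and the toy input are purely illustrative, not part of the notes) that builds \(\phi(\mathbf{x})\) explicitly by enumerating all products of distinct features:

```python
# Explicitly building phi(x): one coordinate per subset of {x_1, ..., x_d}
# (the empty subset gives the constant 1). The vector has 2^d entries,
# which is why this direct approach becomes hopeless even for moderate d.
from itertools import combinations
import numpy as np

def phi_explicit(x):
    """Return the monomial feature vector of x: all products of distinct entries."""
    d = len(x)
    return np.array([np.prod(x[list(idx)]) if idx else 1.0
                     for k in range(d + 1)
                     for idx in combinations(range(d), k)])

x = np.array([2.0, 3.0, 5.0])
print(phi_explicit(x))       # [ 1.  2.  3.  5.  6. 10. 15. 30.]
print(phi_explicit(x).size)  # 8 = 2^3 entries; for d = 100 this would be 2^100
```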
The kernel trick is a way to get around this dilemma by learning a function in the much higher dimensional space, without ever computing a single vector \(\phi(\mathbf{x})\) or ever computing the full vector \(\mathbf{w}\). It is a little magical.
It is based on the following observation: If we use gradient descent with any one of our standard loss functions, the gradient is a linear combination of the input samples. For example, let us take a look at the squared loss:
\begin{equation} \ell(\mathbf{w}) = \sum_{i=1}^n (\mathbf{w}^\top \mathbf{x}_i-y_i)^2\label{eq:c15:sql} \end{equation} The gradient descent rule, with step-size/learning-rate \(s>0\) (we denoted this as \(\alpha>0\) in our previous lectures), updates \(\mathbf{w}\) over time, \begin{equation} \mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - s\left(\frac{\partial \ell}{\partial \mathbf{w}}\right)\ \textrm{ where: } \frac{\partial \ell}{\partial \mathbf{w}}=\sum_{i=1}^n \underbrace{2(\mathbf{w}^\top \mathbf{x}_i-y_i)}_{\gamma_i\ :\ \textrm{function of \(\mathbf{x}_i, y_i\)}} \mathbf{x}_i = \sum_{i=1}^n\gamma_i \mathbf{x}_i \end{equation} We will now show that we can express \(\mathbf{w}\) as a linear combination of all input vectors, \begin{equation} \mathbf{w}=\sum_{i=1}^n \alpha_i {\mathbf{x}}_i.\label{eq:c15:alphas} \end{equation} Since the loss is convex, the final solution is independent of the initialization, and we can initialize \(\mathbf{w}_0\) to be whatever we want. For convenience, let us pick \(\mathbf{w}_0=\begin{pmatrix}0 \\ \vdots \\ 0\end{pmatrix}\). For this initial choice of \(\mathbf{w}_0\), the linear combination in \(\mathbf{w}=\sum_{i=1}^n \alpha_i {\mathbf{x}}_i\) is trivially \(\alpha_1=\dots=\alpha_n=0\). We now show that such coefficients \(\alpha_1,\dots,\alpha_n\) must exist throughout the entire gradient descent optimization, as we can re-write the gradient updates entirely in terms of updating the \(\alpha_i\) coefficients. (Formally, the argument is by induction: \(\mathbf{w}_0\) is trivially a linear combination of the training vectors (base case), and if \(\mathbf{w}_t=\sum_{i=1}^n \alpha_i\mathbf{x}_i\), then \(\mathbf{w}_{t+1}=\mathbf{w}_t-s\sum_{i=1}^n\gamma_i\mathbf{x}_i=\sum_{i=1}^n(\alpha_i-s\gamma_i)\mathbf{x}_i\) is a linear combination as well.)
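As a quick numerical sanity check (a sketch with randomly generated data, not part of the notes), we can verify that this gradient really is the linear combination \(\sum_{i=1}^n \gamma_i \mathbf{x}_i\):

```python
# Verify numerically that the squared-loss gradient equals sum_i gamma_i x_i
# with gamma_i = 2 (w^T x_i - y_i), matching the equation above.
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.normal(size=(n, d))   # rows of X are the training inputs x_i
y = rng.normal(size=n)
w = rng.normal(size=d)

grad_direct = 2 * X.T @ (X @ w - y)   # d/dw of sum_i (w^T x_i - y_i)^2
gamma = 2 * (X @ w - y)               # the scalars gamma_i
grad_as_combination = X.T @ gamma     # sum_i gamma_i x_i

print(np.allclose(grad_direct, grad_as_combination))  # True
```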
The update-rule for \(\alpha_i^t\) is thus \begin{equation} \alpha_i^t=\alpha_i^{t-1}-s\gamma_i^{t-1}, \textrm{ and we have } \alpha_i^t=-s\sum_{r=0}^{t-1}\gamma_i^{r}. \end{equation} In other words, we can perform the entire gradient descent update rule without ever expressing \(\mathbf{w}\) explicitly. We just keep track of the \(n\) coefficients \(\alpha_1,\dots,\alpha_n\). Now that \(\mathbf{w}\) can be written as a linear combination of the training set, we can also express the inner-product of \(\mathbf{w}\) with any input \({\mathbf{x}}_j\) purely in terms of inner-products between training inputs: \begin{equation} \mathbf{w}^\top {\mathbf{x}}_j=\sum_{i=1}^n \alpha_i {\mathbf{x}}_i^\top{\mathbf{x}}_j.\nonumber \end{equation} Consequently, we can also re-write the squared-loss from \(\ell(\mathbf{w}) = \sum_{i=1}^n (\mathbf{w}^\top \mathbf{x}_i-y_i)^2\) entirely in terms of inner-products between training inputs: \begin{equation} \ell(\boldsymbol{\alpha}) = \sum_{i=1}^n \left(\sum_{j=1}^n\alpha_j\mathbf{x}_j^\top \mathbf{x}_i-y_i\right)^2\label{eq:c15:sql:ip} \end{equation} At test-time we also only need these coefficients to make a prediction on a test-input \(\mathbf{x}_t\), and can write the entire classifier in terms of inner-products between the test point and training points: \begin{equation} h({\mathbf{x}}_t)=\mathbf{w}^\top {\mathbf{x}}_t=\sum_{j=1}^n\alpha_j{\mathbf{x}}_j^\top {\mathbf{x}}_t. \end{equation} Do you notice a theme? The only information we ever need in order to learn a hyper-plane classifier with the squared loss is inner-products between all pairs of data vectors.
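This equivalence is easy to check in code. The following sketch (random data and plain linear features for simplicity; all names are illustrative) runs the same gradient descent twice, once on \(\mathbf{w}\) and once on the coefficients \(\alpha_i\), and confirms that \(\mathbf{w}=\sum_i \alpha_i\mathbf{x}_i\) holds after every update and that both give identical predictions:

```python
# Track w explicitly and the alphas implicitly; the two stay consistent
# (w_t = sum_i alpha_i x_i) and yield the same predictions.
import numpy as np

rng = np.random.default_rng(1)
n, d, s, T = 6, 4, 0.01, 200
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w = np.zeros(d)       # w_0 = (0, ..., 0)^T
alpha = np.zeros(n)   # hence alpha_1 = ... = alpha_n = 0
for _ in range(T):
    gamma = 2 * (X @ w - y)      # gamma_i = 2 (w^T x_i - y_i)
    w = w - s * (X.T @ gamma)    # update on w:  w <- w - s * sum_i gamma_i x_i
    alpha = alpha - s * gamma    # the same update, expressed on the alphas

print(np.allclose(w, X.T @ alpha))     # True: w = sum_i alpha_i x_i
x_test = rng.normal(size=d)
pred_w = w @ x_test                    # w^T x_t
pred_alpha = alpha @ (X @ x_test)      # sum_j alpha_j x_j^T x_t
print(np.isclose(pred_w, pred_alpha))  # True
```

Only the \(n\) coefficients and inner-products with the training inputs are ever needed to make the prediction.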
Let's go back to the previous example, \(\phi(\mathbf{x})=\begin{pmatrix}1\\ x_1\\ \vdots \\x_d \\ x_1x_2 \\ \vdots \\ x_{d-1}x_d\\ \vdots \\x_1x_2\cdots x_d \end{pmatrix}\). The inner product \(\phi(\mathbf{x})^\top \phi(\mathbf{z})\) can be formulated as: \begin{equation} \phi(\mathbf{x})^\top \phi(\mathbf{z})=1\cdot 1+x_1z_1+x_2z_2+\cdots +x_1x_2z_1z_2+ \cdots +x_1\cdots x_dz_1\cdots z_d=\prod_{k=1}^d(1+x_kz_k)\text{.}\label{eq:c15:poly} \end{equation} The sum of \(2^d\) terms becomes the product of \(d\) terms. We can compute the inner-product from the above formula in time \(O(d)\) instead of \(O(2^d)\)! We define the function \begin{equation} \underbrace{\mathsf{k}(\mathbf{x}_i,\mathbf{x}_j)}_{\textrm{this is called the \textbf{kernel function}}}=\phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j). \end{equation} With a finite training set of \(n\) samples, inner products are often pre-computed and stored in a kernel matrix: \begin{equation} \mathsf{K}_{ij}=\phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j). \end{equation} If we store the matrix \(\mathsf{K}\), we only need to do simple inner-product look-ups and low-dimensional computations throughout the gradient descent algorithm. The final classifier becomes: \begin{equation} h(\mathbf{x}_t)=\sum_{j=1}^n\alpha_j\mathsf{k}(\mathbf{x}_j,\mathbf{x}_t). \end{equation}
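The identity \(\phi(\mathbf{x})^\top \phi(\mathbf{z})=\prod_{k=1}^d(1+x_kz_k)\) is easy to confirm numerically. Here is a small self-contained sketch (it redefines the illustrative `phi_explicit` map from the earlier snippet so it can run on its own):

```python
# phi(x)^T phi(z) computed two ways: via the explicit 2^d-dimensional vectors
# and via the O(d) product formula prod_k (1 + x_k z_k).
from itertools import combinations
import numpy as np

def phi_explicit(x):
    """Explicit 2^d-dimensional monomial feature vector (same map as before)."""
    d = len(x)
    return np.array([np.prod(x[list(idx)]) if idx else 1.0
                     for k in range(d + 1)
                     for idx in combinations(range(d), k)])

def k_monomial(x, z):
    """The kernel k(x, z) = phi(x)^T phi(z), evaluated with O(d) work."""
    return np.prod(1.0 + x * z)

x = np.array([2.0, 3.0, 5.0])
z = np.array([1.0, -1.0, 0.5])
print(phi_explicit(x) @ phi_explicit(z))  # -21.0, using all 2^d = 8 dimensions
print(k_monomial(x, z))                   # -21.0, using only O(d) work
```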
During training in the new high-dimensional space of \(\phi(\mathbf{x})\) we want to compute \(\gamma_i\) through kernels, without ever computing any \(\phi(\mathbf{x}_i)\) or even \(\mathbf{w}\). We previously established that \(\mathbf{w}=\sum_{j=1}^n\alpha_j \phi(\mathbf{x}_j)\) and \(\gamma_i=2(\mathbf{w}^\top \phi(\mathbf{x}_i)-y_i)\). It follows that \(\gamma_i=2\left(\sum_{j=1}^n \alpha_j\mathsf{K}_{ij}-y_i\right)\). The gradient update in iteration \(t+1\) becomes $$\alpha_i^{t+1}\leftarrow \alpha_i^t-2s\left(\sum_{j=1}^n \alpha_j^t\mathsf{K}_{ij}-y_i\right).$$ As we have \(n\) such updates to do, the amount of work per gradient update in the transformed space is \(O(n^2)\) --- far better than \(O(2^d)\).
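Putting the pieces together, here is a sketch of the full kernelized training loop (the function names, step size, and toy data are illustrative choices, not from the notes). It precomputes \(\mathsf{K}\) once and then only ever updates the \(n\) coefficients \(\alpha_i\):

```python
# Kernelized gradient descent on the squared loss: precompute the kernel matrix
# K once, then only ever touch the n coefficients alpha_i.
import numpy as np

def kernelized_gd(X, y, kernel, s=0.005, T=5000):
    """Gradient descent on the alphas: h(x) = sum_j alpha_j * kernel(x_j, x)."""
    n = X.shape[0]
    # Kernel matrix K_ij = k(x_i, x_j), computed once up front.
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    alpha = np.zeros(n)
    for _ in range(T):
        gamma = 2 * (K @ alpha - y)   # gamma_i = 2 (sum_j alpha_j K_ij - y_i)
        alpha = alpha - s * gamma     # O(n^2) work per sweep over all i
    return alpha

def h(x_t, alpha, X, kernel):
    """The final classifier h(x_t) = sum_j alpha_j k(x_j, x_t)."""
    return sum(a_j * kernel(x_j, x_t) for a_j, x_j in zip(alpha, X))

# Toy data; kernel is the product kernel k(x, z) = prod_k (1 + x_k z_k) from above.
rng = np.random.default_rng(2)
X = rng.uniform(-1.0, 1.0, size=(8, 3))
y = rng.normal(size=8)
kernel = lambda a, b: np.prod(1.0 + a * b)

alpha = kernelized_gd(X, y, kernel)
loss = sum((h(xi, alpha, X, kernel) - yi) ** 2 for xi, yi in zip(X, y))
print("training loss after GD:", loss)
print("training loss at alpha = 0:", np.sum(y ** 2))
```

With a suitably small step size the training loss should end up well below its starting value \(\sum_i y_i^2\) (the loss at \(\alpha_1=\dots=\alpha_n=0\)), and nothing \(2^d\)-dimensional is ever formed.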
Some of the most commonly used kernel functions are listed below (a sketch implementation follows the list).
Linear: \(\mathsf{K}(\mathbf{x},\mathbf{z})=\mathbf{x}^\top \mathbf{z}\).
(The linear kernel is equivalent to just using a good old linear classifier - but it can be faster to use a kernel matrix if the dimensionality \(d\) of the data is high.)
Polynomial: \(\mathsf{K}(\mathbf{x},\mathbf{z})=(1+\mathbf{x}^\top \mathbf{z})^d\).
Radial Basis Function (RBF) (aka Gaussian Kernel): \(\mathsf{K}(\mathbf{x},\mathbf{z})= e^\frac{-\|\mathbf{x}-\mathbf{z}\|^2}{\sigma^2}\).
The RBF kernel is the most popular kernel! It is a universal approximator! Its corresponding feature vector is infinite-dimensional and cannot be computed explicitly. However, very effective low-dimensional approximations exist (see this paper).
Exponential Kernel: \(\mathsf{K}(\mathbf{x},\mathbf{z})= e^\frac{-\| \mathbf{x}-\mathbf{z}\|}{2\sigma^2}\)
Laplacian Kernel: \(\mathsf{K}(\mathbf{x},\mathbf{z})= e^\frac{-\| \mathbf{x}-\mathbf{z}\|}{\sigma}\)
Sigmoid Kernel: \(\mathsf{K}(\mathbf{x},\mathbf{z})=\tanh(a\,\mathbf{x}^\top \mathbf{z} + c)\)
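For concreteness, here is a sketch of how these kernels might be implemented for vector inputs (the parameter names and defaults, e.g. `degree`, `sigma`, `a`, `c`, are arbitrary illustrative choices, not values from the notes):

```python
# Straightforward implementations of the kernels listed above.
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, degree=3):
    return (1.0 + x @ z) ** degree

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def exponential_kernel(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) / (2.0 * sigma ** 2))

def laplacian_kernel(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) / sigma)

def sigmoid_kernel(x, z, a=1.0, c=0.0):
    return np.tanh(a * (x @ z) + c)

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(rbf_kernel(x, z), polynomial_kernel(x, z, degree=2))
```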
Can any function \(\mathsf{K}(\cdot,\cdot)\rightarrow\mathbb{R}\) be used as a kernel?
No, the matrix \(\mathsf{K}(\mathbf{x}_i,\mathbf{x}_j)\) has to correspond to real inner-products after some transformation \({\mathbf{x}}\rightarrow \phi({\mathbf{x}})\). This is the case if and only if \(\mathsf{K}\) is positive semi-definite. (Remember \(\mathsf{K}_{ij}=\phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)\), so \(\mathsf{K}=\Phi^\top\Phi\), where \(\Phi=[\phi(\mathbf{x}_1),\dots,\phi(\mathbf{x}_n)]\). It follows that \(\mathsf{K}\) is p.s.d., because \(\mathbf{q}^\top\mathsf{K}\mathbf{q}=(\Phi\mathbf{q})^\top(\Phi\mathbf{q})=\|\Phi\mathbf{q}\|^2\geq 0\) for any \(\mathbf{q}\). Conversely, any p.s.d. matrix \(\mathbf{A}\) can be decomposed as \(\mathbf{A}=\Phi^\top\Phi\) for some realization of \(\Phi\).)
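As a quick sanity check (a sketch with randomly generated points, not from the notes), one can verify numerically that, for example, an RBF kernel matrix is p.s.d. by inspecting its eigenvalues:

```python
# Build an RBF kernel matrix on random points and confirm its eigenvalues are
# (numerically) non-negative; tiny negative values can appear from round-off.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 2))
sigma = 1.0

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / sigma ** 2)   # K_ij = exp(-||x_i - x_j||^2 / sigma^2)

eigvals = np.linalg.eigvalsh(K)      # real eigenvalues of the symmetric K
print(eigvals.min() >= -1e-10)       # True: K is positive semi-definite

q = rng.normal(size=10)
print(q @ K @ q >= -1e-10)           # and q^T K q >= 0 for any q
```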
You can even define kernels over sets, strings, graphs and molecules.
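For example, one simple (illustrative, not from the notes) kernel on finite sets is the intersection kernel \(\mathsf{k}(A,B)=|A\cap B|\); it is a valid kernel because it equals the inner product of the sets' indicator vectors:

```python
# A kernel on sets: k(A, B) = |A intersect B|. This equals phi(A)^T phi(B),
# where phi maps a set to its 0/1 indicator vector over the universe of items,
# so the resulting kernel matrix is positive semi-definite.
def set_kernel(A, B):
    return len(A & B)

docs = [{"kernel", "trick", "svm"},
        {"kernel", "matrix"},
        {"graph", "molecule"}]
K = [[set_kernel(a, b) for b in docs] for a in docs]
print(K)  # [[3, 1, 0], [1, 2, 0], [0, 0, 2]]
```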
Figure 1: The demo shows how a kernel function can solve a problem that linear classifiers cannot solve. In this case, the RBF kernel produces a good decision boundary.