Assumption: Nearby inputs have similar outputs.
Hypothesis: Given a test input $x$, output the most common label among its $k$ most-similar training inputs.
What should we do in case of a tie?
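The hypothesis above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `knn_predict`, the use of Euclidean distance via `math.dist`, and the tie-breaking rule (prefer the tied label whose closest neighbor is nearest) are all assumptions made for the example.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, x, k=3):
    """Predict a label for x by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    neighbors = [train_y[i] for i in order[:k]]
    counts = Counter(neighbors)
    best = max(counts.values())
    # Tie-break (an assumption): neighbors are ordered nearest-first, so the
    # first label that reaches the top count is the tied label whose closest
    # occurrence is nearest to x.
    for label in neighbors:
        if counts[label] == best:
            return label

train_x = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0)]
train_y = ["a", "a", "a", "b", "b"]
print(knn_predict(train_x, train_y, (0.2, 0.1), k=3))
```

Other tie-breaking choices are equally defensible, for example picking uniformly at random among the tied labels or falling back to a smaller $k$.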
Most common distance: the Euclidean distance $$\operatorname{dist}(u,v) = \| u - v \|_2 = \sqrt{ \sum_{i=1}^d (u_i - v_i)^2 }$$
Also popular: the taxicab distance (a.k.a. Manhattan distance) $$\operatorname{dist}(u,v) = \| u - v \|_1 = \sum_{i=1}^d |u_i - v_i|$$
More generally, for a parameter $p \ge 1$, the $\ell_p$ (Minkowski) distance is
$$\operatorname{dist}(u,v) = \| u - v \|_p = \left( \sum_{i=1}^d |u_i - v_i|^p \right)^{1/p}$$
This generalizes many other norms, including the popular $\ell_2$ (Euclidean), $\ell_1$ (taxicab), and $\ell_{\infty}$ (max norm, obtained in the limit $p \rightarrow \infty$).
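As a small sketch, the $\ell_p$ distance can be written directly from the formula. The function name `minkowski_dist` is hypothetical, and treating `p = math.inf` as the max norm is a convention chosen for this example.

```python
import math

def minkowski_dist(u, v, p=2.0):
    """l_p distance between equal-length sequences u and v, for p >= 1 or p = math.inf."""
    if len(u) != len(v):
        raise ValueError("u and v must have the same length")
    diffs = [abs(a - b) for a, b in zip(u, v)]
    if math.isinf(p):
        # Max norm: the largest coordinate-wise difference.
        return max(diffs, default=0.0)
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(d ** p for d in diffs) ** (1.0 / p)

u, v = (0.0, 0.0), (3.0, 4.0)
for p in (1, 2, math.inf):
    print(p, minkowski_dist(u, v, p))
```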
The Bayes Optimal Classifier is the hypothesis
$$h_{\operatorname{opt}}(x) = \arg \max_{y \in \mathcal{Y}} \; \mathcal{P}(y | x) = \arg \max_{y \in \mathcal{Y}} \; \mathcal{P}(x, y).$$
Here $\mathcal{P}(x | y)$ is the conditional probability of damage $x$ given label $y$.
Joint density: $\mathcal{P}(x,y) = \mathcal{P}(x | y) \mathcal{P}(y) = \mathcal{P}(x | y) \cdot \frac{1}{2}$

| Damage | Wolf (1d6+1) | Werewolf (2d4) |
|---|---|---|
| 2 | 1/12 | 1/32 |
| 3 | 1/12 | 1/16 |
| 4 | 1/12 | 3/32 |
| 5 | 1/12 | 1/8 |
| 6 | 1/12 | 3/32 |
| 7 | 1/12 | 1/16 |
| 8 | 0 | 1/32 |

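The joint probabilities in this table can be reproduced by enumerating the dice rolls exactly. The sketch below assumes fair dice and the prior $\mathcal{P}(y) = \frac{1}{2}$ stated above; the helper name `damage_distribution` is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def damage_distribution(num_dice, sides, bonus=0):
    """Exact distribution of the sum of num_dice fair dice with the given sides, plus bonus."""
    rolls = product(range(1, sides + 1), repeat=num_dice)
    counts = Counter(sum(roll) + bonus for roll in rolls)
    total = sides ** num_dice
    return {dmg: Fraction(c, total) for dmg, c in counts.items()}

prior = Fraction(1, 2)                      # each monster is equally likely
wolf = damage_distribution(1, 6, bonus=1)   # 1d6+1
werewolf = damage_distribution(2, 4)        # 2d4

# joint[dmg] = (P(dmg, Wolf), P(dmg, Werewolf))
joint = {
    dmg: (prior * wolf.get(dmg, 0), prior * werewolf.get(dmg, 0))
    for dmg in range(2, 9)
}
for dmg, (p_wolf, p_were) in joint.items():
    print(dmg, p_wolf, p_were)
```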
Always guess the label with the highest probability.

| Damage | Wolf (1d6+1) | Werewolf (2d4) | Prediction |
|---|---|---|---|
| 2 | 1/12 | 1/32 | Wolf |
| 3 | 1/12 | 1/16 | Wolf |
| 4 | 1/12 | 3/32 | Werewolf |
| 5 | 1/12 | 1/8 | Werewolf |
| 6 | 1/12 | 3/32 | Werewolf |
| 7 | 1/12 | 1/16 | Wolf |
| 8 | 0 | 1/32 | Werewolf |

Does this being "optimal" mean we get it right all the time?
We get it wrong when the true label disagrees with our prediction.
$\operatorname{error} = \frac{1}{32} + \frac{1}{16} + \frac{1}{12} + \frac{1}{12} + \frac{1}{12} + \frac{1}{16} + 0 = \frac{13}{32} \approx 41\%.$
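The predictions and the error above can be checked with exact arithmetic. This is a sketch that hard-codes the joint table from this section; the variable names are illustrative only.

```python
from fractions import Fraction as F

# Joint probabilities P(damage, label), copied from the table in this section.
joint = {
    2: {"Wolf": F(1, 12), "Werewolf": F(1, 32)},
    3: {"Wolf": F(1, 12), "Werewolf": F(1, 16)},
    4: {"Wolf": F(1, 12), "Werewolf": F(3, 32)},
    5: {"Wolf": F(1, 12), "Werewolf": F(1, 8)},
    6: {"Wolf": F(1, 12), "Werewolf": F(3, 32)},
    7: {"Wolf": F(1, 12), "Werewolf": F(1, 16)},
    8: {"Wolf": F(0), "Werewolf": F(1, 32)},
}

# Bayes Optimal Classifier: for each damage value, pick the label with the larger joint mass.
prediction = {dmg: max(row, key=row.get) for dmg, row in joint.items()}

# Error: total probability mass of the (damage, label) pairs we do not predict.
bayes_error = sum(
    p
    for dmg, row in joint.items()
    for label, p in row.items()
    if label != prediction[dmg]
)
print(prediction)
print(bayes_error, float(bayes_error))
```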
Another important baseline is the Best Constant Predictor.
$$h(x) = \arg \max_{y \in \mathcal{Y}} \; \mathcal{P}(y).$$
We can bound the error of 1-NN relative to the Bayes Optimal Classifier.
Suppose that $(\mathcal{X}, \operatorname{dist})$ is a separable metric space.
Let $x_{\text{test}}$ and $x_1, x_2, \ldots$ be independent identically distributed random variables over $\mathcal{X}$. Then almost surely (i.e. with probability $1$)
$$\lim_{n \rightarrow \infty} \; \arg \min_{x \in \{x_1, \ldots, x_n\}} \operatorname{dist}(x, x_{\text{test}}) = x_{\text{test}}.$$
Consider the case where every ball of radius $r > 0$ centered around $x_{\text{test}}$ has positive probability.
Then, no matter how close the current nearest neighbor is to $x_{\text{test}}$, every time we draw a fresh sample $x_i$ from the source distribution, with some positive probability it will be closer than the nearest neighbor currently in the training set.
This implies that the distance diminishes to $0$ with probability $1$.
Now consider the other case, where there is some ball of radius $r > 0$ centered around $x_{\text{test}}$ that has probability zero in the source distribution.
But this must happen with zero probability in the random selection of $x_{\text{test}}$.
Why? Let $Z$ be the set of all points in $\mathcal{X}$ that have the property that they are the center of some ball with zero probability. Then because $\mathcal{X}$ is separable, we can cover $Z$ with the union of a countable number of balls with zero probability. So $Z$ itself must have zero probability.
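The lemma can be illustrated empirically. The sketch below is a simulation under assumptions that are not part of the lemma itself: points are drawn uniformly from the unit square, the distance is Euclidean, and the function names and trial counts are arbitrary choices for the example.

```python
import math
import random

def nearest_neighbor_distance(n, dim, rng):
    """Distance from a random test point to its nearest neighbor among n random training points."""
    x_test = [rng.random() for _ in range(dim)]
    return min(
        math.dist(x_test, [rng.random() for _ in range(dim)])
        for _ in range(n)
    )

def mean_nn_distance(n, trials=200, dim=2, seed=0):
    """Average nearest-neighbor distance over independent trials."""
    rng = random.Random(seed)
    return sum(nearest_neighbor_distance(n, dim, rng) for _ in range(trials)) / trials

for n in (10, 100, 1000):
    print(n, round(mean_nn_distance(n), 4))
```

The printed averages should shrink as $n$ grows, consistent with the nearest neighbor converging to $x_{\text{test}}$.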
Let $x_{\text{test}}$ denote a test point randomly drawn from $\mathcal{P}$. Let $\hat x_n$ (also a random variable) denote the nearest neighbor to $x_{\text{test}}$ in an independent training dataset of size $n$.
The expected error of the 1-NN classifier is $$ \operatorname{error}_{\text{1-NN}} = \mathbf{E}\left[ \sum_{y \in \mathcal{Y}} \mathcal{P}(y | \hat x_n) \left(1 - \mathcal{P}(y | x_{\text{test}})\right) \right]. $$ This is the sum over all labels $y$ of the probability that the prediction will be $y$ but the true label will not be $y$.
Taking the limit as $n$ approaches infinity, and using the fact that $\hat x_n \rightarrow x_{\text{test}}$ almost surely (assuming $\mathcal{P}(y | x)$ is continuous in $x$), the expected error is
\begin{align*}
\lim_{n \rightarrow \infty} \; \operatorname{error}_{\text{1-NN}} &= \lim_{n \rightarrow \infty} \; \mathbf{E}\left[ \sum_{y \in \mathcal{Y}} \mathcal{P}(y | \hat x_n) \left(1 - \mathcal{P}(y | x_{\text{test}})\right) \right] \\
&= \mathbf{E}\left[ \sum_{y \in \mathcal{Y}} \mathcal{P}(y | x_{\text{test}}) \left(1 - \mathcal{P}(y | x_{\text{test}})\right) \right]
\end{align*}
Let $\hat y$ denote the prediction of the Bayes Optimal Classifier on $x_{\text{test}}$.
\begin{align*}
\lim_{n \rightarrow \infty} \; \operatorname{error}_{\text{1-NN}} &= \mathbf{E}\left[ \mathcal{P}(\hat y | x_{\text{test}}) \left(1 - \mathcal{P}(\hat y | x_{\text{test}})\right) \right] \\
&\hspace{2em}+ \mathbf{E}\left[ \sum_{y \ne \hat y} \mathcal{P}(y | x_{\text{test}}) \left(1 - \mathcal{P}(y | x_{\text{test}})\right) \right] \\
&\le \mathbf{E}\left[ 1 \cdot \left( 1 - \mathcal{P}(\hat y | x_{\text{test}}) \right) \right] + \mathbf{E}\left[ \sum_{y \ne \hat y} 1 \cdot \mathcal{P}(y | x_{\text{test}}) \right] \\
&= 2 \mathbf{E}\left[ 1 - \mathcal{P}(\hat y | x_{\text{test}})\right] \\
&= 2 \operatorname{error}_{\text{Bayes}}.
\end{align*}
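As a sanity check, the bound can be evaluated exactly on the wolf/werewolf example from earlier in this section. This sketch assumes the limiting formula applies directly there, which is plausible because the input space is finite, so the nearest neighbor eventually has the same damage value as the test input. The joint table is hard-coded and the variable names are illustrative.

```python
from fractions import Fraction as F

# joint[dmg] = (P(dmg, Wolf), P(dmg, Werewolf)), copied from the earlier table.
joint = {
    2: (F(1, 12), F(1, 32)),
    3: (F(1, 12), F(1, 16)),
    4: (F(1, 12), F(3, 32)),
    5: (F(1, 12), F(1, 8)),
    6: (F(1, 12), F(3, 32)),
    7: (F(1, 12), F(1, 16)),
    8: (F(0), F(1, 32)),
}

bayes_error = F(0)
nn_error = F(0)
for p_wolf, p_were in joint.values():
    p_x = p_wolf + p_were                           # marginal P(x)
    posteriors = [p_wolf / p_x, p_were / p_x]       # P(y | x) for each label
    bayes_error += p_x * (1 - max(posteriors))      # E[1 - P(y_hat | x)]
    nn_error += p_x * sum(q * (1 - q) for q in posteriors)  # E[sum_y P(y|x)(1 - P(y|x))]

print(float(bayes_error), float(nn_error), float(2 * bayes_error))
```

The limiting 1-NN error should land between the Bayes error and twice the Bayes error.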
k-NN works by reasoning about how close together points are.
In high-dimensional spaces, points drawn from a distribution tend not to be close together.
First, let's look at some random points and a line in the unit square.
Consider what happens when we move to three dimensions. The points move further away from each other but stay equally close to the red hyperplane.
The typical pairwise distance between two points drawn from a unit cube (or sphere, or from a unit Gaussian) increases with dimension.
In comparison, the distance to a hyperplane does not increase.
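A quick simulation makes the contrast concrete. This is a sketch under stated assumptions: points are uniform in the unit cube $[0,1]^d$, and the hyperplane is the axis-aligned one $x_1 = 0.5$, a hypothetical choice that keeps the distance formula trivial. The function names and trial counts are arbitrary.

```python
import math
import random

def mean_pairwise_distance(dim, trials=2000, seed=0):
    """Average Euclidean distance between two uniform random points in the unit cube."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        u = [rng.random() for _ in range(dim)]
        v = [rng.random() for _ in range(dim)]
        total += math.dist(u, v)
    return total / trials

def mean_hyperplane_distance(dim, trials=2000, seed=0):
    """Average distance from a uniform random point to the hyperplane x_1 = 0.5."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        u = [rng.random() for _ in range(dim)]
        # For an axis-aligned hyperplane only the first coordinate matters.
        total += abs(u[0] - 0.5)
    return total / trials

for dim in (2, 10, 100):
    print(dim, round(mean_pairwise_distance(dim), 3), round(mean_hyperplane_distance(dim), 3))
```

The pairwise column should grow with the dimension while the hyperplane column stays near $0.25$.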
If the data lies in a low-dimensional submanifold, then we can still use low-dimensional methods even in higher dimensions.
